CN115587713A - Marine ranch disaster decision method based on reinforcement learning

Marine ranch disaster decision method based on reinforcement learning

Info

Publication number: CN115587713A
Application number: CN202211386315.5A
Authority: CN (China)
Legal status: Pending
Prior art keywords: pasture, disaster, data, module, state
Other languages: Chinese (zh)
Inventors: 张大海, 夏梅娟, 宋革联
Original Assignee: Zhejiang University (ZJU)
Current Assignee: Zhejiang University (ZJU)
Application filed by Zhejiang University (ZJU); priority to CN202211386315.5A
Classifications

    • G06Q10/06312: Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • G06Q10/0635: Risk analysis of enterprise or organisation activities
    • G06Q50/02: Agriculture; Fishing; Forestry; Mining
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Operations Research (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Educational Administration (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Agronomy & Crop Science (AREA)
  • Animal Husbandry (AREA)
  • Marine Sciences & Fisheries (AREA)
  • Mining & Mineral Resources (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a marine ranch disaster decision method based on reinforcement learning. The method comprises the following steps: an interactive environment module constructs a virtual pasture sea area of the marine ranch; a disaster judgment module judges whether a disaster occurs in the marine ranch, and an action space module applies a preset post-disaster action to the virtual pasture sea area and outputs a feedback result; a decision module outputs preliminary decision data; the disaster judgment module judges whether the marine ranch disaster has ended and outputs a judgment result; a reward updating module calculates a reward value; after correction, the data are input in turn into a parameter optimization module and the decision module for updating and optimization; the above steps are repeated to obtain a trained disaster judgment module and a trained decision module; when the trained disaster judgment module judges that a disaster has occurred in the marine ranch, the trained decision module outputs monitoring decision data, and a decision is made for the stricken marine ranch according to the monitoring decision data. The method improves the accuracy and flexibility of marine ranch disaster decision-making, addresses problems such as lagging risk and disaster decision technology, and improves management and control efficiency.

Description

Marine ranch disaster decision method based on reinforcement learning
Technical Field
The invention relates to a marine ranch disaster decision method, in particular to a marine ranch disaster decision method based on reinforcement learning.
Background
In the field of decision-making research on marine environmental disasters, the analytic hierarchy process, a multi-objective decision analysis method combining quantitative and qualitative analysis, is widely applied. Its principle is to divide the problem into layers, classify and decompose the relevant factors to form a multi-layer structural model, and assign weights to the factors layer by layer. The analytic hierarchy process simplifies the problem and decomposes it so that it becomes hierarchical and quantitative, making analysis and treatment easier. However, in marine ranch disaster decision scenarios involving complex sea conditions, the analytic hierarchy process is severely limited, because the marine environment contains physical factors and laws that cannot be accurately quantified and layered.
One of the main research goals in the field of artificial intelligence is to realize fully autonomous agents. Such an agent can interact with the environment in which it is located, learn the best behavior from environmental feedback, and continuously improve its action strategy through repeated trials. The advent of Deep Reinforcement Learning (DRL) provides a theoretical basis for achieving this goal. As an important branch of artificial intelligence research, it is considered a key to realizing human-like intelligence and has received wide attention from both academia and industry.
DRL is an end-to-end perception and control system with strong generality. Its learning process can be described as follows: at each moment the agent interacts with the environment to obtain a high-dimensional observation, and perceives that observation with a deep learning (DL) method to obtain a specific state feature representation; the value function of each action is evaluated based on the expected return, and the current state is mapped to a corresponding action through a certain policy; the environment reacts to this action and yields the next observation.
By continuously cycling through the above process, the optimal strategy for achieving the goal can finally be obtained. On the one hand, DRL has strong representation capability for policies and states and can be used to model complex decision processes; on the other hand, reinforcement learning endows the agent with self-supervised learning ability, so that it can interact with the environment autonomously and keep improving through trial and error. However, DRL has not yet been applied in the construction of marine ranches.
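A minimal sketch of the perceive-decide-feedback loop described above. The `Env` and `Agent` classes and their toy dynamics are illustrative placeholders and are not part of the patented modules.

```python
# Hypothetical sketch of the DRL interaction loop; Env and Agent are toy placeholders.
import random

class Env:
    def reset(self):
        return [0.0, 0.0]                                  # initial high-dimensional observation (toy)
    def step(self, action):
        next_obs = [random.random(), random.random()]
        reward = 1.0 if max(next_obs) < 0.5 else -1.0      # toy feedback from the environment
        done = reward > 0
        return next_obs, reward, done

class Agent:
    def act(self, obs):
        return random.choice([0, 1])                       # map state features to an action (toy policy)
    def learn(self, obs, action, reward, next_obs, done):
        pass                                               # update value estimates from the feedback

env, agent = Env(), Agent()
obs = env.reset()
for t in range(100):                                       # repeat the loop until the goal is reached
    action = agent.act(obs)
    next_obs, reward, done = env.step(action)
    agent.learn(obs, action, reward, next_obs, done)
    obs = next_obs
    if done:
        break
```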
Disclosure of Invention
In order to solve the problems in the background art, the invention provides a marine ranch disaster decision method based on reinforcement learning, and a marine ranch disaster decision algorithm based on reinforcement learning, so as to overcome the defects of low efficiency, poor flexibility, weak linkage and the like of dynamic decision planning of marine disasters related to marine ranches in the prior art.
The technical scheme adopted by the invention is as follows:
the marine ranch disaster decision method comprises the following steps:
Step one: obtaining historical pasture state data of the marine ranch before the current moment, inputting the historical pasture state data into a data processing module for data preprocessing, and obtaining historical pasture preprocessing state data; the historical pasture preprocessing state data are input into an interactive environment module, and the interactive environment module constructs a virtual pasture sea area of the marine ranch; that is, the preprocessed historical pasture data are taken as input, and a sea area ecological simulation evaluation model based on an artificial neural network is constructed as the main body of the interactive environment module.
Step two: inputting the historical pasture state preprocessing data into a disaster judgment module, which judges whether a disaster occurs in the marine ranch; when a disaster occurs in the marine ranch, a preset post-disaster action is input into the interactive environment module through an action space module and applied to the virtual pasture sea area, and the interactive environment module outputs the feedback result produced by the virtual pasture sea area. The acquired pasture state data are all lag data, i.e., the acquired historical pasture data at the moment before the current moment are actually the historical pasture data of the marine ranch N hours before the current moment; in other words, the marine disaster actually affecting the marine ranch during those N hours is judged from the historical pasture data at the moment before the current moment.
Step three: the method comprises the steps of obtaining real-time pasture state data of a marine pasture, inputting the real-time pasture state data into a data processing module for data preprocessing, obtaining real-time pasture preprocessing state data, inputting the real-time pasture preprocessing state data into a decision module, and outputting preliminary decision data by the decision module.
Step four: inputting the preliminary decision data into an interactive environment module, and outputting a predicted state value and a state variable quantity of a virtual pasture sea area by the interactive environment module; the method comprises the steps that historical pasture state preprocessing data, feedback results generated by a virtual pasture sea area, a prediction state value and state variation are input into a disaster judgment module, and the disaster judgment module judges whether a disaster of a marine pasture is finished or not so as to output a judgment result.
In specific implementation, the disaster judgment module summarizes the historical pasture state preprocessing data, the feedback result generated by the virtual pasture sea area, the prediction state value and the state variation into a parameter disaster-causing correlation formula by combining the early warning condition and the threshold value, and is used for judging whether the current pasture sea area and the virtual sea area environment are in a risk disaster state, namely judging whether the disaster is ended; specifically, it may be determined whether the predicted state value is still in the risk interval.
Step five: and inputting the judgment result output by the disaster judgment module, the predicted state value and the state variable quantity of the virtual pasture sea area into the reward updating module, and calculating the current reward value by the reward updating module.
Step six: correcting the judgment result output by the disaster judgment module and the prediction state value of the virtual pasture sea area according to the real-time pasture preprocessing state data; and inputting the corrected judgment result, the predicted state value, the preliminary decision data, the state variation of the marine ranch and the environmental estimation error into a parameter optimization module for processing, and inputting the processed output into a decision module for updating and optimizing.
During correction, the predicted state value of the sea area of the virtual pasture is corrected into real-time pasture preprocessing state data, meanwhile, the real-time state of the marine pasture is determined, and the judgment result output by the disaster judgment module is corrected into the real-time state of the marine pasture.
Step seven: repeating steps one to six to repeatedly train the disaster judgment module and the decision module until the reward value calculated by the reward updating module converges to its maximum, at which point training of the disaster judgment module and the decision module stops and the trained disaster judgment module and trained decision module are obtained; the number of training rounds required is also reduced on the basis of parameter optimization.
Step eight: the method comprises the steps of obtaining pasture monitoring state data of the marine ranch in real time, inputting the pasture monitoring state data into a data processing module for data preprocessing, obtaining pasture preprocessing monitoring state data, inputting the pasture preprocessing monitoring state data into a disaster judgment module which is trained, inputting the pasture preprocessing monitoring state data into a decision module which is trained when the disaster judgment module judges that the marine ranch has a disaster, outputting monitoring decision data after processing, and making a decision on the marine ranch which has the disaster according to the monitoring decision data.
The historical pasture state data and the real-time pasture state data of the marine pasture comprise sea area multi-parameter sensor data, turbidity sensor data, flow rate data, ecological simulation forecast data and the like.
The sea area multi-parameter sensor data comprise data such as serial number, date, time, conductivity, chlorophyll, pH value, dissolved oxygen and sound velocity; the turbidity sensor data comprise turbidity data; the flow velocity data comprise data such as layer number, depth, raw flow velocity data, flow velocity in the x direction, flow velocity in the y direction, flow velocity in the z direction and resultant flow velocity direction; the ecological simulation forecast data comprise time, longitude, latitude, depth, water level, salinity, water temperature, eastward flow velocity, northward flow velocity and the like.
The historical pasture state data are input into the data processing module for data preprocessing to obtain historical pasture preprocessing state data. Specifically, the sea area multi-parameter sensor data, turbidity sensor data, flow velocity data, ecological simulation forecast data and the like in the historical pasture state data are respectively input into the data processing module and sequentially subjected to missing-value filling, random sampling, serialization and the like; in specific implementation, missing-value filling is applied to data with gaps. Each group of pasture state data is compressed according to the scale of the data set, the scale being judged mainly by the depth values and the number of data records; small data sets are processed by vertical averaging, large data sets are compressed with a VAE (variational autoencoder) model, and the processed outputs are combined to construct the historical pasture preprocessing state data.
In step one, the historical pasture state preprocessing data are input into the interactive environment module, and the interactive environment module constructs a virtual pasture sea area of the marine ranch. Specifically, the interactive environment module constructs the virtual pasture sea area from the historical pasture state preprocessing data, the deployment layout of the equipment placed in the marine ranch, the two-dimensional shallow-water equations of the sea area where the marine ranch is located, and an embedded second-order-moment turbulence closure submodel. The virtual pasture sea area can predict the environmental data of the next decision moment from different environmental data.
In step two, the disasters of the marine ranch specifically include meteorological disasters, hydrological disasters and geological disasters. The historical pasture state preprocessing data are input into the disaster judgment module, which judges whether a disaster has occurred in the marine ranch. Specifically, the disaster judgment module judges from the historical pasture state preprocessing data whether the marine ranch meets the early warning conditions for the occurrence of a meteorological, hydrological or geological disaster, and if so, judges that the marine ranch is in a meteorological, hydrological or geological disaster.
Each disaster included in the meteorological, hydrological and geological disasters has specific early warning conditions, namely the early warning interval standards corresponding to the national standards. For example, for storm surge, which belongs to the hydrological disasters, when the over-warning tide level, the wind speed and the one-third wave height H1/3 in the historical pasture state preprocessing data exceed the early warning conditions, the marine ranch is judged to be in a storm surge disaster.
In step two, the action space module contains a plurality of preset post-disaster actions, each of which corresponds to a meta-action taken to regulate a parameter value that exceeds its early warning value; the parameter value exceeding the early warning value is one of the parameter values in the historical pasture state data of the marine ranch, that is, one of the parameter values contained in the sea area multi-parameter sensor data, turbidity sensor data, flow velocity data, ecological simulation forecast data and the like, such as wind speed or tide level; the preset post-disaster actions include the start-stop time, start-stop duration, moving direction, moving speed and the like of the equipment that measures the parameter value exceeding the early warning value.
A preset post-disaster action must be an operation that the IoT equipment or data acquisition equipment can support, and when an action is taken, the overall action efficiency is attenuated according to the operating lag of the equipment involved. When several parameters are abnormal, the disaster type is also judged. A single decision step selects only one decision action from the action space; with a certain probability p the action is random (exploratory), and with the remaining probability 1-p it is the action with the maximum reward. A decision round comprises all actions of one or more decision steps, forming an action sequence, which can also be called a decision scheme.
The interactive environment module outputs a feedback result generated by the virtual pasture sea area, and specifically, the feedback result is pasture state data after the virtual pasture sea area takes preset post-disaster actions.
In step three, the decision module is a deep Q network (DQN). The DQN specifically adopts a dual-memory LSTM model, which comprises a short-term memory network and a long-term memory network connected in sequence. The short-term memory network consists of two components: a deep Q network for learning the current task and an experience replay containing only the current task data. The long-term memory network also consists of two components: a deep Q network containing the knowledge learned from all tasks so far, and a generative adversarial network used to generate representations of the experience of these reinforcement learning tasks. The decision module is constructed as a DQN (Deep Q-learning Network) based on the Q-learning algorithm of deep learning; offline training is carried out with an off-policy strategy, a neural network approximates the value function, and the network is trained using a target network and experience replay.
In the fourth step, the preliminary decision data is specifically an action sequence formed by one or more preset post-disaster actions in the action space module, the preliminary decision data is input into the interactive environment module, the interactive environment module outputs a predicted state value and a state variation of the virtual pasture sea area after the action sequence is taken, the predicted state value of the virtual pasture sea area is specifically pasture state data of the virtual pasture sea area after the action sequence is taken, and the state variation of the virtual pasture sea area is the variation of the pasture state data of the virtual pasture sea area before and after the action sequence is taken.
In the actual processing, the interactive environment module can only receive sea area state parameters as input, so the action sequence in the preliminary decision data needs to be converted into an increase and decrease sequence of parameter values exceeding the early warning value of the pasture sea area state data at the last decision stage, and then the increase and decrease sequence is input into the interactive environment module, and meanwhile, the parameter change rate difference corresponding to the action in different disaster scenes needs to be considered.
When a meteorological, hydrological or geological disaster occurs in the marine ranch, several parameter values in the real-time pasture state data of the marine ranch exceed their early warning values, that is, several parameter values in the (slightly) delayed data exceed the early warning values, and a preset post-disaster action for regulating each of these over-warning parameter values must then be taken, forming the preliminary decision data. For example, in a storm surge disaster scene, the parameter values exceeding the early warning values are wind speed, tide level, wave height, flow velocity and the like; in this case the preset post-disaster actions include the start-stop, moving direction and moving speed of the devices that measure wind speed, tide level, wave height and flow velocity, and these actions together form the preliminary decision data.
In step four, the disaster judgment module judges whether the disaster of the marine ranch has ended and outputs the judgment result: when none of the previously over-warning parameter values in the predicted state values of the virtual pasture sea area still exceeds its early warning value, the disaster of the marine ranch is judged to have ended; when one or more of them still exceed their early warning values, the disaster is judged not to have ended.
In step five, the judgment result output by the disaster judgment module is input into the reward updating module, and the reward updating module calculates the current reward value. Each time the disaster judgment module outputs a judgment result, one decision step time is consumed; when the disaster judgment module judges that the disaster of the marine ranch has not ended, a negative feedback value is given according to the current decision step time, and when the disaster judgment module judges that the disaster has ended, a positive feedback value is given according to the disaster type of the marine ranch. When the total decision step time consumed in the training process exceeds the response duration, the training round is judged to be finished, and the decision model continues to be trained on the disaster data until the model can eliminate, escape or mitigate the risk disaster within the response duration.
The state variation of the marine ranch is specifically the variation between the real-time pasture preprocessing state data before and after the interactive environment module takes the action sequence; the environment prediction error of the marine ranch is specifically the error between the predicted state value of the virtual pasture sea area after the interactive environment module takes the action sequence and the real-time pasture preprocessing state data.
The invention has the beneficial effects that:
The method can enlarge the standard decision data set for marine disasters related to marine ranches, improve the accuracy and flexibility of ranch disaster decisions, solve problems such as lagging marine risk and disaster decision technology, and improve the management and control efficiency of the ranch sea area; combined with a corresponding ranch decision-making system, it can realize functions such as assisted decision-making under manual supervision, provision of optimal decision schemes, and autonomous decision-making without manual response.
Drawings
FIG. 1 is a schematic diagram of a pasture disaster decision model;
FIG. 2 is a diagram of the quadruple <S, A, R, T> architecture;
FIG. 3 is a block diagram of an interactive environment module implementation;
FIG. 4 is a schematic flow chart of the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the embodiments.
The marine ranch disaster decision method comprises the following steps:
Step one: obtaining historical pasture state data of the marine ranch before the current moment, inputting the historical pasture state data into a data processing module for data preprocessing, and obtaining historical pasture preprocessing state data; the historical pasture preprocessing state data are input into an interactive environment module, and the interactive environment module constructs a virtual pasture sea area of the marine ranch; that is, the preprocessed historical pasture data are taken as input, and a sea area ecological simulation evaluation model based on an artificial neural network is constructed as the main body of the interactive environment module.

The historical pasture state data and the real-time pasture state data of the marine ranch comprise sea area multi-parameter sensor data, turbidity sensor data, flow velocity data, ecological simulation forecast data and the like. The sea area multi-parameter sensor data comprise data such as serial number, date, time, conductivity, chlorophyll, pH value, dissolved oxygen and sound velocity; the turbidity sensor data comprise turbidity data; the flow velocity data comprise data such as layer number, depth, raw flow velocity data, flow velocity in the x direction, flow velocity in the y direction, flow velocity in the z direction and resultant flow velocity direction; the ecological simulation forecast data comprise time, longitude, latitude, depth, water level, salinity, water temperature, eastward flow velocity, northward flow velocity and the like.

The historical pasture state data are input into the data processing module for data preprocessing to obtain historical pasture preprocessing state data. Specifically, the sea area multi-parameter sensor data, turbidity sensor data, flow velocity data, ecological simulation forecast data and the like in the historical pasture state data are respectively input into the data processing module and sequentially subjected to missing-value filling, random sampling, serialization and the like; in specific implementation, missing-value filling is applied to data with gaps. Each group of pasture state data is compressed according to the scale of the data set, the scale being judged mainly by the depth values and the number of data records; small data sets are processed by vertical averaging, large data sets are compressed with a VAE (variational autoencoder) model, and the processed outputs are combined to construct the historical pasture preprocessing state data.
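The following sketch illustrates the preprocessing flow just described: gap filling, random sampling, time-ordered serialization, and size-dependent compression. The column names, the size threshold, and the stand-in compression function are assumptions for illustration; the actual VAE model is not reproduced here.

```python
# Illustrative preprocessing sketch; thresholds, column names and the VAE stub are assumed.
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, large_threshold: int = 1000) -> np.ndarray:
    df = df.interpolate().ffill().bfill()          # fill missing (default) values
    df = df.sample(frac=0.8, random_state=0)       # random sampling
    df = df.sort_values("time")                    # serialization: order by time
    values = df.drop(columns=["time"]).to_numpy()
    if len(values) <= large_threshold:
        return values.mean(axis=0)                 # small data set: vertical averaging
    return compress_with_vae(values)               # large data set: VAE compression (stub)

def compress_with_vae(values: np.ndarray) -> np.ndarray:
    # Placeholder for the variational autoencoder compression; simple down-sampling
    # stands in here for the learned latent representation.
    return values[:: max(1, len(values) // 100)].mean(axis=0)

frame = pd.DataFrame({"time": range(5), "turbidity": [1.0, None, 1.2, 1.3, 1.1]})
state = preprocess(frame)
```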
In step one, the historical pasture state preprocessing data are input into the interactive environment module, and the interactive environment module constructs a virtual pasture sea area of the marine ranch. Specifically, the interactive environment module constructs the virtual pasture sea area from the historical pasture state preprocessing data, the deployment layout of the equipment placed in the marine ranch, the two-dimensional shallow-water equations of the sea area where the marine ranch is located, and an embedded second-order-moment turbulence closure submodel. The virtual pasture sea area can predict the environmental data of the next decision moment from different environmental data.

Step two: inputting the historical pasture state preprocessing data into a disaster judgment module, which judges whether a disaster occurs in the marine ranch; when a disaster occurs in the marine ranch, a preset post-disaster action is input into the interactive environment module through an action space module and applied to the virtual pasture sea area, and the interactive environment module outputs the feedback result produced by the virtual pasture sea area. The acquired pasture state data are all lag data, i.e., the acquired historical pasture data at the moment before the current moment are actually the historical pasture data of the marine ranch N hours before the current moment; in other words, the marine disaster actually affecting the marine ranch during those N hours is judged from the historical pasture data at the moment before the current moment.
And in the second step, the disasters of the marine ranch specifically comprise meteorological disasters, hydrological disasters and geological disasters, historical pasture state preprocessing data are input into the disaster judgment module, the disaster judgment module judges whether the marine ranch has disasters, specifically, the disaster judgment module judges whether the marine ranch meets early warning conditions of the meteorological disasters, the hydrological disasters or the geological disasters according to the historical pasture state preprocessing data of the marine ranch, and if the marine disasters meet the early warning conditions, the disaster judgment module judges that the marine ranch is in the meteorological disasters, the hydrological disasters or the geological disasters.
Each disaster included in the meteorological, hydrological and geological disasters has specific early warning conditions, namely the early warning interval standards corresponding to the national standards. For example, for storm surge, which belongs to the hydrological disasters, when the over-warning tide level, the wind speed and the one-third wave height H1/3 in the historical pasture state preprocessing data exceed the early warning conditions, the marine ranch is judged to be in a storm surge disaster.

In step two, the action space module contains a plurality of preset post-disaster actions, each of which corresponds to a meta-action taken to regulate a parameter value that exceeds its early warning value; the parameter value exceeding the early warning value is one of the parameter values in the historical pasture state data of the marine ranch, that is, one of the parameter values contained in the sea area multi-parameter sensor data, turbidity sensor data, flow velocity data, ecological simulation forecast data and the like, such as wind speed or tide level; the preset post-disaster actions include the start-stop time, start-stop duration, moving direction, moving speed and the like of the equipment that measures the parameter value exceeding the early warning value.

A preset post-disaster action must be an operation that the IoT equipment or data acquisition equipment can support, and when an action is taken, the overall action efficiency is attenuated according to the operating lag of the equipment involved. When several parameters are abnormal, the disaster type is also judged. A single decision step selects only one decision action from the action space; with a certain probability p the action is random (exploratory), and with the remaining probability 1-p it is the action with the maximum reward. A decision round comprises an action sequence formed by all actions of one or more decision steps, which can also be called a decision scheme.
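A sketch of the single-step action selection just described: with probability p an exploratory random action is chosen, otherwise the action with the largest estimated reward. The value of p, the action names and the Q-values below are illustrative assumptions.

```python
# Epsilon-greedy style selection sketch; p, action names and Q-values are assumed.
import random

def select_action(q_values: dict, p: float = 0.1) -> str:
    if random.random() < p:                       # exploratory random action with probability p
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)        # otherwise the action with maximum expected reward

q = {"raise_sensor": 0.4, "stop_device": 0.9, "move_north": 0.2}   # hypothetical meta-actions
decision_scheme = [select_action(q) for _ in range(5)]             # one action per decision step
```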
The interactive environment module outputs a feedback result generated by the virtual pasture sea area, specifically to pasture state data after the virtual pasture sea area takes preset post-disaster action.
Step three: the method comprises the steps of obtaining real-time pasture state data of a marine pasture, inputting the real-time pasture state data into a data processing module for data preprocessing, obtaining real-time pasture preprocessing state data, inputting the real-time pasture preprocessing state data into a decision module, and outputting preliminary decision data by the decision module.
In step three, the decision module is a deep Q network (DQN). The DQN specifically adopts a dual-memory LSTM model, which comprises a short-term memory network and a long-term memory network connected in sequence. The short-term memory network consists of two components: a deep Q network for learning the current task and an experience replay containing only the current task data. The long-term memory network also consists of two components: a deep Q network containing the knowledge learned from all tasks so far, and a generative adversarial network used to generate representations of the experience of these reinforcement learning tasks. The decision module is constructed as a DQN (Deep Q-learning Network) based on the Q-learning algorithm of deep learning; offline training is carried out with an off-policy strategy, a neural network approximates the value function, and the network is trained using a target network and experience replay.
Step four: inputting the preliminary decision data into an interactive environment module, and outputting a predicted state value and a state variable quantity of a virtual pasture sea area by the interactive environment module; the method comprises the steps that historical pasture state preprocessing data, feedback results generated by a virtual pasture sea area, a prediction state value and state variation are input into a disaster judgment module, and the disaster judgment module judges whether a disaster of a marine pasture is finished or not so as to output a judgment result.
In specific implementation, the disaster judgment module summarizes the historical pasture state preprocessing data, the feedback result generated by the virtual pasture sea area, the prediction state value and the state variation into a parameter disaster-causing correlation formula by combining the early warning condition and the threshold value, and is used for judging whether the current pasture sea area and the virtual sea area environment are in a risk disaster state, namely judging whether the disaster is ended; specifically, it may be determined whether the predicted state value is still in the risk interval.
In the fourth step, the preliminary decision data is specifically an action sequence formed by one or more preset post-disaster actions in the action space module, the preliminary decision data is input into the interactive environment module, the interactive environment module outputs a predicted state value and a state variation of the virtual pasture sea area after the action sequence is taken, the predicted state value of the virtual pasture sea area is specifically pasture state data of the virtual pasture sea area after the action sequence is taken, and the state variation of the virtual pasture sea area is the variation of the pasture state data of the virtual pasture sea area before and after the action sequence is taken.
In the actual processing, the interactive environment module can only receive sea area state parameters as input, so the action sequence in the preliminary decision data needs to be converted into an increase and decrease sequence of parameter values exceeding the early warning value of the pasture sea area state data at the last decision stage, and then the increase and decrease sequence is input into the interactive environment module, and meanwhile, the parameter change rate difference corresponding to the action in different disaster scenes needs to be considered.
When a meteorological, hydrological or geological disaster occurs in the marine ranch, several parameter values in the real-time pasture state data of the marine ranch exceed their early warning values, that is, several parameter values in the (slightly) delayed data exceed the early warning values, and a preset post-disaster action for regulating each of these over-warning parameter values must then be taken, forming the preliminary decision data. For example, in a storm surge disaster scene, the parameter values exceeding the early warning values are wind speed, tide level, wave height, flow velocity and the like; in this case the preset post-disaster actions include the start-stop, moving direction and moving speed of the devices that measure wind speed, tide level, wave height and flow velocity, and these actions together form the preliminary decision data.

In step four, the disaster judgment module judges whether the disaster of the marine ranch has ended and outputs the judgment result: when none of the previously over-warning parameter values in the predicted state values of the virtual pasture sea area still exceeds its early warning value, the disaster of the marine ranch is judged to have ended; when one or more of them still exceed their early warning values, the disaster is judged not to have ended.
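A sketch of this end-of-disaster judgment: the disaster is considered ended only when every previously over-warning parameter in the predicted state has fallen back to or below its early warning threshold. The parameter names and threshold values are assumptions.

```python
# Illustrative end-of-disaster check; parameter names and thresholds are assumed.
def disaster_ended(predicted_state: dict, warning_thresholds: dict, over_warning: list) -> bool:
    # Ended only if none of the previously over-warning parameters still exceeds its threshold.
    return all(predicted_state[p] <= warning_thresholds[p] for p in over_warning)

thresholds = {"wind_speed": 20.0, "tide_level": 3.5}
predicted = {"wind_speed": 12.0, "tide_level": 3.8}
print(disaster_ended(predicted, thresholds, ["wind_speed", "tide_level"]))   # False: tide level still too high
```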
Step five: and inputting the judgment result output by the disaster judgment module, the predicted state value and the state variable quantity of the virtual pasture sea area into the reward updating module, and calculating the current reward value by the reward updating module.
In step five, the judgment result output by the disaster judgment module is input into the reward updating module, and the reward updating module calculates the current reward value. Each time the disaster judgment module outputs a judgment result, one decision step time is consumed; when the disaster judgment module judges that the disaster of the marine ranch has not ended, a negative feedback value is given according to the current decision step time, and when the disaster judgment module judges that the disaster has ended, a positive feedback value is given according to the disaster type of the marine ranch. When the total decision step time consumed in the training process exceeds the response duration, the training round is judged to be finished, and the decision model continues to be trained on the disaster data until the model can eliminate, escape or mitigate the risk disaster within the response duration.
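A sketch of the reward rule just described: each judgment consumes one decision step, an unfinished disaster yields negative feedback according to the elapsed decision-step time, and a finished disaster yields positive feedback by disaster type. All numeric values below are assumptions.

```python
# Illustrative reward update; the numeric feedback values are assumptions.
POSITIVE_BY_TYPE = {"meteorological": 50.0, "hydrological": 80.0, "geological": 100.0}

def reward(step_time_elapsed: float, ended: bool, disaster_type: str) -> float:
    if not ended:
        return -step_time_elapsed                      # negative feedback per decision step time
    return POSITIVE_BY_TYPE[disaster_type]             # positive feedback chosen by disaster type

total = 0.0
for step in range(1, 4):
    total += reward(step_time_elapsed=1.0, ended=(step == 3), disaster_type="hydrological")
# total = -1.0 - 1.0 + 80.0 = 78.0
```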
The state change quantity of the marine ranch is specifically the change quantity between the real-time pasture preprocessing state data and the real-time pasture preprocessing state data after the interactive environment module takes the action sequence; the environment estimation error of the marine ranch is specifically an error between a predicted state value of a virtual ranch sea area and real-time ranch preprocessing state data after an interactive environment module takes an action sequence.
Step six: correcting the judgment result output by the disaster judgment module and the prediction state value of the virtual pasture sea area according to the real-time pasture preprocessing state data; and inputting the corrected judgment result, the prediction state value, the preliminary decision data, the state variation of the marine ranch and the environmental estimation error into a parameter optimization module for processing, and inputting the processed output into a decision module for updating and optimization.
During correction, the predicted state value of the sea area of the virtual pasture is corrected into real-time pasture preprocessing state data, meanwhile, the real-time state of the marine pasture is determined, and the judgment result output by the disaster judgment module is corrected into the real-time state of the marine pasture.
Step seven: repeating steps one to six to repeatedly train the disaster judgment module and the decision module until the reward value calculated by the reward updating module converges to its maximum, at which point training of the disaster judgment module and the decision module stops and the trained disaster judgment module and trained decision module are obtained; the number of training rounds required is also reduced on the basis of parameter optimization.
Step eight: the method comprises the steps of obtaining pasture monitoring state data of the marine ranch in real time, inputting the pasture monitoring state data into a data processing module for data preprocessing, obtaining pasture preprocessing monitoring state data, inputting the pasture preprocessing monitoring state data into a disaster judgment module which is trained, inputting the pasture preprocessing monitoring state data into a decision module which is trained when the disaster judgment module judges that the marine ranch has a disaster, outputting monitoring decision data after processing, and making a decision on the marine ranch which has the disaster according to the monitoring decision data.
The specific embodiment of the invention is as follows:
in the practical application process, the relevant parameters in the algorithm model can be specifically adjusted and optimized according to the real-time data returned by the on-site and near-shore equipment, and the specific steps comprise: acquiring a plurality of groups of real-time state data of a pasture; and inputting each group of pasture real-time state data into the trained algorithm model, and combining the change of the real-time state data after decision making to obtain the actual convergence speed so as to further update the relevant parameters of the algorithm model.
The sea area environment exhibits complexity and uncertainty, and the sea area environment parameters can be used without feature extraction; the data returned by the monitoring equipment are time-sequential, which suits the sequential-decision nature of reinforcement learning; at the same time, the information obtained by the decision agent is completely consistent with that available to a real decision maker and requires no supervision, meaning that the final decision the agent makes may be better than that of a human decision maker.
As shown in FIG. 2, based on the relevant characteristics of reinforcement learning and in combination with the disaster types that may exist in a marine ranch (namely the three dimensions of meteorological, hydrological and geological disasters), the real-time or slightly delayed state, action, decay coefficient, initial and final states, reward, state transition probability matrix and so on of the underwater agent are fully customized, and a Markov decision process represented by a quadruple <S, A, R, T> is modeled. The meaning of each element of the quadruple <S, A, R, T> is: S (State), the current state of the environment; A (Actor, or Agent), the agent; R (Reward), the return after a decision is made; T (Trajectory), a decision process.
To construct the pasture disaster decision model, the sea area environment is first taken as the environment with which the agent interacts; when the environment is not completely known but partially observable, the observed environment is taken as the input state. Then, all actions that a real decision maker can take, such as the start-stop of relevant equipment, are collected as the action set (Action). After the policy function, action value function and the like inside the Agent are designed and implemented, the Agent is trained. The end marker of one decision process (Episode) is disaster relief (i.e., every index returns to a normal value) or exceeding the upper limit of decision time; if the action made by the Agent returns the abnormal parameters to normal values, the obtained reward value (Reward) is positive, and it is negative if the disaster is aggravated. The training process is repeated until the Agent performs stably.
The above process is summarized as follows:
1. Agent: the model body;
2. Environment: the virtual sea area environment;
3. Action: the action set, such as the start-stop of relevant equipment;
4. Trajectory (Episode): one sample (i.e., one decision round), with two types of end markers:
1) disaster relief (each index returns to a normal value);
2) exceeding the upper limit of decision time;
5. Reward: the profit value of an action, evaluating the action made by the Agent:
1) positive value: abnormal parameters regress to normal values;
2) 0 or negative value: no change, or the abnormality is aggravated.
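The quadruple and the episode termination rules above can be captured in a small data structure; the sketch below is an assumed mapping of the <S, A, R, T> elements to code, not the patented implementation.

```python
# Illustrative mapping of the <S, A, R, T> quadruple and episode end markers to a data structure.
from dataclasses import dataclass, field

@dataclass
class PastureEpisode:
    state: dict                                   # S: current sea-area state
    actions: list = field(default_factory=list)   # A: actions taken so far (the trajectory T)
    rewards: list = field(default_factory=list)   # R: returns obtained after each decision
    max_steps: int = 20                           # upper limit of decision time

    def done(self, thresholds: dict) -> bool:
        relieved = all(v <= thresholds[k] for k, v in self.state.items())   # disaster relief marker
        return relieved or len(self.actions) >= self.max_steps              # or decision-time limit

episode = PastureEpisode(state={"wind_speed": 25.0}, max_steps=10)
print(episode.done({"wind_speed": 20.0}))   # False: the index has not yet returned to a normal value
```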
s1: state (state)
S101: sea area state transition matrix
Let the history of the state of the sea area of the pasture be h t ={s 1 ,s 2 |,s 3 ,...,s t }(h t Including all the states before the pasture sea area), s 1 、s 2 、s 3 ,...,s t The states of the pasture sea areas at the time points 1, 2 and 3 … t before the current time point are shown.
If the state transition of the marine environment is Markovian, that is, the next state of the pasture sea area depends only on its current state and is independent of the states before the current one, the following conditions are satisfied:

p(s_{t+1} | s_t) = p(s_{t+1} | h_t)
p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t)

where p(·) denotes the state transition probability of the pasture sea area; s_{t+1} denotes the state of the pasture sea area at time t+1; h_t denotes all historical states of the pasture sea area up to time t; and a_t denotes the action taken at time t.
However, in most cases, limited by equipment or sea conditions, some parameters of the sea area environment are not observable; this partially observed problem can still be transformed so as to satisfy the MDP assumption. The state transition probability p(s_{t+1} = s' | s_t = s) of the sea area is described by the state transition matrix P of the sea area:

P = [ p(s_1|s_1)  p(s_2|s_1)  ...  p(s_N|s_1)
      p(s_1|s_2)  p(s_2|s_2)  ...  p(s_N|s_2)
      ...
      p(s_1|s_N)  p(s_2|s_N)  ...  p(s_N|s_N) ]

where s' denotes the state of the pasture sea area at the next moment and s_1, ..., s_N denote the N possible states of the pasture sea area.
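A small sketch of how such a transition matrix is used: row i holds the probabilities of moving from state s_i to each possible next state, and sampling that row yields the next state. The 3-state matrix below is an assumed toy example.

```python
# Toy state transition matrix P and sampling of the next sea-area state from it.
import numpy as np

P = np.array([[0.7, 0.2, 0.1],     # row i: probabilities p(s_j | s_i) for each next state s_j
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])

def next_state(current: int, rng=np.random.default_rng(0)) -> int:
    return int(rng.choice(len(P), p=P[current]))   # sample the next state index from row `current`

print(next_state(0))
```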
In this embodiment, when the sea area range is small, the number of acquisition devices is small and the amount of data after mean processing, sampling and the like is small, the historical state data of the pasture sea area can be used directly as the current state in the Bellman equation; the iterative relationship between the current state and future states is converted into a value-function relationship, and the value function of each state is calculated by combining the value functions of all states into a system of equations. In that case the method can skip step S103 of building and applying the sea area model.
V(s) = R(s) + γ Σ_{s' ∈ S} p(s' | s) V(s')

where V(·) denotes the state value function; R(·) denotes the reward function; and γ denotes the discount factor.
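As noted above, for a small state set the value functions of all states can be obtained by solving the Bellman equations jointly as a linear system, V = R + γPV, i.e. V = (I − γP)⁻¹R. The matrices below are assumed toy values.

```python
# Solving the Bellman equation as a linear system for a small state set (toy values).
import numpy as np

P = np.array([[0.8, 0.2],          # state transition matrix
              [0.4, 0.6]])
R = np.array([-1.0, 5.0])          # reward for each state
gamma = 0.95                       # discount factor

V = np.linalg.solve(np.eye(2) - gamma * P, R)   # V = (I - gamma * P)^-1 R
print(V)                                        # value of each state
```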
S102: state space
The application scene of the embodiment can be any marine ranch risk disaster scene, such as typhoon waves, storm tides and the like. If the application scene is a storm surge scene, the current state data may be flow rate data (including number of layers, depth, raw flow rate data, x-direction flow rate, y-direction flow rate, z-direction flow rate, synthesized flow rate direction, etc.) and ecological simulation forecast data (including time, longitude, latitude, depth, water level, salinity, water temperature, east-direction flow rate, north-direction flow rate) at the current moment, and may be obtained by sensor feedback or ecological numerical model simulation. The embodiment does not limit the application scenario and the data acquisition mode, and can be constructed according to the scenario requirements in specific implementation.
S103: interactive environment
The types of equipment deployed in marine ranches fall largely into three categories, namely aquaculture equipment, monitoring equipment and IoT equipment, with aquaculture equipment being the most abundant, accounting for more than ninety percent in most ranches.
In the boundary division, the equipment density has a great influence on the model's prediction and estimation of the parameter variation trend of the current water area. The sea area is divided in the horizontal and vertical directions; in the horizontal direction, three types of boundary conditions are applied from large to small density: closed lake water conditions, semi-closed sea area conditions and open sea area conditions. In the vertical direction, reasonable layering is carried out according to the types of deployed aquaculture equipment and the cultured organisms; single-layer aquaculture equipment does not need vertical boundaries, while multi-layer deployments are divided into three layers according to the water-air contact surface, shallow water and deep sea. The interactive environment module is established based on the semi-closed sea area and open ocean parameter prediction parts of the Princeton Ocean Model (POM).
In this embodiment, the interactive environment construction flow is shown in FIG. 3. Driving-factor parameters are set according to the pasture planning scheme (comprising the culture type, culture area and suitable environmental parameter intervals), sampled in combination with the pasture historical data, and input into a neural network constructed on an ANN (artificial neural network) model, which outputs the land-use probability of the sea area as one part of the total probability; the other part of the total probability consists of the product of a transfer matrix, a neighborhood term and adaptive inertia. Random seeds are defined and roulette-wheel selection is performed to output a simulation result; when the simulation result meets the requirement under the value judgment based on a Markov prediction chain, the result is output, otherwise the adaptive inertia proportion is adjusted until the result meets the requirement.
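A sketch of the probability combination and roulette-wheel selection step in the flow above. The multiplicative form of the total probability, the factor names and the cell values are assumptions for illustration; the actual POM/ANN implementation is not reproduced.

```python
# Illustrative combination of the partial probabilities and roulette-wheel selection (assumed form).
import random

def total_probability(ann_prob, transfer, neighborhood, inertia):
    # Total probability = ANN land-use probability x transfer-matrix term
    #                     x neighborhood term x adaptive inertia (assumed multiplicative form).
    return ann_prob * transfer * neighborhood * inertia

def roulette_select(candidates: dict, seed: int = 0) -> str:
    random.seed(seed)                                  # defined random seed
    total = sum(candidates.values())
    r, acc = random.uniform(0, total), 0.0
    for name, weight in candidates.items():
        acc += weight
        if r <= acc:
            return name
    return name

cells = {"cell_a": total_probability(0.6, 0.9, 0.8, 1.1),
         "cell_b": total_probability(0.3, 0.7, 0.9, 1.0)}
print(roulette_select(cells))
```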
S2: decision-making body (Agent/Actor)
The decision module is a deep Q network (DQN). The DQN specifically adopts a dual-memory LSTM model, which comprises a short-term memory network and a long-term memory network connected in sequence. The short-term memory network consists of two components: a deep Q network for learning the current task and an experience replay containing only the current task data. The long-term memory network also consists of two components: a deep Q network containing the knowledge learned from all tasks so far, and a generative adversarial network used to generate representations of the experience of these reinforcement learning tasks. The decision module is constructed as a DQN (Deep Q-learning Network) based on the Q-learning algorithm of deep learning; offline training is carried out with an off-policy strategy, a neural network approximates the value function, and the network is trained using a target network and experience replay.
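A compact sketch of a DQN update with a target network and experience replay, as described above. The network sizes, dimensions, optimizer and replay format are assumptions for illustration; the dual-memory LSTM structure and the generative adversarial component are not reproduced here.

```python
# Minimal DQN training-step sketch (assumed architecture; not the patented dual-memory model).
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 8, 4, 0.95

def make_q_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q_net, target_net = make_q_net(), make_q_net()
target_net.load_state_dict(q_net.state_dict())          # target network starts as a copy of the online network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                            # experience replay buffer

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)            # sample past transitions (off-policy)
    s, a, r, s2, done = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)                # Q(s, a)
    with torch.no_grad():
        target = r + GAMMA * target_net(s2).max(1).values * (1 - done)      # bootstrap from target network
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # The target network would periodically be re-synchronized with q_net.

for _ in range(64):                                      # dummy transitions just to exercise the step
    replay.append(([0.0] * STATE_DIM, random.randrange(N_ACTIONS), -1.0, [0.0] * STATE_DIM, 0.0))
train_step()
```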
S3: reward (Reward)
S301: reward function
The most critical element for agent behavior in the decision process is the reward function R. The reward function R expresses the expected reward obtained when a certain state is reached, where rewards obtained in future states are multiplied by a discount factor γ:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots + \gamma^{T-t-1} R_T$$
wherein $G_t$ represents the total discounted return at time $t$; $R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}, \dots, R_T$ represent the reward functions at times $t+1, t+2, t+3, \dots, T$ respectively.
Since the simulated environment still cannot be completely consistent with the actual environment, there is uncertainty in the evaluation of the future state in the decision-making process. Meanwhile, in order to enable an agent to obtain the reward as soon as possible, rather than obtaining the reward at a certain point in the future, the discount factor attenuates the reward obtained in the future, indicating to the agent that the reward obtained at present is more important.
The value of the discount factor γ in this embodiment is 0.95; the actual value can be adjusted according to the needs of a specific disaster application scenario. In particular, when γ takes the value 1, rewards obtained in the future are as important as the reward obtained now, meaning future rewards are not discounted; when γ takes the value 0, only the immediate reward is considered and future rewards are ignored entirely.
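As a worked example with γ = 0.95, the discounted return of a short hypothetical trajectory (the disaster persists for four decision steps and is then relieved, matching the −1 / +500 rewards used later in this embodiment) can be computed as follows:

```python
def discounted_return(rewards, gamma=0.95):
    """Total discounted return G_t for a reward sequence R_{t+1}, ..., R_T."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical trajectory: four steps still in the disaster (-1 each), then relief (+500)
print(discounted_return([-1, -1, -1, -1, 500]))   # ≈ 403.5 with gamma = 0.95
```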
$$P = \begin{bmatrix} P(s_1 \mid s_1) & \cdots & P(s_N \mid s_1) \\ \vdots & \ddots & \vdots \\ P(s_1 \mid s_N) & \cdots & P(s_N \mid s_N) \end{bmatrix}$$
After determining the state transition matrix of the pasture sea area based on the Markov chain, sampling the chain yields a series of trajectories. The reward process can be understood as the superposition of a Markov chain and a reward function R:
$$V_t(s) = \mathbb{E}\left[\, G_t \mid s_t = s \,\right]$$
wherein $V_t(\cdot)$ represents the state value function at time $t$; $\mathbb{E}[\cdot]$ denotes the expectation; $s$ represents the current state of the pasture sea area.
The concrete calculation of the state value function $V^{\pi}(s)$ is performed through the Bellman equation:
$$V(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, V(s')$$
where $S$ represents the set of possible states of the pasture sea area at the next time and $s'$ denotes a state in that set.
The state value function can be estimated in two different ways: a Monte Carlo sampling (MC-based) method and a temporal-difference (TD-based) method. Under the Monte Carlo sampling approach, once an MRP is obtained, a number of trajectories can be sampled starting from a given state; the discounted return G_t obtainable along each trajectory is computed, the returns are summed, and the sum is divided by the number of trajectories to approximate the value of that state.
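A minimal sketch of the Monte Carlo estimate is given below; the geometric trajectory sampler and all numbers are purely illustrative assumptions, not the pasture MRP itself:

```python
import numpy as np

def mc_state_value(sample_trajectory, start_state, n_trajectories=1000, gamma=0.95):
    """Monte Carlo estimate of V(start_state): sample trajectories, compute the
    discounted return of each, and average over the number of trajectories."""
    total = 0.0
    for _ in range(n_trajectories):
        rewards = sample_trajectory(start_state)          # one sampled reward sequence
        total += sum((gamma ** k) * r for k, r in enumerate(rewards))
    return total / n_trajectories

# Hypothetical sampler: the disaster lasts a geometrically distributed number of steps
rng = np.random.default_rng(0)
def sample_trajectory(state):
    steps = rng.geometric(p=0.2)                           # steps until relief
    return [-1.0] * steps + [500.0]

print(mc_state_value(sample_trajectory, start_state="storm_surge_level_2"))
```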
S302: special prize value
In this embodiment the optimization objective is to relieve the current risk and disaster state, or to leave the current risk and disaster environment, within the shortest time. The reward function is therefore set so that each decision time step that elapses incurs a fixed negative value as a penalty, regardless of whether a decision is made. The reward function can also be set according to the actual objective, for example composed of equally weighted terms such as disaster-causing type, disaster-causing factor deviation value and decision feedback.
In this embodiment, taking a storm surge disaster scene as an example, the special reward values are set as shown in Table 1 below; the reward value is updated correspondingly at each decision step according to the level range of the disaster.
Table 1. Reward value settings in the storm surge disaster scenario of this embodiment
[Table body: reward value assigned per storm surge disaster level range; available only as images in the original publication.]
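Since the body of Table 1 is only available as an image, the sketch below uses placeholder per-level penalties; only the fixed per-step penalty and the relief bonus of +500 follow values stated elsewhere in this embodiment:

```python
def step_reward(disaster_relieved: bool, disaster_level: int) -> float:
    """Per-step reward: a fixed penalty while the disaster persists and a large
    positive value when it is relieved. The level-dependent penalties are
    illustrative placeholders, not the values of Table 1."""
    if disaster_relieved:
        return 500.0
    step_penalty = {1: -1.0, 2: -2.0, 3: -5.0}     # hypothetical per-level penalties
    return step_penalty.get(disaster_level, -1.0)

print(step_reward(False, 2))   # -2.0 while a level-2 storm surge persists
print(step_reward(True, 2))    # 500.0 once the disaster state is relieved
```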
S4: single decision round (Trajectory/Episode) and repetitive training
S401: general training procedure
As shown in fig. 4, historical pasture state data of the marine pasture are monitored and captured by a python script and input into the data processing module for preprocessing to obtain historical pasture preprocessing state data. Specifically, the historical pasture state data are input into the data processing module and sequentially subjected to missing-value imputation, random sampling and serialization; in a specific implementation, records with missing values are filled with default values. Each group of pasture state data is valued by depth: small data sets are vertically averaged, while large data sets are compressed with a VAE model. The processed outputs are combined into the historical pasture preprocessing state data and stored in a MongoDB database.
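The preprocessing chain can be sketched in Python with pandas as follows; the column names, sampling fraction and size threshold are assumptions for illustration, and the VAE compression and MongoDB storage steps are intentionally omitted:

```python
import numpy as np
import pandas as pd

def preprocess(history: pd.DataFrame, large_threshold: int = 100_000) -> pd.DataFrame:
    """Sketch of the preprocessing chain: missing-value imputation, random
    sampling and depth-wise (vertical) averaging for small data sets."""
    filled = history.ffill().bfill()                         # impute missing values
    sampled = filled.sample(frac=0.8, random_state=0)        # random sampling
    if len(sampled) < large_threshold:
        # small data set: average each parameter over the depth layers
        return sampled.groupby(["time", "longitude", "latitude"], as_index=False).mean()
    return sampled   # large data set: would be compressed with a VAE instead

# Hypothetical two-layer record at one station
df = pd.DataFrame({
    "time": ["t0", "t0"], "longitude": [122.1, 122.1], "latitude": [30.5, 30.5],
    "depth": [1.0, 5.0], "water_temp": [18.2, np.nan], "salinity": [31.0, 31.4],
})
print(preprocess(df))
```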
The historical pasture preprocessing state data are input, in time-sequence order, into the interactive environment model of the sea area ecological simulation evaluation model for dynamic evaluation, and the decision time step is limited to 1/1000 of the data update interval (the data update interval is 5 minutes during a disaster period and 1 hour otherwise). The initial reward is set to 0; if the disaster has still not ended after one decision time step, the reward is decreased by 1. Abnormal values in the historical state data are adjusted. The action space (increase and decrease values) is continuous, with the value range calculated from the parameter variation curve, i.e. from the variation law of the sea area hydrodynamic parameters; the actions mainly consist of using the deployed underwater equipment to regulate those sea area parameters that can be monitored and adjusted. The action is input into the interactive environment model, the prediction data are updated and input into the disaster judgment module, which calculates and judges whether the current risk disaster has been relieved. If real-time data arrive with a delay (for hardware reasons), they can be used to correct the prediction data output by the model and to update the disaster state. When the risk disaster is relieved, the reward is increased by 500 (this value can be used as a tunable parameter). Training is repeated until the reward value converges to its maximum. In a live situation, real-time data are sampled and serialized and input into the trained algorithm model, while decision feedback from the actual sea area environment is collected to adjust and update parameters and to verify the correctness of the algorithm model; the experiment is repeated and the strategy optimized.
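The training loop can be sketched as follows; env, agent and their methods are hypothetical interfaces standing in for the interactive environment module, disaster judgment module and decision module, and only the −1 per-step and +500 relief rewards follow the text above:

```python
def train_episode(env, agent, max_steps=500):
    """One training round against the interactive environment model: the reward
    starts at 0, decreases by 1 per decision step while the disaster persists,
    and increases by 500 when the disaster judgment module reports relief."""
    state = env.reset()                      # historical preprocessing state data, in time order
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.select_action(state)           # regulate monitorable sea area parameters
        next_state = env.step(action)                 # interactive environment prediction
        relieved = env.disaster_relieved(next_state)  # disaster judgment module
        reward = 500.0 if relieved else -1.0
        agent.store(state, action, reward, next_state)
        agent.learn()
        total_reward += reward
        state = next_state
        if relieved:
            break
    return total_reward
```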
During training, the state data and decision strategy in the historical pasture state data can be used as inputs to a BNN (Bayesian neural network), with the state variation and reward value as its outputs, and the BNN is trained iteratively; the state variation denotes the difference between the next state data and the previous state data. A Bayesian neural network can thus be obtained from a small amount of historical data, and a virtual environment of the whole pasture sea area can be constructed from it, providing more learnable training data for the reinforcement learning model.
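A minimal environment-model sketch is shown below; it uses Monte Carlo dropout as a convenient stand-in for a full Bayesian neural network, so the uncertainty estimate is approximate, and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class DropoutEnvModel(nn.Module):
    """Approximate Bayesian environment model: maps (state, action) to the
    predicted state variation and reward. Monte Carlo dropout at inference
    time stands in for a full Bayesian neural network here."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, state_dim + 1),        # state variation + reward
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

    def sample_predictions(self, state, action, n_samples=20):
        self.train()                                 # keep dropout active for MC sampling
        with torch.no_grad():
            return torch.stack([self(state, action) for _ in range(n_samples)])

model = DropoutEnvModel(state_dim=9, action_dim=3)
preds = model.sample_predictions(torch.randn(1, 9), torch.randn(1, 3))
print(preds.mean(0).shape, preds.std(0).shape)       # predictive mean and uncertainty
```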
To address the problem that the deep Q network may fail to converge when disaster scenes are scarce, and considering that real-time data returned from the actual sea area inherently carry a delay, this embodiment adopts the fixed Q-targets method of delayed parameter updates, so that the DQN has two networks: a prediction network (Predict Q Network) and a target network (Target Q Network). The prediction network predicts the Q value of each action for the current state, and the target network predicts the Q value of each action for the next state. The parameters of the prediction network are updated in real time, while the target network decides whether to update according to the parameter-update results of the prediction network, thereby eliminating adverse interference caused by abnormal environmental parameters, complex disaster-causing factors and the like in a disaster scene.
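The fixed Q-targets mechanism can be sketched as below; the periodic parameter copy is a common stand-in, since the embodiment's own criterion for updating the target network is not spelled out here, and the interfaces are assumptions:

```python
import torch

def dqn_targets(batch, predict_net, target_net, gamma=0.95):
    """Bellman targets under fixed Q-targets: the prediction network evaluates
    the current state, the (periodically frozen) target network the next state."""
    states, actions, rewards, next_states, done = batch        # actions: long indices
    q_pred = predict_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
    q_target = rewards + gamma * q_next * (1.0 - done)
    return q_pred, q_target

def maybe_sync(predict_net, target_net, step, sync_every=1000):
    """Copy prediction-network parameters into the target network every few steps."""
    if step % sync_every == 0:
        target_net.load_state_dict(predict_net.state_dict())
```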
S402-S403: special scenario training and real-time decision scenario training
Because this embodiment adopts an off-policy offline training mode combined with experience replay, the (state, action, reward, next state) tuple of each step is cached, and batch training is performed multiple times after a round ends, which improves the training speed and stability of the DQN. The specific implementation is as follows: a cache array of specified size is maintained; in each round, N newly generated (state, action, reward, next state) tuples randomly replace N existing entries in the cache pool, and several rounds of training are performed after the round ends.
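A cache with random replacement, as described above, might look like the following minimal sketch (capacity and seed are arbitrary):

```python
import random

class ReplayBuffer:
    """Fixed-size cache of (state, action, reward, next_state) tuples; once full,
    new tuples randomly replace existing entries, as described above."""
    def __init__(self, capacity=10_000, seed=0):
        self.capacity = capacity
        self.pool = []
        self.rng = random.Random(seed)

    def add(self, transition):
        if len(self.pool) < self.capacity:
            self.pool.append(transition)
        else:
            self.pool[self.rng.randrange(self.capacity)] = transition

    def sample(self, batch_size):
        return self.rng.sample(self.pool, min(batch_size, len(self.pool)))

buffer = ReplayBuffer(capacity=5)
for step in range(8):
    buffer.add((f"s{step}", "a", -1.0, f"s{step + 1}"))
print(len(buffer.pool), buffer.sample(3))
```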
In this embodiment, the reinforcement-learning-based marine ranch disaster decision algorithm model is trained in an off-policy offline mode, i.e. the algorithm model used at run time is completed in advance according to the method above. When it is formally used in an actual scene, more accurate decision data can be obtained directly by inputting the current state data. When the decision scene type is a storm surge disaster, the reinforcement learning model is trained in advance according to the reinforcement learning method of this embodiment; when a real-time decision is actually needed, only the current state data need to be input to obtain decision data that satisfy the optimization objective. Since the decision objective of the trained algorithm model is to relieve the current risk disaster state, or leave the current risk disaster environment, in the shortest time, the decision data indicate how to act in order to achieve this objective.
In each training round, the generative adversarial network is retrained so that it can produce samples representative of both the previous generative adversarial network and the current task experience. As with the short-term deep Q network, the generative adversarial network only needs to follow its standard training method and loss function. Building the decision model for practical application requires related research from the preceding project, such as determining the disaster-causing factors, the specific operations of the monitoring equipment, and the construction of the training set and action set. In the actual training process, the sea area ecological numerical model constructed in the preceding project is used as the virtual sea area environment, which greatly improves the training speed and decision performance of the model.
Reinforcement learning is one of the paradigms and methodologies of machine learning. Its basic idea is to influence the state of the environment by applying actions and to learn, by perceiving the environment's response to those actions, the optimal strategy for accomplishing the goal. The task of reinforcement learning is to learn how to map the current environmental state to an action so as to maximize the reward signal. Agents, environments, policies, reward signals, value functions and environment models constitute the basic elements of reinforcement learning. The reinforcement-learning-based pasture disaster decision method provided by the invention enables the agent to autonomously analyze the historical data of the various pasture sea area monitoring indicators and the real-time data transmitted with slight delay, determine how to regulate and control so as to resolve the current problem, and ultimately achieve self-learning.
The invention combines an interactive environment model constructed from the abundant historical state data of the marine ranch, determines the action space and the corresponding implementation strategy through disaster factor correlation analysis and action feedback calculation of a disaster classification system, and realizes real-time decision-making and strategy evaluation for marine ranch risks and disasters. It provides a foundation for later scientific research, pasture layout optimization and the construction of a pasture disaster response system, and promotes the sustainable development of marine ranches.

Claims (10)

1. A marine ranch disaster decision method based on reinforcement learning is characterized in that: the method comprises the following steps:
the method comprises the following steps: obtaining historical pasture state data of a marine pasture before the current moment, inputting the historical pasture state data into a data processing module for data preprocessing, and obtaining historical pasture preprocessing state data; inputting the historical pasture state preprocessing data into an interactive environment module, and constructing a virtual pasture sea area of a marine pasture in the interactive environment module;
step two: inputting the historical pasture state preprocessing data into a disaster judgment module, judging whether a disaster occurs in a marine pasture by the disaster judgment module, inputting a preset post-disaster action into an interactive environment module through an action space module when the disaster occurs in the marine pasture, and adopting the preset post-disaster action on a virtual pasture sea area, wherein the interactive environment module outputs a feedback result generated by the virtual pasture sea area;
step three: acquiring real-time pasture state data of a marine pasture, inputting the real-time pasture state data into a data processing module for data preprocessing, acquiring real-time pasture preprocessing state data, inputting the real-time pasture preprocessing state data into a decision module, and outputting preliminary decision data by the decision module;
step four: inputting the preliminary decision data into an interactive environment module, and outputting a predicted state value and a state variable quantity of a virtual pasture sea area by the interactive environment module; inputting the historical pasture state preprocessing data, feedback results generated by the virtual pasture sea area, predicted state values and state variable quantities into a disaster judgment module, and judging whether the disasters of the marine pasture are finished or not by the disaster judgment module so as to output judgment results;
step five: inputting the judgment result output by the disaster judgment module, the predicted state value and the state variation of the virtual pasture sea area into a reward updating module, and calculating the current reward value by the reward updating module;
step six: correcting the judgment result output by the disaster judgment module and the prediction state value of the virtual pasture sea area according to the real-time pasture preprocessing state data; inputting the corrected judgment result and the prediction state value, the preliminary decision data, the state variation of the marine ranch and the environmental prediction error into a parameter optimization module for processing, and inputting the processed output into a decision module for updating and optimizing;
step seven: repeating the first step to the sixth step to repeatedly train the disaster judgment module and the decision module until the reward value calculated and obtained by the reward updating module converges to the maximum value, stopping the training of the disaster judgment module and the decision module, and obtaining the trained disaster judgment module and the trained decision module;
step eight: the method comprises the steps of obtaining pasture monitoring state data of the marine ranch in real time, inputting the pasture monitoring state data into a data processing module for data preprocessing, obtaining pasture preprocessing monitoring state data, inputting the pasture preprocessing monitoring state data into a disaster judgment module which is trained, inputting the pasture preprocessing monitoring state data into a decision module which is trained when the disaster judgment module judges that the marine ranch has a disaster, outputting monitoring decision data after processing, and making a decision on the marine ranch which has the disaster according to the monitoring decision data.
2. The reinforcement learning-based marine ranch disaster decision method according to claim 1, characterized in that: the historical pasture state data and the real-time pasture state data of the marine pasture comprise sea area multi-parameter sensor data, turbidity sensor data, flow rate data and ecological simulation forecast data;
the historical pasture state data are input into a data processing module to be subjected to data preprocessing, historical pasture preprocessing state data are obtained, specifically, sea area multi-parameter sensor data, turbidity sensor data, flow rate data and ecological simulation forecast data in the historical pasture state data are respectively input into the data processing module to be sequentially subjected to deficiency value supplement, random sampling and serialization processing, and processed outputs are jointly constructed into the historical pasture preprocessing state data.
3. The method for marine ranch disaster decision based on reinforcement learning according to claim 1, characterized in that: in the first step, the historical pasture state preprocessing data is input into the interactive environment module, the interactive environment module constructs a virtual pasture sea area of the marine pasture, and specifically, the interactive environment module constructs the virtual pasture sea area according to the historical pasture state preprocessing data, the throwing layout structure of each device in the marine pasture sink, a two-dimensional shallow water equation of the sea area where the marine pasture is located and the embedded second-order-moment turbulence closed submodel.
4. The method for marine ranch disaster decision based on reinforcement learning according to claim 1, characterized in that: in the second step, the disasters of the marine ranch specifically include meteorological disasters, hydrological disasters and geological disasters; historical pasture state preprocessing data are input into the disaster judgment module, and the disaster judgment module judges whether the marine ranch has disasters; specifically, the disaster judgment module judges, according to the historical pasture state preprocessing data of the marine ranch, whether the marine ranch meets the early warning conditions for the occurrence of a meteorological disaster, a hydrological disaster or a geological disaster, and if the early warning conditions are met, the disaster judgment module judges that the marine ranch is in the corresponding meteorological disaster, hydrological disaster or geological disaster.
5. The method for marine ranch disaster decision based on reinforcement learning according to claim 4, characterized in that: in the second step, the action space module comprises a plurality of preset post-disaster actions, each preset post-disaster action corresponds to a meta-action taken when a parameter value exceeding the early warning value is regulated, and the parameter value exceeding the early warning value is one of parameter values in historical pasture state data of the marine pasture, namely one of parameter values contained in sea area multi-parameter sensor data, turbidity sensor data, flow rate data and ecological simulation forecast data; presetting post-disaster actions including measuring the start-stop time, the start-stop duration, the moving direction and the moving speed of the equipment with parameter values exceeding the early warning value;
the interactive environment module outputs a feedback result generated by the virtual pasture sea area, and specifically, the feedback result is pasture state data after the virtual pasture sea area takes preset post-disaster actions.
6. The method for marine ranch disaster decision based on reinforcement learning according to claim 1, characterized in that: in the third step, the decision module is a deep Q network DQN, the deep Q network DQN specifically adopts a dual-memory model LSTM, and the dual-memory model LSTM includes a short-term memory network and a long-term memory network which are connected in sequence.
7. The reinforcement learning-based marine ranch disaster decision method according to claim 1, characterized in that: in the fourth step, the preliminary decision data is specifically an action sequence formed by one or more preset post-disaster actions in the action space module, the preliminary decision data is input into the interactive environment module, the interactive environment module outputs a predicted state value and a state variation of the virtual pasture sea area after the action sequence is taken, the predicted state value of the virtual pasture sea area is specifically pasture state data of the virtual pasture sea area after the action sequence is taken, and the state variation of the virtual pasture sea area is the variation of the pasture state data of the virtual pasture sea area before and after the action sequence is taken.
8. The method for marine ranch disaster decision based on reinforcement learning according to claim 4, characterized in that: in the fourth step, the disaster judgment module judges whether the disaster of the marine ranch is ended or not so as to output a judgment result, when all parameter values exceeding the early warning value in the prediction state values of the sea area of the virtual ranch do not exceed the early warning value, the disaster of the marine ranch is judged to be ended, and when one or more parameter values exceeding the early warning value in all parameter values exceeding the early warning value in the prediction state values of the sea area of the virtual ranch still exceed the early warning value, the disaster of the marine ranch is judged not to be ended.
9. The reinforcement learning-based marine ranch disaster decision method according to claim 1, characterized in that: and fifthly, inputting the judgment result output by the disaster judgment module into the reward updating module, calculating the current reward value by the reward updating module, consuming decision step length time once the judgment result output by the disaster judgment module is that the disaster of the marine ranch is not finished, giving a negative feedback value according to the current decision step length time when the disaster judgment module judges that the disaster of the marine ranch is finished, and giving a positive feedback value according to the disaster type of the marine ranch when the disaster judgment module judges that the disaster of the marine ranch is finished.
10. The method of claim 7, wherein the method comprises: the state change quantity of the marine ranch is specifically the change quantity between real-time pasture preprocessing state data and the real-time pasture preprocessing state data after the interactive environment module takes an action sequence; the environment estimation error of the marine ranch is specifically an error between a predicted state value of a virtual ranch sea area and real-time ranch preprocessing state data after an interactive environment module takes an action sequence.
CN202211386315.5A 2022-11-07 2022-11-07 Marine ranch disaster decision method based on reinforcement learning Pending CN115587713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211386315.5A CN115587713A (en) 2022-11-07 2022-11-07 Marine ranch disaster decision method based on reinforcement learning


Publications (1)

Publication Number Publication Date
CN115587713A true CN115587713A (en) 2023-01-10

Family

ID=84782691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211386315.5A Pending CN115587713A (en) 2022-11-07 2022-11-07 Marine ranch disaster decision method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115587713A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115795401A (en) * 2023-02-08 2023-03-14 青岛海洋地质研究所 Ocean pasture full-factor monitoring sensor multi-data fusion system
CN115795401B (en) * 2023-02-08 2023-04-21 青岛海洋地质研究所 Multi-data fusion system of marine pasture full-element monitoring sensor


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination