CN115829717A - Wind control decision rule optimization method, system, terminal and storage medium

Info

Publication number: CN115829717A (granted as CN115829717B)
Application number: CN202211182023.XA
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 吴婧, 张志远, 洪镇宇
Assignee: Xiamen International Bank Co ltd
Legal status: Active (granted)


Abstract

The invention provides a wind control decision rule optimization method, system, terminal and storage medium, wherein the method comprises the following steps: performing interactive exploration between a meta-agent and the simulation environment of each wind control decision system to obtain an interaction trajectory data set; performing model fitting training on a model and then model predictive control processing on the trained model to obtain an update strategy; updating the preset wind control strategy according to the update strategy, and returning, with the updated preset wind control strategy, to the step of interactive exploration between the meta-agent and the simulation environment of each wind control decision system and its subsequent steps; setting the parameters of the meta-agent, and performing interactive exploration between the meta-agent and the wind control decision system to be optimized to obtain a target exploration trajectory; and optimizing the wind control decision system to be optimized according to the target exploration trajectory. The method and system can automatically optimize the wind control decision rules in the wind control decision system to be optimized, reducing the manpower and material resources required and improving the efficiency of wind control decision rule optimization.

Description

Wind control decision rule optimization method, system, terminal and storage medium
Technical Field
The invention relates to the technical field of wind control decision, in particular to a method, a system, a terminal and a storage medium for optimizing a wind control decision rule.
Background
In the field of financial technology, with the development of artificial intelligence, a large number of intelligent algorithm models have been applied to wind control decision scenarios. To preserve the interpretability of decisions and the controllability of processes, rule-based wind control decision systems remain the cornerstone of each application scenario, and in the current financial environment many leading-edge algorithm and model technologies mainly play an auxiliary role in helping the wind control decision system make decisions. Optimizing the wind control decision rules in the wind control decision system has therefore received increasing attention as a way to improve the accuracy of the wind control decision system.
In the existing optimization process, wind control decision rules are generally optimized after business experts analyze historical data based on manual experience. However, different business experts give inconsistent suggestions for the optimization direction of the rules in the same wind control decision system, so the optimization efficiency of wind control decision rules is low.
Disclosure of Invention
The embodiment of the invention aims to provide a method, a system, a terminal and a storage medium for optimizing a wind control decision rule, and aims to solve the problem that the existing wind control decision rule is low in optimization efficiency.
The embodiment of the invention is realized in such a way that a method for optimizing a wind control decision rule comprises the following steps:
constructing a meta-task pool, collecting historical data of a wind control decision system in the meta-task pool, constructing a simulation environment of the wind control decision system according to the historical data of the wind control decision system, and extracting the wind control decision system in the meta-task pool to obtain a meta-task;
according to a preset wind control strategy, performing interactive exploration on a meta-intelligent agent and the simulation environment of a wind control decision system corresponding to each meta-task to obtain an interactive track data set, wherein the interactive track data set comprises the exploration track of the meta-intelligent agent in the simulation environment of each wind control decision system, and the exploration track is used for representing the state change process of the meta-intelligent agent in the simulation environment corresponding to the wind control decision system;
performing model fitting training on a model according to the interaction track data set, and performing model prediction control processing on the model after model fitting training to obtain an updating strategy;
updating the preset wind control strategy according to the updating strategy, and returning to execute the step of respectively carrying out interactive exploration on the meta-intelligent agent and each sample wind control decision system and the subsequent steps according to the updated preset wind control strategy until the model meets the convergence condition;
carrying out parameter setting on the meta-intelligent agent according to the converged parameters of the model, and carrying out interactive exploration on the meta-intelligent agent after the parameters are set and a wind control decision system to be optimized to obtain a target exploration track;
and optimizing the wind control decision rule in the wind control decision system to be optimized according to the target exploration track.
Further, the training of model fitting of the model according to the interaction trajectory data set includes:
extracting time points from the interaction track data set to obtain target time points, and determining a model training set and a model testing set in the interaction track data set according to the target time points;
and performing model fitting on the model according to the model training set, and performing parameter updating on the fitted model according to the model testing set to obtain the model after model fitting training.
Further, the determining a model training set and a model testing set in the interaction trajectory data set according to the target time point includes:
respectively acquiring track data between a first preset number of time points before each target time point and the target time point in each exploration track, and generating the model training set according to the acquired track data;
and respectively acquiring track data from each target time point to a second preset number of time points after the target time point in each exploration track, and generating the model test set according to the acquired track data.
Further, the performing model predictive control processing on the model after model fitting training to obtain an update strategy includes:
obtaining label data of the model training set, and performing feedback calculation on output data of the model training set according to the label data and the model to obtain a return function value;
and obtaining model parameters of the model after model fitting training, and performing model prediction control processing on the obtained model parameters and the return function values to obtain the updating strategy.
Further, the optimizing the wind control decision rule in the wind control decision system to be optimized according to the target exploration track includes:
updating parameters of the converged model according to the target exploration track to obtain target parameters, and setting parameters of the meta-intelligent agent according to the target parameters to obtain a target meta-intelligent agent;
acquiring state information in a wind control decision rule in the wind control decision system to be optimized, wherein the state information comprises a user data value of user data in the wind control decision system to be optimized and a rule threshold value in the wind control decision rule;
and inputting the state information into the target meta-intelligent agent for rule optimization to obtain an optimization threshold, and updating the wind control decision rule in the wind control decision system to be optimized according to the optimization threshold.
Furthermore, after the updating of the wind control decision rule in the wind control decision system to be optimized according to the optimization threshold, the method further includes:
calculating an iterative return function value according to the rule threshold and the optimization threshold, and acquiring state information in a wind control decision rule in the wind control decision system to be optimized after rule optimization;
and updating parameters of the target meta-intelligent agent according to the acquired state information and the iteration return function value.
Further, the performing interactive exploration on the meta-intelligent agent and the simulation environment of the wind control decision system corresponding to each meta-task according to the preset wind control strategy to obtain an interactive trajectory data set includes:
respectively acquiring the simulation environment of a wind control decision system corresponding to the meta-task extracted from the meta-task pool, and initializing the meta-agent;
and carrying out interactive exploration on the initialized meta-intelligent agent in the simulation environment of the wind control decision system corresponding to each meta-task to obtain the interactive trajectory data set.
Another objective of an embodiment of the present invention is to provide a system for optimizing a wind control decision rule, where the system includes:
the task extraction module is used for constructing a meta-task pool, collecting historical data of a wind control decision system in the meta-task pool, constructing a simulation environment of the wind control decision system according to the historical data of the wind control decision system, and extracting the wind control decision system in the meta-task pool to obtain a meta-task;
the interactive exploration module is used for interactively exploring a meta-intelligent agent and the simulation environment of the wind control decision system corresponding to each meta-task according to a preset wind control strategy to obtain an interactive track data set, wherein the interactive track data set comprises an exploration track of the meta-intelligent agent in the simulation environment of each wind control decision system, and the exploration track is used for representing the state change process of the meta-intelligent agent in the simulation environment corresponding to the wind control decision system;
the model fitting module is used for performing model fitting training on a model according to the interaction track data set and performing model prediction control processing on the model after the model fitting training to obtain an updating strategy;
the strategy updating module is used for updating the preset wind control strategy according to the updating strategy and returning to execute the step of respectively carrying out interactive exploration on the meta-intelligent agent and each sample wind control decision system and the subsequent steps according to the updated preset wind control strategy until the model meets the convergence condition;
the parameter setting module is used for carrying out parameter setting on the meta-intelligent agent according to the converged parameters of the model and carrying out interactive exploration on the meta-intelligent agent after the parameter setting and the wind control decision system to be optimized to obtain a target exploration track;
and the rule optimization module is used for optimizing the wind control decision rule in the wind control decision system to be optimized according to the target exploration track.
It is another object of the embodiments of the present invention to provide a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method when executing the computer program.
It is a further object of embodiments of the present invention to provide a computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the above-mentioned method steps.
In the method, the meta-agent performs interactive exploration with each sample wind control decision system to obtain the exploration trajectory of the meta-agent in each sample environment; a model is given model fitting training based on the trajectory data set, so that the trained model can effectively output the optimal action decision for each sample wind control decision system environment; model predictive control processing is performed on the trained model to obtain a wind control rule strategy for the optimal action decision, namely the update strategy; with the updated preset wind control strategy, the interactive exploration step and its subsequent steps are executed again, achieving iterative training of the model; the parameters of the meta-agent are set based on the parameters of the converged model, so that the parameter-set meta-agent can effectively output the optimal action decision for the environment of the wind control decision system to be optimized; and the parameter-set meta-agent performs interactive exploration with the wind control decision system to be optimized, so that the wind control decision rules of that system are optimized automatically based on the state changes of the meta-agent in its environment. In this way, the method and system can automatically optimize the wind control decision rules in the wind control decision system to be optimized, adaptively learn the optimal optimization strategy suitable for any wind control decision system, reduce the manpower and material resources required, and improve the optimization efficiency of the wind control decision rules.
Drawings
Fig. 1 is a flowchart of a wind control decision rule optimization method according to a first embodiment of the present invention;
fig. 2 is a flowchart of a wind control decision rule optimization method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a wind control decision rule optimization system according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Example one
Referring to fig. 1, a flowchart of a wind control decision rule optimization method according to a first embodiment of the present invention is shown, where the wind control decision rule optimization method can be applied to any terminal device or system, and the wind control decision rule optimization method includes the steps of:
step S10, constructing a meta-task pool, collecting historical data of a wind control decision system in the meta-task pool, constructing a simulation environment of the wind control decision system according to the historical data of the wind control decision system, and extracting the wind control decision system in the meta-task pool to obtain a meta-task;
in this embodiment, first, historical data of each wind control decision system is collected, where the historical data includes historical threshold adjustment data, historical user characteristic data, historical decision data, historical admittance user behavior data, and the like. After the historical data of each wind control decision system is collected, the simulation environment of the wind control decision system is constructed by utilizing a Gym application programming interface of OpenAI and combining the historical data.
In this embodiment, after the simulation environment of the wind control decision system is constructed, the basic elements of the Markov decision process are constructed. A sequential decision problem is generally defined by a Markov Decision Process (MDP), which includes the following elements: State (S), Action (A), Reward (R), and State Transition Probability Distribution (P). For example, in a general wind control decision system, the state S may consist of the rule thresholds of the wind control decision rules and the user data, such as the age of the current user and the age threshold set by the current wind control decision rule. The input user data may be single or batch data: single data mainly comprises the feature data of a single user or features extracted from batch users, while batch data is organized as a list and fed in through a multi-head input together with the current rule thresholds. The action A is the numerical adjustment of a rule threshold by the wind control decision system, for example increasing, decreasing or keeping unchanged the upper age limit set by the system; different step-size lists can be set within the adjustment range according to the elements the rule concerns, e.g. a base step size of 1 for age and a base step size of 3 for access frequency.
The reward R is a system return function designed according to the optimization target, and computing it requires the labels of the sample data for feedback calculation. If the goal of the wind control decision system is to reduce the bad-customer rate, then when a customer admitted under the current rule thresholds triggers a judgment inconsistent with the label of the historical sample data, the system receives negative reward feedback; conversely, if the judgment is consistent with the label, the system receives positive feedback. If S contains batch data, the reward feedback is calculated for each record, and the overall reward of the batch of users is obtained by averaging, taking the maximum, or taking the minimum. The state transition probability distribution P is the probability distribution of transferring from the current state S to a new state S' after action A, that is, P(S'|S, A); once the elements S, A, R and P are known, the optimal solution of the MDP, namely the optimal sequential decision policy (Policy, denoted π), can be solved.
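For the batch-data case just described, the per-record feedback and its aggregation might look like the following sketch; the ±1 feedback values and the function name are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

def batch_return(admitted_labels, mode="mean"):
    """Feedback for a batch of admitted customers: positive for a good
    customer (label 0), negative for a bad one (label 1), aggregated by
    averaging, maximum or minimum as described above."""
    per_user = np.where(np.asarray(admitted_labels) == 0, 1.0, -1.0)
    agg = {"mean": np.mean, "max": np.max, "min": np.min}[mode]
    return float(agg(per_user))
```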
In this embodiment, after the basic elements of the Markov decision process are constructed, a meta reinforcement learning framework is constructed. The meta reinforcement learning framework is composed of a reinforcement learning framework and a meta learning framework. The reinforcement learning framework mainly consists of two structures, the Agent and the Environment; in the application scenario of wind control decision system optimization, the Environment is the wind control decision system to be optimized to which the wind control decision rule optimization method is applied, and the Agent is the component that guides the optimization of the wind control decision system to be optimized.
In this step, the meta learning framework based on reinforcement learning uses N existing reinforcement learning tasks with the same structure (Task, denoted T_i, i = 1, …, N) to train a meta-agent, so that when facing a new task T' the meta-agent can quickly adapt to it with only a small amount of interaction with the new task environment. The meta-task pool in this step is the set of reinforcement learning tasks T_i corresponding to the respective wind control decision systems, and a meta-task is obtained from the meta-task pool by random extraction; each meta-task corresponds to the simulation environment of the wind control decision system on which the reinforcement learning task depends.
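A minimal sketch of the meta-task pool and its random extraction, assuming each task T_i is represented directly by its simulation environment (the helper names are hypothetical):

```python
import random

def build_meta_task_pool(env_factories):
    """One reinforcement learning task T_i per wind control decision system;
    each factory constructs that system's simulation environment."""
    return [factory() for factory in env_factories]

def sample_meta_tasks(task_pool, k):
    """Randomly extract k meta-tasks (simulation environments) from the pool."""
    return random.sample(task_pool, k)
```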
Step S20, performing interactive exploration on the meta-intelligent agent and the simulation environment of the wind control decision system corresponding to each meta-task respectively according to a preset wind control strategy to obtain an interactive track data set;
the method comprises the steps that a meta-intelligent body is interactively explored with simulation environments of wind control decision systems corresponding to various meta-tasks respectively to obtain interactive exploration tracks of the meta-intelligent body in the simulation environments of the wind control decision systems, the preset wind control strategy can be set as a current wind control decision strategy in the wind control decision systems to be optimized, a track data set comprises the exploration tracks of the meta-intelligent body in the simulation environments of the wind control decision systems, and the exploration tracks are used for representing state change processes of the meta-intelligent body in the simulation environments of the corresponding wind control decision systems; in the step, the meta-agent interacts with the simulation environment of each wind control decision system, and learns the optimal decision strategy (Policy, abbreviated as pi) of the Markov decision process in a trial and error manner, so that corresponding action A can be made according to the current state S, the return R accumulated for a long time is maximized, and the purpose of automatic iterative optimization is achieved.
In the wind control decision rule optimization application scenario, the meta-task is to obtain, through reinforcement learning, a system-optimization decision agent (meta-agent) for the wind control decision system corresponding to the meta-task, and the new task is to apply that agent to a wind control decision system to be optimized, which may be the same as or different from the systems in the training set.
In the meta-learning scenario, every task T needs to have the same structure, while different reinforcement learning tasks often have different MDPs, i.e. different settings of S, A, R and P. Therefore, in the meta reinforcement learning framework, the MDP task structures must be unified. In the wind control decision rule optimization application scenario, assume there are n wind control decision systems, and let the set of all wind control decision systems be E = {E_i, i = 1, …, n}. Let D_i be the set of all wind control decision rules of system E_i; the rule sets of different decision systems are not identical, i.e. D_i ≠ D_j for i ≠ j. In practice, however, since all of these are wind control rule application scenarios, the rules of different wind control decision systems overlap to a certain extent, for example rules on loan amount, number of outstanding loans, number of bank cards and so on. Let the set of wind control decision rules common to all systems be C = {c_i, i = 1, …, k}, and let the set of rules specific to system E_i be D_i' = {d_ij, j = 1, …, m_i}; then the set of all wind control decision rules of system E_i is D_i = {d_i1, …, d_im_i, c_1, …, c_k}, and the set of all wind control decision rules of all systems E is D = (∪ D_i) ∪ C = {d_im, i = 1, …, n, m = 1, …, m_i} ∪ C. Thus, for any wind control decision system E_i, its rule set can be regarded as D: for a rule in D that does not exist in system E_i, the threshold range can be set to positive or negative infinity so that it never affects the operation of the other rules of E_i. In this way, the MDP task structures of all the wind control decision systems are unified.
Based on this, the meta-task T_i can be described as follows: for the wind control decision system E_i, train a meta-agent (Agent) so that, according to the state S formed by the thresholds of all the wind control decision rules D and the current user features, the meta-agent takes a threshold adjustment action A on the wind control decision rules D that maximizes the long-term accumulated reward R. The action A is defined piecewise over the unified rule set: for a rule that exists in system E_i, A adjusts its threshold by the corresponding step size; for a rule in D that does not exist in the sample wind control decision system E_i, no threshold adjustment is performed. If the optimal solution of this MDP can be obtained, the optimal sequential decision strategy suitable for the MDP, namely the optimal decision agent, is obtained.
Optionally, in this step, according to a preset wind control policy, performing interactive exploration on the meta-intelligent agent and the simulation environment of the wind control decision system corresponding to each meta-task, to obtain an interactive trajectory data set, including:
respectively acquiring the simulation environment of a wind control decision system corresponding to the meta-task extracted from the meta-task pool, and initializing the meta-agent;
and carrying out interactive exploration on the initialized meta-intelligent agent in the simulation environment of the wind control decision system corresponding to each meta-task to obtain the interactive trajectory data set.
Specifically, in this embodiment, the simulation environments of the wind control decision systems are first used so that the meta-agent can continuously and iteratively optimize itself through its interaction with them. Secondly, feature engineering is required to extract the features of the state S as the input of the training algorithm; the feature engineering generally sets the customer features and the rule features according to domain knowledge, such as the customer's number of outstanding transactions and the range for that number set by the rule. After the environment and features are set, training of the meta-agent begins.
The training of the meta-agent mainly comprises three steps: sample generation, model fitting and strategy improvement. These three steps cycle continuously during training until the preset number of training rounds is reached:
Sample generation: first, k tasks are randomly extracted from the meta-task pool, and the simulation environment E_i of the wind control decision system corresponding to each extracted task T_i is used. After the meta-agent is initialized, it interactively explores the simulation environment E_i of the wind control decision system according to the preset wind control strategy. During exploration, the agent continuously reaches new states from its current state after taking actions, generating an exploration trajectory τ_i = (s_1, a_1, …, s_n, a_n, s_{n+1}). The meta-agent generates the exploration trajectories τ_1, …, τ_k in the extracted task environments in turn, and the sequentially generated trajectories are concatenated to obtain the trajectory data set τ_E used for the current round of model fitting.
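A sketch of one such sample-generation round, assuming the meta-agent exposes an act() method that implements the current preset wind control strategy π (an illustrative interface, not one defined in the patent):

```python
def generate_samples(meta_agent, envs, horizon):
    """Run the current policy in each sampled task environment and
    concatenate the exploration trajectories tau_1, ..., tau_k into
    the fitting dataset tau_E of (s, a, r, s') transitions."""
    tau_E = []
    for env in envs:
        s = env.reset()
        for _ in range(horizon):
            a = meta_agent.act(s)                 # preset wind control strategy
            s_next, r, done, _ = env.step(a)
            tau_E.append((s, a, r, s_next))
            s = s_next
            if done:
                break
    return tau_E
```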
Step S30, performing model fitting training on a model according to the interaction track data set, and performing model prediction control processing on the model after model fitting training to obtain an updating strategy;
In this embodiment, the MDP is solvable when all the elements S, A, R and P are known, but in practice P is unknown; that is, the probability distribution of transferring to state S' from state S via action A is unknown. In this step a model is used to approximate this probability distribution, i.e. f_θ(s'|s, a) ≈ p(s'|s, a). The obtained approximate probability distribution function f_θ(s'|s, a) and the return function R are input into a Model Predictive Control (MPC) algorithm to obtain the optimal action A for the state S, so that the meta-agent can make an optimal action decision strategy π for the state of the current sample wind control decision system in real time.
In this step, for the meta-task T_i, only the approximate probability distribution function f_θi(s'|s, a) needs to be obtained in order to get a meta-agent G_i applicable to that task. The probability distribution function can be fitted with a linear model, such as linear regression, or with a nonlinear model, such as a neural network; the choice of model depends on the complexity of the task. Fitting the probability distribution function by model fitting in this step means performing model fitting training on the model with the trajectory data set, so that the trained model can effectively represent the approximate probability distribution function f_θi(s'|s, a). Similarly, for the set of all tasks T, an approximate probability distribution function f_θ*(s'|s, a) also needs to be obtained to get the meta-agent G, so that with only a small amount of interaction with a new task environment, i.e. through φ(θ*, Data_adapt), the parameter θ* is updated to θ' and the approximate probability distribution function f_θ'(s'|s, a) adapts to the current new task environment, allowing the meta-agent G' based on this function to make optimal action decisions for the current environment.
Optionally, in this step, the performing model fitting training on the model according to the interaction trajectory data set includes:
extracting time points from the interaction track data set to obtain target time points, and determining a model training set and a model testing set in the interaction track data set according to the target time points;
performing model fitting on the model according to the model training set, and performing parameter updating on the fitted model according to the model testing set to obtain the model after model fitting training;
Here, the model f_θ(s'|s, a) is fitted using the model training set, the parameter θ of the fitted model is updated to θ' using the model test set, and the updated model f_θ'(s'|s, a) together with the return function R is input into the MPC controller to obtain the update strategy π'.
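The patent does not spell out the MPC procedure; one common realization is random shooting, sketched below under that assumption: sample candidate action sequences, roll them through the fitted model f_θ'(s'|s, a), score them with the return function R, and execute the first action of the best sequence.

```python
import numpy as np

def mpc_action(dynamics, reward_fn, state, sample_action,
               horizon=5, n_candidates=128):
    """Random-shooting model predictive control. `dynamics(s, a)` is the
    fitted transition model f_theta'(s'|s, a) used as a point predictor,
    `reward_fn(s, a, s_next)` is the return function R, and
    `sample_action()` draws a random candidate action; all three are
    assumed interfaces for illustration."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        s, total = state, 0.0
        first_action = None
        for t in range(horizon):
            a = sample_action()
            if t == 0:
                first_action = a
            s_next = dynamics(s, a)
            total += reward_fn(s, a, s_next)
            s = s_next
        if total > best_return:
            best_return, best_first_action = total, first_action
    return best_first_action
```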
Further, in this step, the determining a model training set and a model testing set in the interaction trajectory data set according to the target time point includes:
respectively acquiring track data between a first preset number of time points before each target time point and the target time point in each exploration track, and generating the model training set according to the acquired track data;
respectively acquiring track data between each target time point and a second preset number of time points after the target time point in each exploration track, and generating the model test set according to the acquired track data;
The first preset number and the second preset number of time points can be set as required. For example, in this step, k time points t_i are randomly extracted from the trajectory data set τ_E as target time points; the trajectory of the a time points before each target time point, τ_E(t_i - a, t_i - 1), is taken as the model training set, and the trajectory of the b time points after that time point, τ_E(t_i, t_i + b), is taken as the model test set.
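A sketch of this time-point-based split, treating τ_E as a flat list of transitions and sampling only time points with enough history before and room after (a simplification for illustration):

```python
import random

def split_by_time_points(tau_E, k, a, b):
    """Randomly draw k target time points t_i and take tau_E(t_i - a, t_i - 1)
    as training data and tau_E(t_i, t_i + b) as test data."""
    train, test = [], []
    for t in random.sample(range(a, len(tau_E) - b), k):
        train.extend(tau_E[t - a:t])
        test.extend(tau_E[t:t + b])
    return train, test
```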
Further, the performing model predictive control processing on the model after model fitting training to obtain an update strategy includes:
obtaining label data of the model training set, and performing feedback calculation on output data of the model training set according to the label data and the model to obtain a return function value;
obtaining model parameters of the model after model fitting training, and performing model predictive control processing on the obtained model parameters and the return function values to obtain the update strategy; here, the model parameters of the model f_θ'(s'|s, a) and the value of the return function R are input into the MPC controller to obtain the update strategy π'.
Step S40, updating the preset wind control strategy according to the updating strategy, and returning to execute the step of performing interactive exploration on the meta-intelligent agent and the simulation environment of the wind control decision system corresponding to each meta-task respectively and the subsequent steps according to the updated preset wind control strategy until the model meets the convergence condition;
the method comprises the steps of carrying out model prediction control processing on a model after model fitting training to obtain a wind control rule strategy aiming at an optimal action decision, obtaining an updating strategy, and carrying out a step of carrying out interactive exploration on a metaintelligent body and a simulation environment of a wind control decision system corresponding to each metatask respectively and subsequent steps in a return mode through the updated preset wind control strategy to achieve the effect of model iterative training.
S50, performing parameter setting on the meta-intelligent agent according to the converged parameters of the model, and performing interactive exploration on the meta-intelligent agent after the parameter setting and a wind control decision system to be optimized to obtain a target exploration track;
the method comprises the steps that a meta-intelligent body is subjected to parameter setting based on parameters of a converged model, so that the meta-intelligent body after the parameter setting can effectively output an optimal action decision aiming at the environment of a wind control decision system to be optimized, and the meta-intelligent body after the parameter setting and the wind control decision system to be optimized are subjected to interactive exploration to obtain the state change of the meta-intelligent body in the environment of the wind control decision system to be optimized;
step S60, optimizing a wind control decision rule in the wind control decision system to be optimized according to the target exploration track;
the wind control decision rule in the wind control decision system to be optimized can be automatically optimized based on the state change of the meta-intelligent agent in the environment of the wind control decision system to be optimized.
In this embodiment, the wind control decision rules in the wind control decision system to be optimized are optimized automatically by means of reinforcement learning; the optimization target can be customized for different business purposes, the optimal optimization strategy suitable for a given wind control decision system can be learned adaptively on any wind control decision system, and the traditional approach of relying on expert experience to manually analyze and iterate the rules is replaced.
Example two
Referring to fig. 2, it is a flowchart of a wind control decision rule optimization method according to a second embodiment of the present invention, where the method is used to further refine step S60, and includes the steps of:
step S61, updating parameters of the converged model according to the target exploration track to obtain target parameters, and setting parameters of the meta-intelligent agent according to the target parameters to obtain a target meta-intelligent agent;
when the metaintelligent agent with the set parameters is applied to the wind control decision system to be optimized, the metaintelligent agent and the wind control decision system to be optimized are interacted in multiple steps, the target exploration track tau of the metaintelligent agent is collected, and the approximate probability distribution function f of the target exploration track tau to the metaintelligent agent is utilized θ* Updating the parameters of (s '| s, a) to obtain a target parameter theta', thereby obtaining a new probability distribution function f θ‘ (s '| s, a), namely, obtaining a meta-intelligent agent G' suitable for the wind control decision system to be optimized;
step S62, acquiring state information in a wind control decision rule in the wind control decision system to be optimized;
the state information comprises a user data value of user data in the wind control decision system to be optimized and a rule threshold value in the wind control decision rule;
step S63, inputting the state information into the target meta-agent for rule optimization to obtain an optimization threshold, and updating a wind control decision rule in the wind control decision system to be optimized according to the optimization threshold;
the wind control decision system to be optimized inputs the current state S to the target meta-agent, and then adjusts the rule threshold value in the current wind control decision rule according to the guidance of the target meta-agent, so as to achieve the dynamic self-iteration effect of the wind control decision rule in the wind control decision system to be optimized.
Optionally, in this step, after the wind control decision rule in the wind control decision system to be optimized is updated according to the optimization threshold, the method further includes:
calculating an iterative return function value according to the rule threshold and the optimization threshold, and acquiring state information in a wind control decision rule in the wind control decision system to be optimized after rule optimization;
updating parameters of the target meta-intelligent agent according to the acquired state information and the iteration return function value;
The reward R obtained after taking action A and the new state S' are input into the target meta-agent, so that the target meta-agent continuously performs reinforcement learning training online and iteratively approximates the probability distribution function, allowing the optimal decision strategy to be learned most effectively.
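A sketch of the resulting online self-iteration loop, with an assumed observe() hook through which the reward R and new state S' feed the target meta-agent's continued online training:

```python
def online_rule_iteration(system_env, target_agent, n_steps):
    """Dynamic self-iteration on the system to be optimized: the system
    reports state S, the target meta-agent suggests the threshold
    adjustment A, and the pair (R, S') flows back for online learning."""
    s = system_env.reset()
    for _ in range(n_steps):
        a = target_agent.act(s)                    # optimization of rule thresholds
        s_next, r, done, _ = system_env.step(a)
        target_agent.observe(s, a, r, s_next)      # continued online RL training
        s = s_next
        if done:
            break
```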
In this embodiment, the current state S of the wind control decision system to be optimized is input into the target meta-agent, and the rule thresholds in the current wind control decision rules are adjusted through the action A it outputs, achieving the dynamic self-iteration of the wind control decision rules in the wind control decision system to be optimized.
EXAMPLE III
Referring to fig. 3, a schematic structural diagram of a wind control decision rule optimization system 100 according to a third embodiment of the present invention is shown, including: the task extraction module 10, the interaction exploration module 11, the model fitting module 12, the strategy updating module 13, the parameter setting module 14 and the rule optimization module 15, wherein:
the task extraction module 10 is configured to construct a meta-task pool, collect historical data for the wind control decision system in the meta-task pool, construct a simulation environment of the wind control decision system according to the historical data of the wind control decision system, and extract the wind control decision system in the meta-task pool to obtain a meta-task;
and the interactive exploration module 11 is configured to perform interactive exploration on the meta-intelligent agent and the simulation environments of the wind control decision systems corresponding to the meta-tasks respectively according to a preset wind control strategy to obtain an interactive trajectory data set, where the interactive trajectory data set includes the exploration trajectories of the meta-intelligent agent in the simulation environments of the wind control decision systems, and the exploration trajectories are used to represent the state change process of the meta-intelligent agent in the simulation environments corresponding to the wind control decision systems.
Wherein, the interaction exploration module 11 is further configured to: respectively acquiring simulation environments of wind control decision systems corresponding to meta tasks extracted from a meta task pool, and initializing the meta intelligent agents;
and carrying out interactive exploration on the initialized meta-intelligent agent in the simulation environment of the wind control decision system corresponding to each meta-task to obtain the interactive trajectory data set.
And the model fitting module 12 is used for performing model fitting training on the model according to the interaction trajectory data set, and performing model prediction control processing on the model after model fitting training to obtain an updating strategy.
Wherein the model fitting module 12 is further configured to: extracting time points from the interaction track data set to obtain target time points, and determining a model training set and a model testing set in the interaction track data set according to the target time points;
and performing model fitting on the model according to the model training set, and performing parameter updating on the fitted model according to the model testing set to obtain the model after model fitting training.
Further, the model fitting module 12 is further configured to: respectively acquiring track data between a first preset number of time points before each target time point and the target time point in each exploration track, and generating the model training set according to the acquired track data;
and respectively acquiring track data from each target time point to a second preset number of time points after the target time point in each exploration track, and generating the model test set according to the acquired track data.
Still further, model fitting module 12 is further configured to: obtaining label data of the model training set, and performing feedback calculation on output data of the model training set according to the label data and the model to obtain a return function value;
and obtaining model parameters of the model after model fitting training, and performing model prediction control processing on the obtained model parameters and the return function values to obtain the updating strategy.
And the strategy updating module 13 is configured to update the preset wind control strategy according to the updating strategy, and return to execute the step of performing interactive exploration on the meta-intelligent agent and each sample wind control decision system respectively and the subsequent steps according to the updated preset wind control strategy until the model meets the convergence condition.
And the parameter setting module 14 is configured to perform parameter setting on the meta-intelligent agent according to the converged parameters of the model, and perform interactive exploration on the meta-intelligent agent after the parameter setting and the wind control decision system to be optimized to obtain a target exploration track.
And the rule optimization module 15 is configured to optimize the wind control decision rule in the wind control decision system to be optimized according to the target exploration track.
Wherein, the rule optimization module 15 is further configured to: updating parameters of the converged model according to the target exploration track to obtain target parameters, and setting parameters of the meta-intelligent agent according to the target parameters to obtain a target meta-intelligent agent;
acquiring state information in a wind control decision rule in the wind control decision system to be optimized, wherein the state information comprises a user data value of user data in the wind control decision system to be optimized and a rule threshold value in the wind control decision rule;
and inputting the state information into the target meta-intelligent agent for rule optimization to obtain an optimization threshold, and updating the wind control decision rule in the wind control decision system to be optimized according to the optimization threshold.
Further, the rule optimization module 15 is further configured to: calculating an iterative return function value according to the rule threshold and the optimization threshold, and acquiring state information in a wind control decision rule in the wind control decision system to be optimized after rule optimization;
and updating parameters of the target meta-intelligent agent according to the acquired state information and the iteration return function value.
In this embodiment, the meta-agent interactively explores the simulation environment of the wind control decision system corresponding to each meta-task to obtain its exploration trajectory in each of those simulation environments; a model is given model fitting training based on the trajectory data set, so that the trained model can effectively output the optimal action decision for each sample wind control decision system environment; model predictive control processing is performed on the trained model to obtain a wind control rule strategy for the optimal action decision, namely the update strategy; with the updated preset wind control strategy, the interactive exploration step and its subsequent steps are executed again to achieve iterative training of the model; the parameters of the meta-agent are set based on the parameters of the converged model, so that the parameter-set meta-agent can effectively output the optimal action decision for the environment of the wind control decision system to be optimized; and the parameter-set meta-agent interactively explores the wind control decision system to be optimized, so that its wind control decision rules are optimized automatically based on the state changes of the meta-agent in that environment. With this embodiment, the wind control decision rules in the wind control decision system to be optimized can be optimized automatically, the optimal optimization strategy suitable for any wind control decision system can be learned adaptively, the manpower and material resources required are reduced, and the optimization efficiency of the wind control decision rules is improved.
Example four
Fig. 4 is a block diagram of a terminal device 2 according to a fourth embodiment of the present application. As shown in fig. 4, the terminal device 2 of this embodiment includes: a processor 20, a memory 21 and a computer program 22, such as a program of a wind control decision rule optimization method, stored in said memory 21 and executable on said processor 20. The processor 20, when executing the computer program 22, implements the steps in the various embodiments of the wind control decision rule optimization methods described above.
Illustratively, the computer program 22 may be partitioned into one or more modules that are stored in the memory 21 and executed by the processor 20 to accomplish the present application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 22 in the terminal device 2. The terminal device may include, but is not limited to, a processor 20, a memory 21.
The Processor 20 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 21 may be an internal storage unit of the terminal device 2, such as a hard disk or a memory of the terminal device 2. The memory 21 may also be an external storage device of the terminal device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the terminal device 2. Further, the memory 21 may also include both an internal storage unit and an external storage device of the terminal device 2. The memory 21 is used for storing the computer program and other programs and data required by the terminal device. The memory 21 may also be used to temporarily store data that has been output or is to be output.
In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated module, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. Based on this understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable storage medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable storage media do not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method for optimizing a wind control decision rule, the method comprising:
constructing a meta-task pool, collecting historical data of a wind control decision system in the meta-task pool, constructing a simulation environment of the wind control decision system according to the historical data of the wind control decision system, and extracting the wind control decision system in the meta-task pool to obtain a meta-task;
according to a preset wind control strategy, performing interactive exploration on a meta-intelligent agent and the simulation environment of a wind control decision system corresponding to each meta-task to obtain an interactive track data set, wherein the interactive track data set comprises the exploration track of the meta-intelligent agent in the simulation environment of each wind control decision system, and the exploration track is used for representing the state change process of the meta-intelligent agent in the simulation environment corresponding to the wind control decision system;
performing model fitting training on a model according to the interaction track data set, and performing model prediction control processing on the model after model fitting training to obtain an updating strategy;
updating the preset wind control strategy according to the updating strategy, and returning to execute the step of respectively carrying out interactive exploration on the meta-intelligent agent and each sample wind control decision system and the subsequent steps according to the updated preset wind control strategy until the model meets the convergence condition;
carrying out parameter setting on the meta-intelligent agent according to the converged parameters of the model, and carrying out interactive exploration on the meta-intelligent agent after the parameters are set and a wind control decision system to be optimized to obtain a target exploration track;
and optimizing the wind control decision rule in the wind control decision system to be optimized according to the target exploration track.
2. The wind control decision rule optimization method of claim 1, wherein the model fitting training of a model according to the interaction trajectory data set comprises:
extracting time points from the interaction track data set to obtain target time points, and determining a model training set and a model testing set in the interaction track data set according to the target time points;
and fitting the model according to the model training set, and updating parameters of the fitted model according to the model testing set to obtain the model after model fitting training.
3. The wind control decision rule optimization method of claim 2, wherein the determining a model training set and a model test set in the interactive trajectory data set according to the target time points comprises:
for each exploration trajectory, acquiring the trajectory data from the time point that is a first preset number of time points before each target time point up to that target time point, and generating the model training set from the acquired trajectory data; and
for each exploration trajectory, acquiring the trajectory data from each target time point up to the time point that is a second preset number of time points after it, and generating the model test set from the acquired trajectory data.
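The split in claims 2-3 is mechanical enough to pin down in a few lines. The sketch below assumes one reading of the window boundaries (training data up to but excluding the target time point, test data from it onward); n_before and n_after play the roles of the first and second preset numbers and are illustrative values.

```python
import numpy as np

def split_by_target_points(trajectory, target_points, n_before=5, n_after=3):
    # For each target time point t, take the n_before steps preceding t
    # as training data and the n_after steps from t onward as test data.
    train, test = [], []
    for t in target_points:
        train.append(trajectory[max(0, t - n_before):t])
        test.append(trajectory[t:t + n_after])
    return np.concatenate(train), np.concatenate(test)

# Toy usage: one exploration trajectory of 20 steps, 4-dimensional states.
traj = np.arange(80.0).reshape(20, 4)
train_set, test_set = split_by_target_points(traj, target_points=[6, 14])
print(train_set.shape, test_set.shape)  # (10, 4) (6, 4)
```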
4. The wind control decision rule optimization method of claim 2, wherein the performing model predictive control processing on the model after model fitting training to obtain an update strategy comprises:
obtaining the label data of the model training set, and performing feedback calculation on the output data of the model training set according to the label data and the model to obtain a return function value; and
obtaining the model parameters of the model after model fitting training, and performing model predictive control processing on the obtained model parameters and the return function value to obtain the update strategy.
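Claim 4 couples a return function value computed against label data with model predictive control over the fitted parameters. Random-shooting MPC over a fitted linear model is one standard way to realize that coupling; the quadratic return and all symbol names below are assumptions, since the claim fixes neither.

```python
import numpy as np

rng = np.random.default_rng(1)

def return_value(output, label):
    # Feedback calculation: a stand-in return function value, the
    # negative squared error between model output and label data.
    return -float(np.sum((output - label) ** 2))

def mpc_update(W, state, horizon=5, n_candidates=64):
    # Random-shooting model predictive control over the fitted model
    # s' = W @ [s; a]: simulate candidate action sequences, score them
    # with the return function value, keep the best first action.
    dim_s = state.size
    dim_a = W.shape[1] - dim_s
    target = np.zeros(dim_s)                 # label: drive the state to 0
    best_ret, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        plan = rng.uniform(-1.0, 1.0, size=(horizon, dim_a))
        s, ret = state.copy(), 0.0
        for a in plan:
            s = W @ np.concatenate([s, a])   # model-predicted next state
            ret += return_value(s, target)
        if ret > best_ret:
            best_ret, best_first_action = ret, plan[0]
    return best_first_action                 # seed of the update strategy

W = rng.normal(scale=0.2, size=(3, 5))       # fitted model parameters
print(mpc_update(W, np.ones(3)))
```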
5. The wind control decision rule optimization method of claim 1, wherein the optimizing the wind control decision rule in the wind control decision system to be optimized according to the target exploration trajectory comprises:
updating the parameters of the converged model according to the target exploration trajectory to obtain target parameters, and setting the parameters of the meta-intelligent agent according to the target parameters to obtain a target meta-intelligent agent;
acquiring state information of the wind control decision rule in the wind control decision system to be optimized, wherein the state information comprises the user data values of the user data in the wind control decision system to be optimized and the rule threshold in the wind control decision rule; and
inputting the state information into the target meta-intelligent agent for rule optimization to obtain an optimized threshold, and updating the wind control decision rule in the wind control decision system to be optimized according to the optimized threshold.
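As a concrete reading of claim 5, the sketch below treats the state information as the concatenation of user data values and current rule thresholds, and the target meta-intelligent agent as a linear map whose weight matrix stands in for the target parameters; all names, dimensions, and the clipping range are illustrative assumptions.

```python
import numpy as np

def optimize_rule(agent_w, user_values, rule_thresholds):
    # Rule optimization: feed the state information (user data values
    # plus current rule thresholds) through the target meta-intelligent
    # agent to obtain the optimized thresholds.
    state = np.concatenate([user_values, rule_thresholds])
    optimized = agent_w @ state
    return np.clip(optimized, 0.0, 1.0)  # keep thresholds in a valid range

rng = np.random.default_rng(2)
user_values = np.array([0.62, 0.35])            # e.g. normalized risk scores
rule_thresholds = np.array([0.50, 0.70, 0.90])  # current rule cut-offs
agent_w = rng.normal(scale=0.3, size=(3, 5))    # stand-in target parameters
print(optimize_rule(agent_w, user_values, rule_thresholds))
```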
6. The wind control decision rule optimization method of claim 5, wherein, after the updating the wind control decision rule in the wind control decision system to be optimized according to the optimized threshold, the method further comprises:
calculating an iterative return function value according to the rule threshold and the optimized threshold, and acquiring the state information of the wind control decision rule in the rule-optimized wind control decision system to be optimized; and
updating the parameters of the target meta-intelligent agent according to the acquired state information and the iterative return function value.
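Claim 6 closes the loop online: score the threshold change, re-read the state, and correct the agent. One plausible toy realization, continuing the linear-agent sketch above, scales a rank-1 weight correction by the iterative return function value; the quadratic return and the learning rate are assumptions, not the patent's prescription.

```python
import numpy as np

def iterative_return(rule_thresholds, optimized):
    # Stand-in iterative return function value: penalize large jumps
    # between the old rule thresholds and the optimized ones.
    return -float(np.sum((optimized - rule_thresholds) ** 2))

def update_agent(agent_w, state, optimized, ret, lr=0.01):
    # One online correction of the target meta-intelligent agent's
    # parameters, scaled by the iterative return function value.
    return agent_w + lr * ret * np.outer(optimized, state)

rng = np.random.default_rng(3)
state = rng.uniform(size=5)                  # re-read state information
agent_w = rng.normal(scale=0.3, size=(3, 5))
optimized = agent_w @ state
ret = iterative_return(np.array([0.5, 0.7, 0.9]), optimized)
agent_w = update_agent(agent_w, state, optimized, ret)
print(ret, agent_w.shape)
```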
7. The wind control decision rule optimization method of any one of claims 1 to 6, wherein the performing, according to a preset wind control strategy, interactive exploration between the meta-intelligent agent and the simulation environment of the wind control decision system corresponding to each meta-task to obtain an interactive trajectory data set comprises:
acquiring the simulation environment of the wind control decision system corresponding to each meta-task extracted from the meta-task pool, and initializing the meta-intelligent agent; and
performing interactive exploration with the initialized meta-intelligent agent in the simulation environment of the wind control decision system corresponding to each meta-task to obtain the interactive trajectory data set.
8. A wind control decision rule optimization system, the system comprising:
a task extraction module, configured to construct a meta-task pool, collect historical data of each wind control decision system in the meta-task pool, construct a simulation environment for each wind control decision system according to its historical data, and extract wind control decision systems from the meta-task pool to obtain meta-tasks;
an interactive exploration module, configured to perform, according to a preset wind control strategy, interactive exploration between a meta-intelligent agent and the simulation environment of the wind control decision system corresponding to each meta-task to obtain an interactive trajectory data set, wherein the interactive trajectory data set comprises the exploration trajectory of the meta-intelligent agent in the simulation environment of each wind control decision system, and each exploration trajectory characterizes the state change process of the meta-intelligent agent in the simulation environment of the corresponding wind control decision system;
a model fitting module, configured to perform model fitting training on a model according to the interactive trajectory data set, and perform model predictive control processing on the model after model fitting training to obtain an update strategy;
a strategy update module, configured to update the preset wind control strategy according to the update strategy, and return, with the updated preset wind control strategy, to the step of performing interactive exploration between the meta-intelligent agent and the simulation environment of the wind control decision system corresponding to each meta-task and to the subsequent steps, until the model meets a convergence condition;
a parameter setting module, configured to set parameters of the meta-intelligent agent according to the parameters of the converged model, and perform interactive exploration between the parameter-set meta-intelligent agent and a wind control decision system to be optimized to obtain a target exploration trajectory; and
a rule optimization module, configured to optimize the wind control decision rule in the wind control decision system to be optimized according to the target exploration trajectory.
9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202211182023.XA 2022-09-27 2022-09-27 Wind control decision rule optimization method, system, terminal and storage medium Active CN115829717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211182023.XA CN115829717B (en) 2022-09-27 2022-09-27 Wind control decision rule optimization method, system, terminal and storage medium


Publications (2)

Publication Number Publication Date
CN115829717A (en) 2023-03-21
CN115829717B (en) 2023-09-19

Family

ID=85524007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211182023.XA Active CN115829717B (en) 2022-09-27 2022-09-27 Wind control decision rule optimization method, system, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN115829717B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144985A (en) * 2019-12-24 2020-05-12 北京每日优鲜电子商务有限公司 Unit transfer value adjusting method, unit transfer value adjusting device, computer equipment and storage medium
WO2022083029A1 (en) * 2020-10-19 2022-04-28 深圳大学 Decision-making method based on deep reinforcement learning
CA3151239A1 (en) * 2021-03-15 2022-09-15 Honeywell Limited Process controller with meta-reinforcement learning
CN113034290A (en) * 2021-03-25 2021-06-25 中山大学 Quantitative investment method and device based on expert track

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang Shengyi (蒋圣翊): "Robust Reinforcement Learning Algorithms Based on Transfer Learning and Meta-Learning", China Excellent Master's Theses Full-text Database, Information Science and Technology series, no. 5, pages 140-244 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117196823A (en) * 2023-09-08 2023-12-08 厦门国际银行股份有限公司 Wind control rule generation method, system and storage medium
CN117196823B (en) * 2023-09-08 2024-03-19 厦门国际银行股份有限公司 Wind control rule generation method, system and storage medium

Also Published As

Publication number Publication date
CN115829717B (en) 2023-09-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant