CN115829717B - Wind control decision rule optimization method, system, terminal and storage medium - Google Patents

Wind control decision rule optimization method, system, terminal and storage medium

Info

Publication number
CN115829717B
CN115829717B
Authority
CN
China
Prior art keywords
wind control
control decision
meta
model
exploration
Prior art date
Legal status
Active
Application number
CN202211182023.XA
Other languages
Chinese (zh)
Other versions
CN115829717A (en)
Inventor
吴婧
张志远
洪镇宇
Current Assignee
Xiamen International Bank Co ltd
Original Assignee
Xiamen International Bank Co ltd
Priority date
Filing date
Publication date
Application filed by Xiamen International Bank Co ltd filed Critical Xiamen International Bank Co ltd
Priority to CN202211182023.XA
Publication of CN115829717A
Application granted
Publication of CN115829717B
Active legal status
Anticipated expiration

Landscapes

  • Feedback Control In General (AREA)

Abstract

The invention provides a wind control decision rule optimization method, system, terminal and storage medium, wherein the method comprises the following steps: respectively performing interactive exploration between a meta-agent and the simulation environment of each wind control decision system to obtain an interaction track data set; performing model fitting training on a model according to the data set, and performing model predictive control processing on the trained model to obtain an update strategy; updating a preset wind control strategy according to the update strategy, and returning to the step of respectively performing interactive exploration between the meta-agent and the simulation environment of each wind control decision system and the subsequent steps until the model converges; setting the parameters of the meta-agent according to the parameters of the converged model, and performing interactive exploration between the parameter-set meta-agent and the wind control decision system to be optimized to obtain a target exploration track; and optimizing the wind control decision system to be optimized according to the target exploration track. The invention can automatically optimize the wind control decision rules in the wind control decision system to be optimized, reducing the required manpower and material resources and improving the optimization efficiency of wind control decision rules.

Description

Wind control decision rule optimization method, system, terminal and storage medium
Technical Field
The invention relates to the technical field of wind control decision making, and in particular to a wind control decision rule optimization method, system, terminal and storage medium.
Background
In the field of financial technology, with the development of artificial intelligence, a large number of intelligent algorithm models have been applied to wind control decision scenarios. To preserve the interpretability of decisions and the controllability of the process, rule-based wind control decision systems remain the cornerstone of every application scenario, and in the current financial environment many cutting-edge algorithm models mainly play an auxiliary role in helping the wind control decision system make decisions, improving its accuracy. Accordingly, the optimization of the wind control decision rules within a wind control decision system receives more and more attention.
In the conventional wind control decision rule optimization process, business experts analyze historical data and then optimize the wind control decision rules based on manual experience. However, different business experts often disagree on the optimization direction for the rules of the same wind control decision system, so the optimization efficiency of the wind control decision rules is low.
Disclosure of Invention
The embodiments of the invention aim to provide a wind control decision rule optimization method, system, terminal and storage medium, so as to solve the problem of low optimization efficiency in conventional wind control decision rule optimization.
The embodiment of the invention is realized in such a way that a wind control decision rule optimization method comprises the following steps:
constructing a meta-task pool, collecting historical data for the wind control decision systems in the meta-task pool, constructing a simulation environment of each wind control decision system according to the historical data of that wind control decision system, and extracting wind control decision systems from the meta-task pool to obtain meta-tasks;
according to a preset wind control strategy, respectively performing interactive exploration between the meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task to obtain an interaction track data set, wherein the interaction track data set comprises exploration tracks of the meta-agent in the simulation environments of the wind control decision systems, and an exploration track characterizes the state change process of the meta-agent in the simulation environment of the corresponding wind control decision system;
performing model fitting training on the model according to the interaction track data set, and performing model predictive control processing on the model after model fitting training to obtain an update strategy;
updating the preset wind control strategy according to the update strategy, and, according to the updated preset wind control strategy, returning to the step of respectively performing interactive exploration between the meta-agent and each sample wind control decision system and the subsequent steps, until the model meets the convergence condition;
setting the parameters of the meta-agent according to the parameters of the converged model, and performing interactive exploration between the parameter-set meta-agent and a wind control decision system to be optimized to obtain a target exploration track;
and optimizing the wind control decision rules in the wind control decision system to be optimized according to the target exploration track.
Further, the performing model fitting training on the model according to the interaction track data set comprises:
extracting time points from the interaction track data set to obtain a target time point, and determining a model training set and a model testing set in the interaction track data set according to the target time point;
and performing model fitting on the model according to the model training set, and performing parameter updating on the fitted model according to the model testing set to obtain the model after model fitting training.
Further, the determining the model training set and the model testing set in the interaction track data set according to the target time point includes:
respectively acquiring, in each exploration track, the track data from the first preset number of time points before each target time point up to the target time point, and generating the model training set according to the acquired track data;
and respectively acquiring, in each exploration track, the track data from each target time point to the second preset number of time points after the target time point, and generating the model test set according to the acquired track data.
Further, the model predictive control processing is performed on the model after model fitting training to obtain an update strategy, which includes:
acquiring label data of the model training set, and carrying out feedback calculation on output data of the model training set according to the label data and the model to obtain a return function value;
and obtaining model parameters of the model after model fitting training, and carrying out model prediction control processing on the obtained model parameters and the return function values to obtain the updating strategy.
Further, the optimizing the wind control decision rules in the wind control decision system to be optimized according to the target exploration track comprises:
updating the parameters of the converged model according to the target exploration track to obtain target parameters, and setting the parameters of the meta-agent according to the target parameters to obtain a target meta-agent;
acquiring state information in the wind control decision rules in the wind control decision system to be optimized, wherein the state information comprises the user data values of the user data in the wind control decision system to be optimized and the rule thresholds in the wind control decision rules;
and inputting the state information into the target meta-agent for rule optimization to obtain an optimization threshold, and updating the wind control decision rules in the wind control decision system to be optimized according to the optimization threshold.
Further, after the updating of the wind control decision rules in the wind control decision system to be optimized according to the optimization threshold, the method further comprises:
calculating an iterative return function value according to the rule threshold and the optimization threshold, and acquiring the state information in the wind control decision rules in the wind control decision system to be optimized after rule optimization;
and updating the parameters of the target meta-agent with the acquired state information and the iterative return function value.
Further, according to a preset wind control strategy, the performing interactive exploration between the meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task to obtain an interaction track data set comprises:
respectively acquiring the simulation environments of the wind control decision systems corresponding to the meta-tasks extracted from the meta-task pool, and initializing the meta-agent;
and performing interactive exploration of the initialized meta-agent in the simulation environment of the wind control decision system corresponding to each meta-task to obtain the interaction track data set.
It is another object of an embodiment of the present invention to provide a wind control decision rule optimization system, the system comprising:
a task extraction module, used for constructing a meta-task pool, collecting historical data for the wind control decision systems in the meta-task pool, constructing a simulation environment of each wind control decision system according to the historical data of that wind control decision system, and extracting a wind control decision system from the meta-task pool to obtain a meta-task;
an interaction exploration module, used for respectively performing interactive exploration between the meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task according to a preset wind control strategy to obtain an interaction track data set, wherein the interaction track data set comprises the exploration tracks of the meta-agent in the simulation environments of the wind control decision systems, and an exploration track characterizes the state change process of the meta-agent in the simulation environment of the corresponding wind control decision system;
a model fitting module, used for performing model fitting training on the model according to the interaction track data set, and performing model predictive control processing on the model after model fitting training to obtain an update strategy;
a policy updating module, used for updating the preset wind control strategy according to the update strategy, and, according to the updated preset wind control strategy, returning to executing the step of respectively performing interactive exploration between the meta-agent and each sample wind control decision system and the subsequent steps, until the model meets the convergence condition;
a parameter setting module, used for setting the parameters of the meta-agent according to the parameters of the converged model, and performing interactive exploration between the parameter-set meta-agent and the wind control decision system to be optimized to obtain a target exploration track;
and a rule optimization module, used for optimizing the wind control decision rules in the wind control decision system to be optimized according to the target exploration track.
It is a further object of an embodiment of the present invention to provide a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method described above when executing the computer program.
It is a further object of embodiments of the present invention to provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
In the embodiment of the invention, the meta-agent performs interactive exploration with each sample wind control decision system to obtain its exploration track in each sample wind control decision system environment; model fitting training is performed on the model based on the track data set, so that the trained model can effectively output an optimal action decision for each sample wind control decision system environment; model predictive control processing is performed on the trained model to obtain the wind control rule strategy for the optimal action decision, namely the update strategy; the step of respectively performing interactive exploration between the meta-agent and each sample wind control decision system and the subsequent steps are then executed again under the updated preset wind control strategy, achieving iterative training of the model. The parameters of the meta-agent are set based on the parameters of the converged model, so that the parameter-set meta-agent can effectively output an optimal action decision for the environment of the wind control decision system to be optimized; the parameter-set meta-agent then performs interactive exploration with the wind control decision system to be optimized, and the wind control decision rules can be automatically optimized based on the state changes of the meta-agent in that environment. This embodiment can automatically optimize the wind control decision rules in the wind control decision system to be optimized, can adaptively learn the optimal optimization strategy suitable for any wind control decision system, reduces the required manpower and material resources, and improves the optimization efficiency of the wind control decision rules.
Drawings
FIG. 1 is a flowchart of a method for optimizing a wind control decision rule according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a method for optimizing a wind control decision rule according to a second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a wind control decision rule optimization system according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
Example I
Referring to fig. 1, a flowchart of a wind control decision rule optimization method according to a first embodiment of the present invention is provided. The wind control decision rule optimization method may be applied to any terminal device or system, and comprises the steps of:
step S10, a meta-task pool is constructed, historical data are collected for a wind control decision system in the meta-task pool, a simulation environment of the wind control decision system is constructed according to the historical data of the wind control decision system, and the wind control decision system in the meta-task pool is extracted to obtain meta-tasks;
In this embodiment, first, the historical data of each wind control decision system is collected, including historical threshold adjustment data, historical user feature data, historical decision data, historical admitted-user behavior data, and so on. After the historical data of each wind control decision system has been collected, the OpenAI Gym application programming interface is used together with the historical data to construct the simulation environment of the wind control decision system.
In this embodiment, after constructing the simulation environment of the wind control decision system, the basic elements of the Markov decision process are constructed. A sequential decision problem is typically defined by a Markov decision process (Markov Decision Process, MDP); a standard MDP comprises the following elements: state (S), action (A), return (R), and state transition probability distribution (Transition Probability Distribution, P). For example, in a general wind control decision system, the state S may consist of the rule thresholds of the wind control decision rules and the user data, such as the age of the current user and the age threshold set by the current wind control decision rule. The input user data may be single or batch data: the single-data case mainly covers single-user feature data or features extracted from batch users, while batch data is built as a list and input multiple times together with the current rule threshold. The action A is the adjustment the wind control decision system applies to a rule threshold, for example increasing, decreasing, or keeping unchanged the age upper limit set by the system; the adjustment amplitude can be organized into different step lists according to the elements each rule concerns,
for example, a base step size of 1 for age, a base step size of 3 for access frequency, and so on. The return R is a system return function designed according to the optimization target, and its calculation requires the labels of the sample data for feedback. For example, if the goal of the wind control decision system is to reduce the bad-customer rate, then a customer admitted under the current rule threshold state who triggers a judgment inconsistent with the label of the historical sample data yields negative return feedback; conversely, if the judgment is consistent with the label, the wind control decision system obtains positive feedback. If S contains batch data, the feedback is calculated for each record and the comprehensive feedback of the batch of users is obtained by averaging, taking the maximum, taking the minimum, or similar aggregation. The state transition probability distribution P is the probability distribution of transitioning from the current state S to a new state S′ after taking action A in state S, i.e., P(S′|S, A). When the elements S, A, R, P are all known, the optimal solution of the MDP, i.e., the optimal sequential decision strategy (Policy, denoted π), can be solved.
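As a concrete illustration of these MDP elements, the following is a minimal sketch of how such a simulation environment might be assembled with the OpenAI Gym API mentioned above. The single "age" rule, the admit-if-below-threshold decision, and all names are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np
import gym
from gym import spaces

class RiskRuleEnv(gym.Env):
    """Toy simulation environment for one wind control decision system,
    built from historical (user_feature, label) records. The single 'age'
    rule and all names are illustrative assumptions, not the patent's code."""

    def __init__(self, users, labels, init_threshold=30.0, step=1.0, batch=32):
        self.users = np.asarray(users, dtype=float)   # e.g. user ages
        self.labels = np.asarray(labels)              # 1 = good customer, 0 = bad
        self.threshold, self.step, self.batch = init_threshold, step, batch
        self.action_space = spaces.Discrete(3)        # 0: lower, 1: keep, 2: raise
        self.observation_space = spaces.Box(-np.inf, np.inf,
                                            shape=(2,), dtype=np.float32)

    def _state(self, x):
        # state S = (current rule threshold, current user feature value)
        return np.array([self.threshold, x], dtype=np.float32)

    def reset(self):
        self._i = np.random.randint(len(self.users))
        return self._state(self.users[self._i])

    def step(self, action):
        self.threshold += (action - 1) * self.step    # adjust by the base step
        idx = np.random.randint(len(self.users), size=self.batch)
        # admit a user when feature <= threshold; +1 when the decision agrees
        # with the historical label, -1 otherwise, averaged over the batch
        admit = self.users[idx] <= self.threshold
        reward = float(np.mean(np.where(admit == (self.labels[idx] == 1),
                                        1.0, -1.0)))
        self._i = idx[0]
        return self._state(self.users[self._i]), reward, False, {}
```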
In this embodiment, after constructing the basic elements of the Markov decision process, a meta reinforcement learning framework is constructed. The meta reinforcement learning framework is composed of a reinforcement learning framework and a meta-learning framework; the reinforcement learning framework mainly consists of two structures, an agent (Agent) and an environment (Environment). In the application scenario of wind control decision system optimization, the environment is the wind control decision system to be optimized to which the wind control decision rule optimization method is applied, and the agent is the component that guides the optimization of the wind control decision system to be optimized.
In this step, the reinforcement-learning-based meta-learning framework uses N reinforcement learning tasks (Task, abbreviated T; the N tasks are T_i, i = 1, …, N) to train the meta-agent, so that when the meta-agent faces a new task T′, only a small amount of interaction with the new task environment is needed for the meta-agent to be quickly applied to the new task. The meta-task pool in this step is composed of the reinforcement learning tasks T_i corresponding to all the wind control decision systems; a meta-task is obtained from the meta-task pool by random extraction and corresponds to the wind control decision system simulation environment on which its reinforcement learning task depends.
Step S20, according to a preset wind control strategy, respectively performing interactive exploration between the meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task to obtain an interaction track data set;
Through the interactive exploration between the meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task, the interactive exploration track of the meta-agent in each simulation environment is obtained. The preset wind control strategy can be set to the current wind control decision strategy in the wind control decision system to be optimized. The track data set comprises the exploration tracks of the meta-agent in the simulation environments of the wind control decision systems, and an exploration track characterizes the state change process of the meta-agent in the simulation environment of the corresponding wind control decision system. In this step, by interacting with the simulation environment of each wind control decision system, the meta-agent learns the optimal decision strategy (Policy, denoted π) of the Markov decision process in a trial-and-error manner, so that it can take the corresponding action A according to the current state S and maximize the long-term accumulated return R, thereby achieving automatic iterative optimization.
In the application scenario of wind control decision rule optimization, a meta-task is to obtain, through reinforcement learning, a system-optimization decision agent (the meta-agent) for the wind control decision system corresponding to that meta-task, and a new task is to apply this agent to a wind control decision system to be optimized that may be the same as or different from those used as the training set.
In the meta-learning scenario, each task T needs to have the same structure, but different reinforcement learning tasks often have different MDPs, that is, different S, A, R, P settings, so a unified MDP task structure is needed in the meta reinforcement learning framework. In the wind control decision rule optimization application scenario, suppose there are n wind control decision systems; the set of all wind control decision systems is then E = {E_i, i = 1, …, n}. Let the set of all wind control decision rules of wind control decision system E_i be D_i; the wind control decision rule sets of different decision systems are not identical, i.e., D_i ≠ D_j. In practice, since these are all application scenarios of wind control rules, the rules of different wind control decision systems overlap to a certain extent, such as rules on the loan amount, the number of outstanding loans, the number of bank cards, and so on. Let the set of wind control decision rules common to all wind control decision systems be C = {c_i, i = 1, …, k}, and let the set of rules specific to wind control decision system E_i be D_i′ = {d_ij, j = 1, …, m_i}; then the set of all wind control decision rules of wind control decision system E_i is D_i = {d_i1, …, d_im_i, c_1, …, c_k}, and for the set E of all wind control decision systems, the set of all wind control decision rules is D = ∪ D_i = {d_im, i = 1, …, n, m = 1, …, m_i} ∪ C. Thus, for any wind control decision system E_i, its set of wind control decision rules can be regarded as D: for a rule in D that does not exist in wind control decision system E_i, the threshold range can be set to be infinitely large or infinitely small so that it has no influence on the effects of the other rules in E_i, and the MDP task structures of all the wind control decision systems are thereby unified.
Based on this, meta-task T_i can be described as: for wind control decision system E_i, train a meta-agent (Agent) so that, from the state S formed by the threshold characteristics of all wind control decision rules D and the current user characteristics, the meta-agent can complete the threshold adjustment action A on the wind control decision rules D so as to maximize the long-term accumulated return R, where action A makes no threshold adjustment for rules that do not exist in the sample wind control decision system E_i. If the optimal solution of this MDP can be obtained, the optimal sequential decision strategy applicable to the MDP, namely the optimal decision agent, can be obtained.
Optionally, in this step, according to a preset wind control strategy, the performing interactive exploration between the meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task to obtain an interaction track data set includes:
respectively acquiring the simulation environments of the wind control decision systems corresponding to the meta-tasks extracted from the meta-task pool, and initializing the meta-agent;
and performing interactive exploration of the initialized meta-agent in the simulation environment of the wind control decision system corresponding to each meta-task to obtain the interaction track data set.
Specifically, in this embodiment, first, the simulation environment of the wind control decision system to be optimized is used so that the meta-agent can continuously and iteratively optimize itself while interacting with it. Second, feature engineering is needed: the features of the state S are extracted and used as the input of the training algorithm. Feature engineering generally derives customer features and rule features from domain knowledge, such as a customer's number of outstanding loans and the range for that number set by the rules. After the environment and the features are set, training of the meta-agent begins.
The training of the meta-agent mainly comprises three steps: sample generation, model fitting, and strategy improvement; the three steps are cycled continuously during training until the preset number of training rounds is reached:
Sample generation: first, k tasks are randomly extracted from the meta-task pool, and the simulation environment E_i of the wind control decision system corresponding to each extracted task T_i is used. After the meta-agent is initialized, it performs interactive exploration in each simulation environment E_i according to the preset wind control strategy. During the exploration, the agent continuously reaches a new state after taking an action from its current state, generating an exploration track τ_i = (s_1, a_1, s_2, a_2, …, s_n, a_n, s_{n+1}). The meta-agent generates the exploration tracks τ_1, …, τ_k in the extracted task environments in turn, and the sequentially generated tracks are connected to obtain the track data set τ_E for the current round of model fitting.
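A minimal sketch of one sample-generation round follows, assuming a task pool of Gym-style environments and a meta_agent object with initialize/act methods; these interfaces are illustrative, not prescribed by the patent.

```python
import random

def generate_samples(task_pool, meta_agent, k, horizon):
    """One 'sample generation' round: draw k meta-tasks at random, let the
    re-initialised meta-agent explore each simulation environment under the
    current preset strategy, and connect the k exploration tracks into the
    round's fitting dataset tau_E."""
    tau_E = []
    for env in random.sample(task_pool, k):        # tasks T_1 .. T_k
        meta_agent.initialize()
        s = env.reset()
        track = []
        for _ in range(horizon):
            a = meta_agent.act(s)                  # preset wind control strategy
            s_next, r, done, _ = env.step(a)
            track.append((s, a, s_next))           # one transition of tau_i
            s = s_next
            if done:
                break
        tau_E.extend(track)                        # connect tracks in sequence
    return tau_E
```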
Step S30, model fitting training is carried out on the model according to the interaction track data set, and model prediction control processing is carried out on the model after the model fitting training, so that an updating strategy is obtained;
Model fitting training is performed on the model based on the track data set, so that the trained model can effectively output an optimal action decision for each sample wind control decision system environment. In this embodiment, the condition for the MDP to be solvable is that the elements S, A, R, P are all known; in practice, however, P is unknown, that is, the probability distribution of moving to state s′ through action a in state s is unknown. In this step a model is used to approximate this probability distribution, i.e., f_θ(s′|s, a) ≈ P(s′|s, a). The approximate probability distribution function f_θ(s′|s, a) and the return function R are input into a model predictive control (Model Predictive Control, MPC) algorithm to obtain the optimal action A for the state S, so that the meta-agent can produce in real time an optimal action decision strategy π for the current state of the sample wind control decision system.
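One common way to realize this MPC step is random shooting over candidate action sequences; the sketch below assumes a deterministic approximation s′ = f(s, a) of the learned transition distribution and a callable return function R(s, a), which are simplifications made for illustration.

```python
import numpy as np

def mpc_action(f, R, s, n_actions, horizon=5, n_candidates=200, rng=None):
    """Random-shooting model predictive control: roll candidate action
    sequences through the learned transition model (here a deterministic
    approximation s' = f(s, a)) and return the first action of the sequence
    with the highest predicted cumulative return."""
    rng = rng or np.random.default_rng()
    best_a, best_ret = None, -np.inf
    for _ in range(n_candidates):
        seq = rng.integers(n_actions, size=horizon)  # one candidate sequence
        s_sim, ret = s, 0.0
        for a in seq:
            ret += R(s_sim, a)       # return function designed from the labels
            s_sim = f(s_sim, a)      # model-predicted next state
        if ret > best_ret:
            best_ret, best_a = ret, int(seq[0])
    return best_a                    # optimal action A for the current state S
```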
In this step, for meta-task T_i, only the approximate probability distribution function for that task needs to be obtained in order to obtain the meta-agent G_i suited to it. The probability distribution function may be fitted with a linear model, such as linear regression, or a nonlinear model, such as a neural network; the choice of model depends on the complexity of the task. In this step the probability distribution function is fitted by model fitting, i.e., model fitting training is performed on the model with the track data set, so that the trained model effectively characterizes the approximate probability distribution function.
Likewise, it is also necessary to obtain an approximate probability distribution function f_θ*(s′|s, a) applicable to all tasks T, so that a meta-agent G is available that can adapt with only a small amount of interaction with a new task environment: through φ(θ*, Data_adapt) the parameter θ* is updated to θ′, so that the approximate probability distribution function f_θ′(s′|s, a) can be applied to the current new task environment, and the meta-agent G′ based on this function can make optimal action decisions for the current environment.
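The adaptation mapping φ(θ*, Data_adapt) is not spelled out in the text; a MAML-style inner loop of a few gradient steps on a short interaction batch is one plausible reading, sketched below under that assumption.

```python
import torch

def adapt(model, data_adapt, lr=1e-3, steps=5):
    """phi(theta*, Data_adapt): adapt the meta-trained transition model
    f_{theta*}(s'|s,a) to a new task with a few gradient steps on a small
    interaction batch, yielding theta'. A MAML-style inner loop is an
    assumption; the patent only states that little interaction suffices."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    s, a, s_next = data_adapt            # tensors from the short new-task track
    for _ in range(steps):
        pred = model(torch.cat([s, a], dim=-1))
        loss = torch.nn.functional.mse_loss(pred, s_next)  # fit s' = f(s, a)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model                          # parameters are now theta'
```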
Optionally, in this step, the performing model fitting training on the model according to the interaction track data set includes:
extracting time points from the interaction track data set to obtain a target time point, and determining a model training set and a model testing set in the interaction track data set according to the target time point;
performing model fitting on the model according to the model training set, and performing parameter updating on the fitted model according to the model testing set to obtain the model after model fitting training;
wherein model fitting is performed on the model f_θ(s′|s, a) with the model training set, the parameter θ of the fitted model is updated with the model test set to obtain θ′, and the updated model f_θ′(s′|s, a) and the return function R are input to the MPC controller to obtain the update strategy π′.
Further, in this step, the determining the model training set and the model testing set in the interaction track data set according to the target time point includes:
respectively acquiring, in each exploration track, the track data from the first preset number of time points before each target time point up to the target time point, and generating the model training set according to the acquired track data;
respectively acquiring, in each exploration track, the track data from each target time point to the second preset number of time points after the target time point, and generating the model test set according to the acquired track data;
wherein the first and second preset numbers of time points may be set as required. For example, in this step, k time points t_i are randomly extracted from the track data set τ_E as target time points; the track τ_E(t_i − a, t_i − 1) of the a time points before each target time point is taken as the model training set, and the track τ_E(t_i, t_i + b) of the b time points after the time point is taken as the model test set.
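Under the same notation, a sketch of the time-point-based split might look as follows; treating τ_E as a flat list of transitions and sampling only interior time points are assumptions made for illustration.

```python
import numpy as np

def split_by_time_points(tau_E, k, a, b, rng=None):
    """Draw k target time points t_i from the track dataset tau_E and build
    the model training set from the a transitions before each t_i and the
    model test set from the b transitions starting at t_i, mirroring
    tau_E(t_i - a, t_i - 1) and tau_E(t_i, t_i + b)."""
    rng = rng or np.random.default_rng()
    train, test = [], []
    # only pick t_i with enough context on both sides
    targets = rng.integers(a, len(tau_E) - b, size=k)
    for t in targets:
        train.extend(tau_E[t - a:t])     # the a time points before t_i
        test.extend(tau_E[t:t + b])      # the b time points from t_i on
    return train, test
```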
Further, the model prediction control processing is performed on the model after model fitting training to obtain an update strategy, which comprises the following steps:
acquiring label data of the model training set, and carrying out feedback calculation on output data of the model training set according to the label data and the model to obtain a return function value;
obtaining the model parameters of the model after model fitting training, and performing model predictive control processing on the obtained model parameters and the return function values to obtain the update strategy; wherein the model parameters of f_θ′(s′|s, a) and the value of the return function R are input to the MPC controller to obtain the update strategy π′.
Step S40, updating the preset wind control strategy according to the update strategy, and, according to the updated preset wind control strategy, returning to the step of respectively performing interactive exploration between the meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task and the subsequent steps, until the model meets the convergence condition;
wherein model predictive control processing is performed on the model after model fitting training to obtain the wind control rule strategy for the optimal action decision, namely the update strategy; the step of respectively performing interactive exploration between the meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task and the subsequent steps are then executed again under the updated preset wind control strategy, achieving the effect of iterative model training.
Step S50, setting the parameters of the meta-agent according to the parameters of the converged model, and performing interactive exploration between the parameter-set meta-agent and the wind control decision system to be optimized to obtain a target exploration track;
wherein the parameters of the meta-agent are set based on the parameters of the converged model, so that the parameter-set meta-agent can effectively output an optimal action decision for the environment of the wind control decision system to be optimized, and the state changes of the meta-agent in that environment are obtained through the interactive exploration between the parameter-set meta-agent and the wind control decision system to be optimized;
step S60, optimizing the wind control decision rule in the wind control decision system to be optimized according to the target exploration track;
wherein the wind control decision rules in the wind control decision system to be optimized can be automatically optimized based on the state changes of the meta-agent in the environment of the wind control decision system to be optimized.
In this embodiment, the wind control decision rules in the wind control decision system to be optimized are automatically optimized by reinforcement learning, and the optimization target can be customized for different business purposes. The method can adaptively learn, on any wind control decision system, the optimal optimization strategy suited to that system, replacing the traditional method of manually analyzing and iterating rules based on expert experience, which greatly reduces the required manpower and material resources. It can also perform automatic strategy optimization iterations on real-time data, so that the business target is met without losing effectiveness: the iterated strategy keeps evolving as new user data arrives, the time and resources required by manual strategy iteration are greatly reduced, and the business target can be satisfied over the long term. By using meta reinforcement learning, this embodiment achieves an effective optimization result and improves the optimization efficiency of the wind control decision rules across different wind control decision systems.
Example II
Referring to fig. 2, a flowchart of a method for optimizing a wind control decision rule according to a second embodiment of the present invention is provided, and the method is used for further refining step S60, and includes the steps of:
step S61, carrying out parameter updating on the converged model according to the target exploration track to obtain target parameters, and carrying out parameter setting on the meta-intelligent agent according to the target parameters to obtain a target meta-intelligent agent;
when the meta-intelligent body with the parameters set is applied to a wind control decision system to be optimized, after the meta-intelligent body interacts with the wind control decision system to be optimized for multiple steps, collecting a target exploration track tau of the meta-intelligent body, and utilizing the target exploration track tau to approximate a probability distribution function f of the meta-intelligent body θ* Updating the parameters of (s '|s, a) to obtain a target parameter theta', thereby obtaining a new probability distribution function f θ‘ (s′S, a), namely, a meta-agent G' suitable for the wind control decision system to be optimized is obtained;
step S62, obtaining state information in a wind control decision rule in the wind control decision system to be optimized;
the state information comprises user data values of user data in the wind control decision system to be optimized and rule thresholds in a wind control decision rule;
Step S63, inputting the state information into the target meta-agent for rule optimization to obtain an optimization threshold, and updating the wind control decision rules in the wind control decision system to be optimized according to the optimization threshold;
wherein the wind control decision system to be optimized inputs its current state S to the target meta-agent and then adjusts the rule thresholds in the current wind control decision rules under the guidance of the target meta-agent, achieving the dynamic self-iteration of the wind control decision rules in the wind control decision system to be optimized.
Optionally, in this step, after updating the wind control decision rule in the wind control decision system to be optimized according to the optimization threshold, the method further includes:
calculating an iterative return function value according to the rule threshold and the optimization threshold, and acquiring state information in a wind control decision rule in the wind control decision system to be optimized after rule optimization;
wherein the parameters of the target meta-agent are updated with the acquired state information and the iterative return function value;
that is, the return R and the new state S′ obtained after action A is taken are input to the target meta-agent, so that the target meta-agent keeps performing reinforcement learning training online and iterating the approximate probability distribution function, and the optimal decision strategy can thus be learned most effectively.
In this embodiment, the current state S of the wind control decision system to be optimized is input to the target meta-agent, and the rule thresholds in the current wind control decision rules are adjusted to achieve the dynamic self-iteration of the wind control decision rules in the wind control decision system to be optimized; the return R and the new state S′ obtained after action A are then input to the target meta-agent, so that the automatic iteration of the target meta-agent is effectively achieved.
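The online deployment loop of Example II can be summarized in a few lines; the env/agent interface below is an assumption made for illustration, not the patent's API.

```python
def online_iteration(env, agent, steps):
    """Deployment loop for the target meta-agent G': the system to be
    optimized feeds its current state S, the agent returns the threshold
    adjustment A, and the resulting return R and new state S' are fed back
    so the agent keeps training online."""
    s = env.current_state()            # rule thresholds + user data values
    for _ in range(steps):
        a = agent.act(s)               # optimization threshold adjustment
        s_next, r = env.apply(a)       # update the live wind control rule
        agent.update(s, a, r, s_next)  # iterate the approximate distribution
        s = s_next
```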
Example III
Referring to fig. 3, a schematic structural diagram of a wind control decision rule optimization system 100 according to a third embodiment of the present invention is provided, comprising: a task extraction module 10, an interaction exploration module 11, a model fitting module 12, a policy updating module 13, a parameter setting module 14 and a rule optimization module 15, wherein:
the task extraction module 10 is used for constructing a meta-task pool, collecting historical data for the wind control decision systems in the meta-task pool, constructing a simulation environment of each wind control decision system according to the historical data of that wind control decision system, and extracting a wind control decision system from the meta-task pool to obtain a meta-task;
the interaction exploration module 11 is configured to, according to a preset wind control strategy, respectively perform interactive exploration between the meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task to obtain an interaction track data set, wherein the interaction track data set comprises the exploration tracks of the meta-agent in the simulation environments of the wind control decision systems, and an exploration track characterizes the state change process of the meta-agent in the simulation environment of the corresponding wind control decision system.
The interaction exploration module 11 is further used for: respectively acquiring the simulation environments of the wind control decision systems corresponding to the meta-tasks extracted from the meta-task pool, and initializing the meta-agent;
and performing interactive exploration of the initialized meta-agent in the simulation environment of the wind control decision system corresponding to each meta-task to obtain the interaction track data set.
And the model fitting module 12 is used for carrying out model fitting training on the model according to the interaction track data set, and carrying out model prediction control processing on the model after the model fitting training to obtain an updating strategy.
Wherein the model fitting module 12 is further configured to: extracting time points from the interaction track data set to obtain a target time point, and determining a model training set and a model testing set in the interaction track data set according to the target time point;
and performing model fitting on the model according to the model training set, and performing parameter updating on the fitted model according to the model testing set to obtain the model after model fitting training.
Further, the model fitting module 12 is further configured to: respectively acquiring track data from a first preset number of time points to target time points before each target time point in each exploration track, and generating the model training set according to the acquired track data;
And respectively acquiring track data between each target time point and a second preset number of time points after the target time point in each exploration track, and generating the model test set according to the acquired track data.
Still further, the model fitting module 12 is also configured to: acquiring label data of the model training set, and carrying out feedback calculation on output data of the model training set according to the label data and the model to obtain a return function value;
and obtaining model parameters of the model after model fitting training, and carrying out model prediction control processing on the obtained model parameters and the return function values to obtain the updating strategy.
The policy updating module 13 is configured to update the preset wind control strategy according to the update strategy, and, according to the updated preset wind control strategy, return to executing the step of respectively performing interactive exploration between the meta-agent and each sample wind control decision system and the subsequent steps, until the model meets the convergence condition.
The parameter setting module 14 is used for setting the parameters of the meta-agent according to the parameters of the converged model, and performing interactive exploration between the parameter-set meta-agent and the wind control decision system to be optimized to obtain a target exploration track.
And the rule optimization module 15 is used for optimizing the wind control decision rule in the wind control decision system to be optimized according to the target exploration track.
The rule optimization module 15 is further configured to: update the parameters of the converged model according to the target exploration track to obtain target parameters, and set the parameters of the meta-agent according to the target parameters to obtain a target meta-agent;
acquire state information in the wind control decision rules in the wind control decision system to be optimized, wherein the state information comprises the user data values of the user data in the wind control decision system to be optimized and the rule thresholds in the wind control decision rules;
and input the state information into the target meta-agent for rule optimization to obtain an optimization threshold, and update the wind control decision rules in the wind control decision system to be optimized according to the optimization threshold.
Further, the rule optimizing module 15 is further configured to: calculating an iterative return function value according to the rule threshold and the optimization threshold, and acquiring state information in a wind control decision rule in the wind control decision system to be optimized after rule optimization;
and update the parameters of the target meta-agent with the acquired state information and the iterative return function value.
In this embodiment, the meta-agent performs interactive exploration with the simulation environment of the wind control decision system corresponding to each meta-task to obtain its exploration track in each of these simulation environments; model fitting training is performed on the model based on the track data set, so that the trained model can effectively output an optimal action decision for each sample wind control decision system environment; model predictive control processing is performed on the trained model to obtain the wind control rule strategy for the optimal action decision, namely the update strategy; the step of respectively performing interactive exploration between the meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task and the subsequent steps are then executed again under the updated preset wind control strategy, achieving iterative training of the model. The parameters of the meta-agent are set based on the parameters of the converged model, so that the parameter-set meta-agent can effectively output an optimal action decision for the environment of the wind control decision system to be optimized; through the interactive exploration between the parameter-set meta-agent and the wind control decision system to be optimized, the wind control decision rules in that system are optimized based on the state changes of the meta-agent in its environment. This embodiment can automatically optimize the wind control decision rules in the wind control decision system to be optimized, can adaptively learn the optimal optimization strategy suitable for any wind control decision system, reduces the required manpower and material resources, and improves the optimization efficiency of the wind control decision rules.
Example IV
Fig. 4 is a block diagram of a terminal device 2 according to a fourth embodiment of the present application. As shown in fig. 4, the terminal device 2 of this embodiment includes: a processor 20, a memory 21 and a computer program 22, such as a program of a wind-controlled decision rule optimization method, stored in said memory 21 and executable on said processor 20. The steps of the various embodiments of the wind control decision rule optimization methods described above are implemented by the processor 20 when executing the computer program 22.
Illustratively, the computer program 22 may be divided into one or more modules, which are stored in the memory 21 and executed by the processor 20 to complete the present application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program 22 in the terminal device 2. The terminal device may include, but is not limited to, the processor 20 and the memory 21.
The processor 20 may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 21 may be an internal storage unit of the terminal device 2, such as a hard disk or a memory of the terminal device 2. The memory 21 may also be an external storage device of the terminal device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the terminal device 2. Further, the memory 21 may also include both an internal storage unit and an external storage device of the terminal device 2. The memory 21 is used for storing the computer program as well as other programs and data required by the terminal device. The memory 21 may also be used for temporarily storing data that has been output or is to be output.
In addition, each functional module in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium, which may be non-volatile or volatile. Based on this understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing the relevant hardware through a computer program, which may be stored in a computer readable storage medium and which, when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should be noted that the content contained in the computer readable storage medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer readable storage medium does not include electrical carrier signals and telecommunications signals.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (9)

1. A method for optimizing a wind control decision rule, the method comprising:
building a meta-task pool, collecting historical data of each wind control decision system in the meta-task pool, constructing a simulation environment for each wind control decision system according to its historical data, and sampling wind control decision systems from the meta-task pool to obtain meta-tasks, wherein the historical data comprises at least historical threshold adjustment data, historical user characteristic data, historical decision data and historical admitted-user behavior data;
performing, according to a preset wind control strategy, interactive exploration between a meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task to obtain an interaction trajectory data set, wherein the interaction trajectory data set comprises exploration trajectories of the meta-agent in the simulation environments of the wind control decision systems, and each exploration trajectory characterizes the state change process of the meta-agent in the simulation environment of the corresponding wind control decision system;
performing model fitting training on a model according to the interaction trajectory data set, and performing model predictive control processing on the fitted model to obtain an update strategy;
updating the preset wind control strategy according to the update strategy, and returning, with the updated preset wind control strategy, to the step of performing interactive exploration between the meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task and to the subsequent steps, until the model meets a convergence condition;
setting parameters of the meta-agent according to the parameters of the converged model, and performing interactive exploration between the parameterized meta-agent and a wind control decision system to be optimized to obtain a target exploration trajectory;
optimizing the wind control decision rule in the wind control decision system to be optimized according to the target exploration trajectory, which comprises:
updating the parameters of the converged model according to the target exploration trajectory to obtain target parameters, and setting the parameters of the meta-agent according to the target parameters to obtain a target meta-agent;
acquiring state information of the wind control decision rule in the wind control decision system to be optimized, wherein the state information comprises user data values of user data in the wind control decision system to be optimized and a rule threshold of the wind control decision rule;
and inputting the state information into the target meta-agent for rule optimization to obtain an optimized threshold, and updating the wind control decision rule in the wind control decision system to be optimized according to the optimized threshold.
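Purely as an illustration of the loop recited in claim 1, the following self-contained Python sketch runs the explore / fit / model-predictive-control cycle on a toy one-dimensional threshold-adjustment environment. Every name (ToyRuleEnv, rollout, fit_model, mpc_policy) and every modelling choice (a linear next-state model, random-shooting MPC, a toy reward centred on a 0.6 threshold) is an assumption made for this example, not the implementation prescribed by the claims.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyRuleEnv:
    """Toy simulation environment of a wind control decision system:
    state = (user score, rule threshold); the action nudges the threshold."""
    horizon = 20

    def reset(self):
        self.s = np.array([rng.normal(0.5, 0.1), 0.5])
        return self.s.copy()

    def step(self, a):
        self.s[1] = np.clip(self.s[1] + a, 0.0, 1.0)   # adjust the rule threshold
        self.s[0] = rng.normal(0.5, 0.1)               # next user's score
        # Toy reward: admit users above the threshold while keeping the
        # threshold near an assumed sweet spot of 0.6.
        r = float(self.s[0] >= self.s[1]) - abs(self.s[1] - 0.6)
        return self.s.copy(), r

def rollout(env, policy):
    """Interactive exploration: one exploration trajectory in one environment."""
    s, traj = env.reset(), []
    for _ in range(env.horizon):
        a = policy(s)
        s2, r = env.step(a)
        traj.append((s, a, r, s2))
        s = s2
    return traj

def fit_model(trajs):
    """Model fitting training: least-squares next-state model of (state, action)."""
    X = np.array([np.append(s, a) for t in trajs for (s, a, r, s2) in t])
    Y = np.array([s2 for t in trajs for (s, a, r, s2) in t])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def mpc_policy(W):
    """Model predictive control processing: random shooting over the fitted
    model yields the update strategy used as the new wind control strategy."""
    def policy(s):
        cands = rng.uniform(-0.1, 0.1, size=32)
        preds = np.hstack([np.tile(s, (32, 1)), cands[:, None]]) @ W
        scores = (preds[:, 0] >= preds[:, 1]).astype(float) - np.abs(preds[:, 1] - 0.6)
        return cands[np.argmax(scores)]
    return policy

policy = lambda s: rng.uniform(-0.1, 0.1)    # preset wind control strategy
envs = [ToyRuleEnv() for _ in range(4)]      # one simulated environment per meta-task
for _ in range(10):                          # in practice: until the model converges
    trajs = [rollout(env, policy) for env in envs]
    W = fit_model(trajs)                     # model fitting training
    policy = mpc_policy(W)                   # update strategy replaces the preset one
```

In this reading, fine-tuning on the system to be optimized would amount to a few extra rollout and fit_model calls against the real system before freezing the agent; the claims leave the exact convergence test and fine-tuning schedule open.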
2. The method for optimizing a wind control decision rule according to claim 1, wherein the performing model fitting training on the model according to the interaction trajectory data set comprises:
extracting time points from the interaction trajectory data set to obtain target time points, and determining a model training set and a model test set in the interaction trajectory data set according to the target time points;
fitting the model according to the model training set, and updating the parameters of the fitted model according to the model test set to obtain the model after model fitting training.
3. The method for optimizing a wind control decision rule according to claim 2, wherein the determining a model training set and a model test set in the interaction trajectory data set according to the target time points comprises:
for each exploration trajectory, acquiring the trajectory data from the time point a first preset number of steps before each target time point up to that target time point, and generating the model training set from the acquired trajectory data;
and, for each exploration trajectory, acquiring the trajectory data between each target time point and the time point a second preset number of steps after it, and generating the model test set from the acquired trajectory data.
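As a hedged reading of claims 2 and 3, the split around each target time point could be realized as below; the window sizes k1 and k2 stand in for the "first" and "second preset number of time points" and, like the function name split_trajectory, are assumptions of this example.

```python
def split_trajectory(traj, target_points, k1=5, k2=3):
    """For each target time point t of an exploration trajectory, the k1 steps
    before t go to the model training set and the k2 steps after t go to the
    model test set (illustrative reading of claims 2-3)."""
    train, test = [], []
    for t in target_points:
        train.extend(traj[max(0, t - k1):t])   # trajectory data before the target point
        test.extend(traj[t:t + k2])            # trajectory data after the target point
    return train, test

# Usage on a trajectory produced by the rollout() sketch above:
# train_set, test_set = split_trajectory(traj, target_points=[8, 15])
```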
4. The method for optimizing a wind control decision rule according to claim 2, wherein the performing model predictive control processing on the model after model fitting training to obtain an update strategy comprises:
acquiring label data of the model training set, and performing feedback calculation on the output data of the model training set according to the label data and the model to obtain return function values;
and acquiring the model parameters of the model after model fitting training, and performing model predictive control processing on the acquired model parameters and the return function values to obtain the update strategy.
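One possible, non-authoritative rendering of claim 4 follows: return function values are obtained by feedback-comparing label data with model outputs on the training set, and a random-shooting model-predictive-control pass over the fitted parameters W selects the action that defines the update strategy. The squared-error feedback and the stand-in per-step return are this example's assumptions, not the patented formulas.

```python
import numpy as np

def return_values(labels, outputs):
    """Feedback calculation: an assumed return function value, here the
    negative squared error between label data and model outputs."""
    labels, outputs = np.asarray(labels), np.asarray(outputs)
    return -np.sum((labels - outputs) ** 2, axis=-1)

def mpc_step(W, state, horizon=5, n_cand=64, seed=1):
    """Model predictive control over the fitted model parameters W: simulate
    candidate action sequences and return the first action of the best one."""
    rng = np.random.default_rng(seed)
    best_a, best_ret = 0.0, -np.inf
    for _ in range(n_cand):
        s, ret = np.asarray(state, dtype=float), 0.0
        seq = rng.uniform(-0.1, 0.1, size=horizon)
        for a in seq:
            s = np.append(s, a) @ W        # predicted next state under the model
            ret += -abs(s[1] - 0.6)        # stand-in per-step return (assumption)
        if ret > best_ret:
            best_a, best_ret = seq[0], ret
    return best_a
```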
5. The method for optimizing a wind control decision rule according to claim 1, wherein, after updating the wind control decision rule in the wind control decision system to be optimized according to the optimized threshold, the method further comprises:
calculating an iterative return function value according to the rule threshold and the optimized threshold, and acquiring the state information of the wind control decision rule in the wind control decision system to be optimized after rule optimization;
and updating the parameters of the target meta-agent with the acquired state information and the iterative return function value.
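A minimal sketch of the online adaptation in claim 5, assuming the target meta-agent is a linear map over the state and the iterative return function value rewards moving the threshold toward an assumed target of 0.6; both choices are illustrative, not the claimed formulas.

```python
import numpy as np

def iter_return(rule_thr, opt_thr, target=0.6):
    """Assumed iterative return function value: the improvement in distance
    between the threshold and an assumed target of 0.6."""
    return abs(rule_thr - target) - abs(opt_thr - target)

def online_update(w, state, lr=0.1, rng=np.random.default_rng(2)):
    """One online step: propose an optimized threshold, score it against the
    current rule threshold, and keep a perturbed agent if it scores better."""
    opt_thr = float(np.clip(state @ w, 0.0, 1.0))      # optimized threshold
    r = iter_return(state[-1], opt_thr)                # state[-1] = rule threshold
    w2 = w + lr * rng.normal(size=w.shape)             # perturbed agent parameters
    if iter_return(state[-1], float(np.clip(state @ w2, 0.0, 1.0))) > r:
        w = w2                                         # parameter update of the agent
    return w, opt_thr

state = np.array([0.72, 0.55])   # (user data value, current rule threshold)
w = np.array([0.3, 0.7])         # toy target meta-agent parameters
w, new_thr = online_update(w, state)
```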
6. The method for optimizing a wind control decision rule according to any one of claims 1 to 5, wherein the performing interactive exploration between the meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task according to the preset wind control strategy to obtain the interaction trajectory data set comprises:
acquiring the simulation environment of the wind control decision system corresponding to each meta-task sampled from the meta-task pool, and initializing the meta-agent;
and performing interactive exploration with the initialized meta-agent in the simulation environment of the wind control decision system corresponding to each meta-task to obtain the interaction trajectory data set.
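Claim 6's exploration step might then reduce to the fragment below, which reuses the rollout() helper from the claim 1 sketch; sample_envs() and make_agent() are hypothetical names for the meta-task sampling and the meta-agent initialization.

```python
def explore(sample_envs, make_agent):
    """Acquire each sampled meta-task's simulated environment, initialize the
    meta-agent, and collect the interaction trajectory data set (assumed API)."""
    dataset = []
    for env in sample_envs():      # simulation environment per sampled meta-task
        policy = make_agent()      # freshly initialized meta-agent, as a policy callable
        dataset.append(rollout(env, policy))
    return dataset
```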
7. A system for optimizing a wind control decision rule, the system comprising:
a task extraction module, configured to build a meta-task pool, collect historical data of each wind control decision system in the meta-task pool, construct a simulation environment for each wind control decision system according to its historical data, and sample wind control decision systems from the meta-task pool to obtain meta-tasks, wherein the historical data comprises at least historical threshold adjustment data, historical user characteristic data, historical decision data and historical admitted-user behavior data;
an interactive exploration module, configured to perform, according to a preset wind control strategy, interactive exploration between a meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task to obtain an interaction trajectory data set, wherein the interaction trajectory data set comprises exploration trajectories of the meta-agent in the simulation environments of the wind control decision systems, and each exploration trajectory characterizes the state change process of the meta-agent in the simulation environment of the corresponding wind control decision system;
a model fitting module, configured to perform model fitting training on a model according to the interaction trajectory data set, and perform model predictive control processing on the fitted model to obtain an update strategy;
a strategy update module, configured to update the preset wind control strategy according to the update strategy, and return, with the updated preset wind control strategy, to the step of performing interactive exploration between the meta-agent and the simulation environment of the wind control decision system corresponding to each meta-task and to the subsequent steps, until the model meets a convergence condition;
a parameter setting module, configured to set parameters of the meta-agent according to the parameters of the converged model, and perform interactive exploration between the parameterized meta-agent and a wind control decision system to be optimized to obtain a target exploration trajectory;
a rule optimization module, configured to optimize the wind control decision rule in the wind control decision system to be optimized according to the target exploration trajectory, including: updating the parameters of the converged model according to the target exploration trajectory to obtain target parameters, and setting the parameters of the meta-agent according to the target parameters to obtain a target meta-agent; acquiring state information of the wind control decision rule in the wind control decision system to be optimized, wherein the state information comprises user data values of user data in the wind control decision system to be optimized and a rule threshold of the wind control decision rule; and inputting the state information into the target meta-agent for rule optimization to obtain an optimized threshold, and updating the wind control decision rule in the wind control decision system to be optimized according to the optimized threshold.
8. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN202211182023.XA 2022-09-27 2022-09-27 Wind control decision rule optimization method, system, terminal and storage medium Active CN115829717B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211182023.XA CN115829717B (en) 2022-09-27 2022-09-27 Wind control decision rule optimization method, system, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN115829717A CN115829717A (en) 2023-03-21
CN115829717B true CN115829717B (en) 2023-09-19

Family

ID=85524007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211182023.XA Active CN115829717B (en) 2022-09-27 2022-09-27 Wind control decision rule optimization method, system, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN115829717B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117196823B (en) * 2023-09-08 2024-03-19 厦门国际银行股份有限公司 Wind control rule generation method, system and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144985A (en) * 2019-12-24 2020-05-12 北京每日优鲜电子商务有限公司 Unit transfer value adjusting method, unit transfer value adjusting device, computer equipment and storage medium
WO2022083029A1 (en) * 2020-10-19 2022-04-28 深圳大学 Decision-making method based on deep reinforcement learning
CA3151239A1 (en) * 2021-03-15 2022-09-15 Honeywell Limited Process controller with meta-reinforcement learning
CN113034290A (en) * 2021-03-25 2021-06-25 中山大学 Quantitative investment method and device based on expert track

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Robust reinforcement learning algorithms based on transfer learning and meta-learning; Jiang Shengyi; China Master's Theses Full-text Database, Information Science and Technology (No. 5); I140-244 *

Similar Documents

Publication Publication Date Title
Sangiorgio et al. Robustness of LSTM neural networks for multi-step forecasting of chaotic time series
Biedenkapp et al. Dynamic algorithm configuration: Foundation of a new meta-algorithmic framework
Dearden et al. Model-based Bayesian exploration
CN115829717B (en) Wind control decision rule optimization method, system, terminal and storage medium
US20190228297A1 (en) Artificial Intelligence Modelling Engine
CN112734014A (en) Experience playback sampling reinforcement learning method and system based on confidence upper bound thought
WO2018224165A1 (en) Device and method for clustering a set of test objects
CN111340245B (en) Model training method and system
CN114781692A (en) Short-term power load prediction method and device and electronic equipment
Stützle et al. Automatic (offline) configuration of algorithms
CN111461353A (en) Model training method and system
CN114297934A (en) Model parameter parallel simulation optimization method and device based on proxy model
Qi et al. Hyperparameter optimization of neural networks based on Q-learning
CN111539558B (en) Power load prediction method adopting optimization extreme learning machine
Chenxin et al. Searching parameterized AP loss for object detection
CN115619563A (en) Stock price analysis method based on neural network
KR20230038136A (en) Knowledge distillation method and system specialized for lightweight pruning-based deep neural networks
CN114692888A (en) System parameter processing method, device, equipment and storage medium
KR20220014744A (en) Data preprocessing system based on a reinforcement learning and method thereof
Silva et al. CurL-AutoML: Curriculum Learning-based AutoML
Harutyunyan et al. Off-policy reward shaping with ensembles
Tanabe et al. TPAM: a simulation-based model for quantitatively analyzing parameter adaptation methods
CN113705614B (en) GAN-based complex industrial process operation index correction method
Biedenkapp Dynamic algorithm configuration by reinforcement learning
Yin et al. Distributional meta-gradient reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant