CN117077034A - Visual analysis method and system for reinforcement learning model - Google Patents

Visual analysis method and system for reinforcement learning model

Info

Publication number
CN117077034A
Authority
CN
China
Prior art keywords
reinforcement learning
learning model
data
training
steps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311011010.0A
Other languages
Chinese (zh)
Inventor
王蕴哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University of Science and Technology
Original Assignee
Suzhou University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University of Science and Technology filed Critical Suzhou University of Science and Technology
Priority to CN202311011010.0A
Publication of CN117077034A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The application relates to a visual analysis method and system for a reinforcement learning model. The method comprises the following steps: selecting a plurality of past experience data for the time steps in each round and performing similarity analysis on the experience data; evaluating the training quality of the reinforcement learning model according to the result of the similarity analysis of the experience data, and, if the training quality of the reinforcement learning model meets the expected value, continuing to train the reinforcement learning model according to the current strategy; if the training quality of the reinforcement learning model does not meet the expected value, making optimization recommendations for the reinforcement learning model training; and visually displaying the optimization recommendation results for the reinforcement learning model training and the two-dimensional spatial structure diagram, respectively. By comprehensively and deeply mining the experience data generated during reinforcement learning training, the method and system can make optimization recommendations for the training process of the reinforcement learning model.

Description

Visual analysis method and system for reinforcement learning model
Technical Field
The application relates to the technical field of reinforcement learning, in particular to a visual analysis method and a visual analysis system of a reinforcement learning model.
Background
At present, visual analysis work on deep reinforcement learning mainly focuses on discrete-action domains such as games or robot navigation; the lack of diverse applications in wider fields is largely caused by users' inability to understand and trust the models. In addition, most deep reinforcement learning at this stage uses experience replay, i.e., previous experience data are reused to improve the efficiency and stability of the deep reinforcement learning algorithm. Therefore, which experiences are selected, and the association between the selected experiences, have a crucial influence on the performance of model training. However, existing work rarely performs comprehensive and deep mining analysis on experience data.
Disclosure of Invention
Therefore, the application aims to solve the technical problem that the prior art rarely carries out comprehensive and deep mining analysis on the reinforcement learning experience data.
In order to solve the technical problems, the application provides a visual analysis method of a reinforcement learning model, which comprises the following steps:
acquiring state data trained by a reinforcement learning model, aggregating the state data, and performing space mapping on the aggregated data to obtain a two-dimensional space structure diagram, wherein the model training process comprises a plurality of rounds, each round comprises a plurality of time steps, and each time step generates corresponding state data and experience data;
selecting a plurality of past experience data for the time steps in each round and performing similarity analysis on the experience data, wherein the method comprises the following steps: in the spatial dimension, analyzing similarity in several empirical data in a single time step; in the time dimension, analyzing the similarity of experience data between adjacent time steps;
according to the analysis result of the similarity of the experience data, the training quality of the reinforcement learning model is evaluated, and if the training quality of the reinforcement learning model meets the expected value, the reinforcement learning model is continuously trained according to the current strategy; if the training quality of the reinforcement learning model does not accord with the expected value, optimizing and recommending the reinforcement learning model training;
and respectively visually displaying the result of recommending the training optimization of the reinforcement learning model and the two-dimensional space structure diagram.
In one embodiment of the present application, the method for obtaining state data trained by a reinforcement learning model, aggregating the state data, and spatially mapping the aggregated data to obtain a two-dimensional spatial structure map includes:
acquiring state data trained by the reinforcement learning model;
aggregating state data corresponding to time steps in each round in a time dimension;
carrying out feature importance analysis on the aggregated state data based on a random forest regression model to screen out preset number of features;
and mapping the data subjected to feature screening from the high-dimensional feature space to the two-dimensional visual space based on the t-SNE algorithm to obtain a two-dimensional space structure diagram.
In one embodiment of the present application, the method for aggregating state data corresponding to time steps in each round in the time dimension includes:
detecting abnormal points in the time steps by using a k-nearest neighbor algorithm in each round to locate the time steps in which the rewards suddenly change, wherein the rewards are generated when an agent performs actions in the training of the reinforcement learning model;
and clustering the remaining time steps, other than the time steps in which the reward suddenly changes, using a k-Means algorithm, and integrating the time steps clustered by the k-Means algorithm with the time steps in which the reward suddenly changes to complete the aggregation of the state data.
In one embodiment of the present application, the method for performing feature importance analysis on aggregated state data based on a random forest regression model to screen out a preset number of features includes:
for round i, the corresponding aggregated state dataset is S_i = {s_i1, s_i2, s_i3, ..., s_iT}, where s_ij denotes the state corresponding to the j-th time step in round i, with 1 ≤ j ≤ T;
s_ij is represented by an h-dimensional feature vector;
a random forest regression model is fitted on the state dataset S_i, importance scores are computed and ranked according to each feature's contribution to the model prediction, and, according to the scoring and ranking results, the preset number of features are screened out of the h-dimensional feature vector corresponding to s_ij.
In one embodiment of the application, the method of analyzing similarity in empirical data in a single time step in a spatial dimension includes:
in the spatial dimension, for a single time step, a plurality of past experience data are sampled and a node graph is constructed from the sampled experience data, wherein each experience corresponds to one node in the node graph, nodes n_i and n_j in the node graph are connected by an edge e_ij, and the weight w(e_ij) of edge e_ij is the similarity value between nodes n_i and n_j;
the edges e_ij between nodes in the node graph are pruned by a maximum spanning tree algorithm to obtain an experience association graph, thereby reducing the edge density of the node graph.
In one embodiment of the present application, the method for analyzing the similarity of empirical data between adjacent time steps in the time dimension includes:
in the time dimension, a dynamic network G = {G_0, G_1, ..., G_T} is constructed, where G_i is the experience association graph at time step i constructed by the maximum spanning tree algorithm;
each experience association graph G_i in the dynamic network G = {G_0, G_1, ..., G_T} is partitioned into communities by the Louvain algorithm, where a community is a set of nodes in the experience association graph G_i;
and judging whether similar experience data are repeatedly selected in the adjacent time steps according to the correlation judgment of communities in the adjacent time steps.
In one embodiment of the application, the formula of the similarity value between nodes n_i and n_j is:
where d(s_i, s_j) denotes the normalized Euclidean distance between current states s_i and s_j, d(s_i', s_j') denotes the normalized Euclidean distance between next states s_i' and s_j', d(a_i, a_j) denotes the normalized Euclidean distance between actions a_i and a_j, and d(r_i, r_j) denotes the normalized Euclidean distance between rewards r_i and r_j.
In one embodiment of the present application, the formula for judging the correlation between communities in adjacent time steps is:
where r(C_i, C_j) denotes the correlation between community C_i and community C_j, and S(m_i, m_j) denotes the similarity between node m_i in community C_i and node m_j in community C_j.
In one embodiment of the present application, if the training quality of the reinforcement learning model does not meet the expected value, the method performs optimization recommendation on the reinforcement learning model training, and includes:
if the training quality of the reinforcement learning model does not accord with the expected value, optimizing and recommending the reinforcement learning model training according to the relation between experience diversity and model prediction effect, wherein the experience diversity is influenced by three factors of the number of replay buffers, the number of sampling experiences and a method for sampling experiences, and the replay buffers are used for storing experience data;
the optimization recommendation comprises combinations of different values of the three factors, each combination being measured by R²; the higher R² is, the better the prediction effect of the model under the current combination; the formula of R² is:
R² = 1 - Σ_{i=1}^{n} (y_i - ŷ_i)² / Σ_{i=1}^{n} (y_i - ȳ)²
where n denotes the number of samples, y_i denotes the true value of the i-th sample, ŷ_i denotes the predicted value of the i-th sample, and ȳ denotes the mean of the true values.
In order to solve the above technical problems, the present application provides a visual analysis system for reinforcement learning model, comprising:
aggregation and mapping module: the method comprises the steps of acquiring state data for reinforcement learning model training, aggregating the state data, and performing space mapping on the aggregated data to obtain a two-dimensional space structure diagram, wherein the model training process comprises a plurality of rounds, each round comprises a plurality of time steps, and each time step generates corresponding state data and experience data;
similarity analysis module: for selecting a number of past empirical data for a time step in each round and performing a similarity analysis on the empirical data, comprising: in the spatial dimension, analyzing similarity in several empirical data in a single time step; in the time dimension, analyzing the similarity of experience data between adjacent time steps;
and an evaluation module: the method comprises the steps of realizing evaluation of training quality of the reinforcement learning model according to analysis results of similarity of experience data, and if the training quality of the reinforcement learning model meets an expected value, continuing training the reinforcement learning model according to a current strategy; if the training quality of the reinforcement learning model does not accord with the expected value, optimizing and recommending the reinforcement learning model training;
visual analysis module: and the method is used for respectively and visually displaying the result of recommending the training optimization of the reinforcement learning model and the two-dimensional space structure diagram.
Compared with the prior art, the technical scheme of the application has the following advantages:
according to the application, by analyzing the similarity of experience data, the degree of information overlap between experiences within the same time step can be quantified, and the evolution of experience associations across adjacent time steps can be explored, thereby helping the user diagnose training anomalies of the model;
the application evaluates the training process of the reinforcement learning model through the similarity of experience data; when the evaluation result does not meet expectations, the application combines different values of the three influencing factors (the number of replay buffers, the number of sampled experiences, and the sampling method) and uses the quantitative index R² to investigate the influence of experience diversity on the model training effect, so as to assist in formulating strategies for optimizing the model;
the method and the system can also visually display the optimization recommendation results for the reinforcement learning model training and the two-dimensional spatial structure diagram, respectively, realizing visual analysis of the training process of the reinforcement learning model and thus helping the user better understand the training process.
Drawings
In order that the application may be more readily understood, a more particular description of the application will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings.
FIG. 1 is a flow chart of the method of the present application;
FIG. 2 is a flow chart of an implementation of visual analysis of a DDPG reinforcement learning model in an embodiment of the present application;
FIG. 3 is a dynamic diagram of the empirical correlation constructed in an embodiment of the application as a function of time steps;
FIG. 4 is a schematic diagram of community partitioning in an embodiment of the present application;
FIG. 5 is a schematic diagram of dynamic community evolution in an embodiment of the present application;
FIG. 6 is a statistical information view of a DDPG visual analysis system in accordance with an embodiment of the present application;
FIG. 7 is a round view of a DDPG visual analysis system in an embodiment of the present application;
FIG. 8 is an empirical exploration view of a DDPG visual analysis system in an embodiment of the present application;
fig. 9 is a recommended view of the DDPG visual analysis system in the embodiment of the present application.
Detailed Description
The present application will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the application and practice it.
Example 1
Referring to fig. 1, the present application relates to a visual analysis method of reinforcement learning model, comprising:
acquiring state data trained by a reinforcement learning model, aggregating the state data, and performing space mapping on the aggregated data to obtain a two-dimensional space structure diagram, wherein the model training process comprises a plurality of rounds, each round comprises a plurality of time steps, and each time step generates corresponding state data and experience data;
selecting a plurality of past experience data for the time steps in each round and performing similarity analysis on the experience data, wherein the method comprises the following steps: in the spatial dimension, analyzing similarity in several empirical data in a single time step; in the time dimension, analyzing the similarity of experience data between adjacent time steps;
according to the analysis result of the similarity of the experience data, the training quality of the reinforcement learning model is evaluated, and if the training quality of the reinforcement learning model meets the expected value, the reinforcement learning model is continuously trained according to the current strategy; if the training quality of the reinforcement learning model does not accord with the expected value, optimizing and recommending the reinforcement learning model training;
and respectively visually displaying the recommended result of training and optimizing the reinforcement learning model and the two-dimensional space structure diagram to realize visual analysis of the training process of the reinforcement learning model.
The present embodiment is described in detail below:
the present embodiment relates generally to an interpretive analysis of reinforcement learning models represented by DDPG (Deep Deterministic Policy Gradient) models through visualization techniques, which enhance the understanding and trust of users to machines.
The terms involved in this embodiment and their expressions are as follows:
* State(s): variables describing conditions or features in the environment. In reinforcement learning, an agent makes a decision based on the current state and transitions to the next state s'.
* Action (a): the agent may perform operations in a particular state. An action is a way of interaction between an agent and an environment.
* Reward (r): at each time step, the agent obtains a reward from the environment as feedback on the action it takes. The reward may be positive, negative, or zero, indicating how good or bad the action was.
* Experience (m = <s, a, r, s'>): the (state, action, reward, next state) tuple obtained after the agent takes an action in a particular state, used to train the agent's policy.
* Round (e): reinforcement learning tasks typically consist of a series of consecutive time steps. A round may be a complete task or a complete scene. For example, in a game, a round may be a period of time from the start of the game to the end of the game.
* Time step (t): each time step represents an interaction of the agent with the environment, i.e. taking an action and obtaining a reward.
Referring to fig. 2, the present embodiment mainly includes three major parts:
(I) High-dimensional state data analysis (to give the user a basic understanding of the reinforcement learning model training process)
The present embodiment is mainly analyzed from three aspects: (1) similarity-based data aggregation in the time dimension; (2) a random forest based high-dimensional feature importance analysis; (3) visual analysis of data based on t-SNE dimension reduction.
(1) First, training of the deep reinforcement learning model includes multiple rounds, and each round includes multiple time steps, generating a large amount of high-dimensional state data. Presenting all of this state data requires the visual system to consume significant computational resources and results in long latency. Through preliminary analysis, this embodiment found that redundant information exists among the large number of states corresponding to adjacent time steps, so key features and structures can be extracted by aggregating the data in the time dimension, greatly reducing the amount of data the visual system has to process. The specific steps are as follows: outliers are first detected in each round using a k-nearest neighbor algorithm (k Nearest Neighbors, kNN for short) to locate the time steps at which the reward changes abruptly (e.g., a reward jumping from 1 to 100). The remaining time steps are then clustered using the k-Means algorithm. k-Means is a classical clustering algorithm that divides data into clusters of similar objects; its principle is simple, it converges relatively quickly on large datasets, and it is suitable for different similarity measures. This embodiment trains the DDPG model on an energy prediction task as an example, where the dimensionality of the state data is far higher than that of the action and reward data.
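As a non-limiting illustration of this aggregation step, a minimal Python sketch is given below; the array shapes, the neighbour count k, the cluster count, and the outlier quantile are illustrative assumptions rather than values prescribed by this embodiment.

# Sketch: aggregate the state data of one round (assumed shapes and parameters).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

def aggregate_round(states, rewards, k=5, n_clusters=20, outlier_quantile=0.95):
    """states: (T, h) state matrix of one round; rewards: (T,) reward per time step."""
    # 1) kNN-based outlier detection on rewards: time steps whose reward is far from
    #    its k nearest neighbours are treated as "sudden reward change" steps.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(rewards.reshape(-1, 1))
    dist, _ = nn.kneighbors(rewards.reshape(-1, 1))
    outlier_score = dist[:, -1]                          # distance to the k-th neighbour
    abrupt = outlier_score > np.quantile(outlier_score, outlier_quantile)

    # 2) k-Means clustering of the remaining time steps; keep one representative per cluster.
    rest_idx = np.where(~abrupt)[0]
    km = KMeans(n_clusters=min(n_clusters, len(rest_idx)), n_init=10).fit(states[rest_idx])
    reps = [rest_idx[np.argmin(np.linalg.norm(states[rest_idx] - c, axis=1))]
            for c in km.cluster_centers_]

    # 3) Integrate the abrupt-reward steps with the cluster representatives.
    keep = np.sort(np.unique(np.concatenate([np.where(abrupt)[0], np.array(reps)])))
    return states[keep], keep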
(2) The subsequent feature importance analysis assists the domain expert in screening out redundant information during feature selection, thereby reducing computational complexity and improving model performance. Specifically, this embodiment employs a random forest regression model. For round i, the corresponding aggregated state dataset is S_i = {s_i1, s_i2, s_i3, ..., s_iT}, where s_ij denotes the state corresponding to the j-th time step in round i, with 1 ≤ j ≤ T; meanwhile, s_ij can be represented by an h-dimensional feature vector. By fitting a random forest regression model on S_i and scoring and ranking feature importance according to each feature's contribution to the model prediction, the expert can, depending on the specific requirements of the application, retain only the top-ranked features (i.e., screen a preset number of features out of the h-dimensional feature vector corresponding to s_ij), so as to simplify computation and analysis.
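A rough sketch of the feature-importance screening follows; using the per-step reward as the regression target is an assumption made here for illustration, since the embodiment only states that importance is scored by each feature's contribution to the model prediction.

# Sketch: rank state features by random-forest importance (reward as an assumed target).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_top_features(S_i, target, n_keep=10):
    """S_i: (T, h) aggregated states of round i; target: (T,) e.g. the reward per step."""
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(S_i, target)
    order = np.argsort(rf.feature_importances_)[::-1]    # most important features first
    return order[:n_keep], rf.feature_importances_[order[:n_keep]]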
(3) Finally, to facilitate observation and exploration of the distribution patterns among the data, this embodiment uses the t-SNE technique to map the data of all rounds from the high-dimensional feature space to the two-dimensional visual space for presentation. The t-SNE technique preserves, in the low-dimensional space, the associations the data had in the original high-dimensional space: the closer two data points lie in visual space, the more similar their corresponding high-dimensional features are.
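As a further non-limiting illustration, the two-dimensional projection could be produced with scikit-learn's t-SNE implementation (the perplexity value here is an arbitrary example, not a value fixed by the embodiment).

# Sketch: project the feature-screened states of all rounds to two dimensions with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

def project_2d(X, perplexity=30.0):
    """X: (N, d) states after feature screening; returns an (N, 2) coordinate array."""
    return TSNE(n_components=2, perplexity=perplexity, init="pca",
                random_state=0).fit_transform(X)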
(II) Spatio-temporal modeling and analysis of experience data (to evaluate the training process of the reinforcement learning model and measure its quality)
(1) In the spatial dimension, this embodiment focuses on analyzing the correlation between experience data sampled from the replay buffer in a single time step. The specific process is as follows: each experience corresponds to a node in the node graph, nodes n_i and n_j are connected by an edge e_ij, and the edge weight w(e_ij) equals the similarity value of the two nodes, calculated by formula (1). Initially, this embodiment retains all edges of the node graph, but this makes the graph dense, hard to display intuitively, and expensive to compute. Thus, the embodiment uses a maximum spanning tree algorithm (Maximum Spanning Tree, MST for short) to reduce the edge density of the graph. The MST algorithm is a graph-theoretic algorithm that, given a weighted undirected graph, finds a loop-free subgraph containing all nodes while maximizing the sum of the edge weights. It removes redundant information from the original graph to the greatest extent, maintains the connectivity of the subgraph, and helps subsequent graph-theoretic methods explore the association relations of the experience data.
The similarity formula between experience terms is as follows:
where d(s_i, s_j) denotes the normalized Euclidean distance between current states s_i and s_j, d(s_i', s_j') denotes the normalized Euclidean distance between next states s_i' and s_j', d(a_i, a_j) denotes the normalized Euclidean distance between actions a_i and a_j, and d(r_i, r_j) denotes the normalized Euclidean distance between rewards r_i and r_j.
In short, in the spatial dimension, a larger sum of edge weights in the experience association graph (i.e., the loop-free subgraph) generated in a single time step indicates that the nodes in the graph are more similar, which is disadvantageous to the training process and undesirable; a smaller sum of edge weights indicates that the nodes are less similar, which is advantageous to the training process and desirable.
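The sketch below illustrates the construction and MST pruning of the per-time-step experience graph. Because formula (1) is not reproduced legibly in this text, a simple stand-in similarity (the reciprocal of one plus the summed Euclidean distances, with normalization omitted) is used in its place, and networkx is assumed as the graph library.

# Sketch: build the experience graph of one time step and prune it with a maximum spanning tree.
import numpy as np
import networkx as nx

def similarity(e_i, e_j):
    """e = (s, a, r, s_next), each a 1-D numpy array (the reward as a length-1 array).
    Stand-in for formula (1): higher value means more similar experiences."""
    d = sum(np.linalg.norm(x - y) for x, y in zip(e_i, e_j))
    return 1.0 / (1.0 + d)

def experience_association_graph(experiences):
    """experiences: list of (s, a, r, s_next) tuples sampled at a single time step."""
    G = nx.Graph()
    G.add_nodes_from(range(len(experiences)))
    for i in range(len(experiences)):
        for j in range(i + 1, len(experiences)):
            G.add_edge(i, j, weight=similarity(experiences[i], experiences[j]))
    # MST pruning: keep a loop-free subgraph over all nodes that maximizes the total edge weight.
    return nx.maximum_spanning_tree(G, weight="weight")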
(2) In the time dimension, the embodiment constructs a dynamic network G = {G_0, G_1, ..., G_T}, where G_i is the experience association graph at time step i constructed by the MST algorithm described above, as shown in FIG. 3; the similarity S(·,·) between nodes within G_i is calculated by formula (1).
By analyzing the topology at each time step contained in the dynamic graph model, the user can quantify the degree of information overlap between the experience data. Theoretically, the less similar the experiences chosen within a single time step (i.e., the smaller the sum of edge weights in the graph), the more likely the agent is to learn as much knowledge as possible. The present embodiment carries out the topology analysis of the graph from two aspects:
and (5) degree distribution. The degree refers to the number of neighboring nodes for each node in the graph. The degree distribution refers to the total number of nodes corresponding to the value of each possible degree in the graph.
Community structure. Communities are sets formed by densely connected nodes in a graph; nodes within a community are more tightly connected to each other than to nodes in other communities. To obtain the community structure of the graph, this embodiment selects the Louvain algorithm, which has the advantages of efficiency, scalability, and accuracy; its iterative optimization objective is to maximize the modularity of the graph. As shown in FIG. 4(b), after community detection, the graph G_i in FIG. 4(a) is divided into 3 different communities.
In addition, the present embodiment analyzes the evolution of communities in the dynamic network. The lifecycle of a dynamic community may be represented by a directed acyclic graph (Directed Acyclic Graph, DAG for short); as shown in FIG. 5, a community may experience birth, division, growth, decay, and death. By exploring the evolution pattern of the dynamic graph structure in units of communities, the different states of experience relevance within a certain time period can be revealed from a higher level, helping the user judge whether the agent repeatedly selects similar experience sets in different time steps, which can trap training in a local loop. For communities C_1 and C_2 at different time steps, the community correlation (degree of overlap) r(C_1, C_2) is calculated as shown in formula (2).
where r(C_i, C_j) denotes the correlation between community C_i and community C_j, and S(m_i, m_j) denotes the similarity between node m_i in community C_i and node m_j in community C_j.
In short, in the time dimension, if the number of overlapping community nodes in adjacent time steps exceeds a preset number, little new experience data has been added, which is disadvantageous to the training process and undesirable; if it does not exceed the preset number, more new experience data has been added, which is advantageous to the training process and desirable.
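A sketch of the time-dimension analysis is given below; networkx 2.8 or later is assumed for louvain_communities, and since formula (2) is not reproduced legibly in this text, the mean pairwise node similarity serves as a stand-in for the community correlation.

# Sketch: community division per time step and a stand-in correlation between communities
# of adjacent time steps.
import itertools
import numpy as np
from networkx.algorithms.community import louvain_communities

def communities_per_step(dynamic_network):
    """dynamic_network: list of experience association graphs G_0..G_T (networkx graphs)."""
    return [louvain_communities(G, weight="weight", seed=0) for G in dynamic_network]

def community_correlation(C_a, C_b, sim):
    """C_a, C_b: sets of experiences; sim: similarity function between two experiences."""
    return float(np.mean([sim(m_i, m_j) for m_i, m_j in itertools.product(C_a, C_b)]))

def repeated_selection(C_prev, C_next, sim, threshold=0.8):
    """Flag adjacent-step communities whose correlation exceeds a preset threshold, i.e.
    similar experience data are likely being selected repeatedly."""
    return community_correlation(C_prev, C_next, sim) > threshold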
(III) Model training optimization recommendation (if the evaluation result of the reinforcement learning model training process is poor, the training process is optimized according to the recommendations made by this embodiment)
The present embodiment focuses mainly on the relationship between experience diversity and the model prediction effect, where the former is mainly affected by the number of replay buffers, the number of sampled experiences, and the sampling method. The sampling methods include random sampling and prioritized experience replay. Accordingly, the recommendation results contain combinations of different values of these influencing factors, each combination being measured by R². The higher R² is, the better the prediction effect of the model under the current combination. R² is defined as follows:
R² = 1 - Σ_{i=1}^{n} (y_i - ŷ_i)² / Σ_{i=1}^{n} (y_i - ȳ)²
where n denotes the number of samples, y_i denotes the true value of the i-th sample, ŷ_i denotes the predicted value of the i-th sample, and ȳ denotes the mean of the true values. The present example samples each combination using random sampling and prioritized experience replay respectively, performs 20 repeated experiments for each, and then selects the largest R² as the final evaluation result of that combination.
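A sketch of this evaluation loop follows; train_and_predict is a hypothetical helper standing in for one training-and-prediction run of the model under a given parameter combination, and is not an interface defined by this embodiment.

# Sketch: score parameter combinations by R^2 and keep the best run for each combination.
import itertools
import numpy as np

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def recommend(buffer_sizes, batch_sizes, samplers, train_and_predict, y_true, repeats=20):
    """Return {(buffer, batch, sampler): best R^2 over `repeats` runs}."""
    results = {}
    for buf, batch, sampler in itertools.product(buffer_sizes, batch_sizes, samplers):
        scores = [r2_score(y_true, train_and_predict(buf, batch, sampler))
                  for _ in range(repeats)]
        results[(buf, batch, sampler)] = max(scores)      # keep the largest R^2
    return results

The dictionary returned by such a routine can then be fed to the recommendation view described below.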
The DDPG visual analysis system provided by the embodiment comprises four modules, namely a statistical information view, a round view, an experience exploration view and a recommendation view, as shown in fig. 6-9.
The statistical information view (FIG. 6) uses a line graph, a stacked graph, and a box plot to present statistical information about model training under different sampling modes at the selected times. The line graphs present the total reward value of each round and the total TD error value of all sampled experiences, respectively. The stacked graph mainly shows the action information in different rounds. The box plot shows, for each round, the ratio of the total TD error of the experiences sampled at each step to the TD error of all experiences in the experience pool. Because the display area is limited, showing all rounds gives an extremely poor visual effect, so this embodiment randomly samples a subset of rounds from all rounds for display; through multiple experiments, displaying 20 rounds was found to give a good visual effect while preserving the original data structure. The abscissa is the sampling index, and when the user hovers the mouse over a box, the system displays the original round number. By analyzing these views, the user can understand the performance of the model at different stages, as well as its learning speed and stability, and can use this information to adjust the hyper-parameters or other related parameters of the model to optimize its performance.
The round view (FIG. 7) mainly comprises a line graph, a stacked histogram, and a scatter plot, and is used to display the action information of each time step in the round, the state feature importance, and the distribution pattern of the four-element data points in the two-dimensional space, so that the model training process can be better understood. The line graph shows the actual energy consumption value, the predicted energy consumption value, and the trend of the difference between them, so the user can intuitively assess the prediction capability and accuracy of the model. The histogram mainly shows the total TD error value of the batch of experiences sampled at each step in the round; it helps identify outliers in the data, such as TD errors that are too large or too small, which may affect the training and performance of the model. Meanwhile, to help the user intuitively grasp the trend of the values, this embodiment provides a function for converting it into a line graph. The grouped histogram shows the distribution of state feature importance scores, where the abscissa is the state feature and the ordinate is the importance score with respect to the reward and the predicted energy consumption. By comparing the importance of different features for the reward value and the predicted energy consumption, the user can determine which features are most critical to model performance, which provides an important basis for feature selection and helps optimize the model. The scatter plot shows the result of mapping the four-element (current state, action, reward, next state) data points of all steps in the round to the two-dimensional space after t-SNE dimensionality reduction, which makes it easier to observe the distribution pattern among the data. When the user clicks on a particular step, the corresponding point is highlighted in the scatter plot, helping the user see the specific location of that step's data point within the whole scatter plot and compare it with other data points.
The experience exploration view (FIG. 8) consists of several parts. It includes a force-directed layout of the experience similarity graph, which visualizes the relationships between experiences using a physics-inspired approach. In addition, the histogram of node degree distribution profiles the node connectivity and structural features of the graph. Furthermore, the Sankey diagram layout, based on the dynamic community detection method, reveals how experiences group into different communities over time. Together these visual layouts provide a comprehensive view of the experience space and allow more in-depth analysis of the underlying data. In the force-directed layout, each node is treated as a charged particle and each edge as a spring; the interactions between nodes and edges form a mechanical system in which nodes and edges move under the forces to produce a reasonable layout. The larger the TD error of the experience represented by a node, the larger the radius of the node. Edge lengths are mapped from edge weights: the larger the weight, the shorter the edge. The color of a node corresponds to the community it belongs to in the community detection result obtained by this method. In the degree-distribution histogram, the abscissa is the degree value and the ordinate is the number of nodes with that degree. The node degree distribution reflects the similarity and connection characteristics between experiences: a high degree may indicate repeated experiences, or high similarity between experiences and thus low experience diversity. The Sankey diagram layout presents the results of the dynamic network community detection built into the method. This embodiment represents communities with rectangles, and the connecting edges represent the correlation between communities. Initially, all connecting edges were retained, but the visual effect was not ideal and made it difficult to analyze the relationships between communities. To improve this, this embodiment keeps only the connecting edges between each community and the most similar communities in its neighboring steps, and displays the connecting edges whose value is greater than k, while the remaining connecting edges are colorless. Through experimental comparison, setting k to 0.4 effectively reflects the degree of similarity between groups while ensuring a good visual effect and reducing the number of displayed edges. The experience exploration view provides an intuitive understanding of the community structure and its degree of association at different time steps of DDPG model training. It also allows rapid identification of important links and changes between time steps, and detection of whether significant structural evolution occurs during training. When the user clicks on a rectangle in the Sankey diagram, the histogram below shows detailed data of the experiences in the community represented by that rectangle, where the abscissa is the experience feature vector and the ordinate is the value of the feature vector.
The recommendation view (FIG. 9) shows the model training effect for different parameter combinations. The parameter combinations of the heatmap are the number of replay buffers (horizontal axis) and the number of sampled batch experiences (vertical axis). The color of each rectangle is mapped according to R²: the darker the color, the higher R² and the better the prediction effect. The user can determine the combination of parameter values that yields a better model effect by comparing the colors.
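As a non-limiting illustration of how the recommendation view's heatmap could be rendered (matplotlib is assumed as the plotting backend, which the embodiment does not name, and the parameter grid is assumed to be complete):

# Sketch: render the recommendation view's R^2 heatmap from the results of the evaluation loop.
import numpy as np
import matplotlib.pyplot as plt

def plot_recommendation(results, buffer_sizes, batch_sizes):
    """results: {(buffer, batch, sampler): R^2}; one cell per (buffer, batch) pair,
    taking the best R^2 over the sampling methods."""
    grid = np.array([[max(v for (b, n, _), v in results.items() if b == buf and n == batch)
                      for buf in buffer_sizes] for batch in batch_sizes])
    fig, ax = plt.subplots()
    im = ax.imshow(grid, cmap="Blues")                    # darker colour = higher R^2 = better
    ax.set_xticks(range(len(buffer_sizes)))
    ax.set_xticklabels(buffer_sizes)
    ax.set_yticks(range(len(batch_sizes)))
    ax.set_yticklabels(batch_sizes)
    ax.set_xlabel("number of replay buffers")
    ax.set_ylabel("number of sampled batch experiences")
    fig.colorbar(im, ax=ax, label="R^2")
    plt.show()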
Example two
The present embodiment provides a visual analysis system for reinforcement learning model, including:
aggregation and mapping module: the method comprises the steps of acquiring state data for reinforcement learning model training, aggregating the state data, and performing space mapping on the aggregated data to obtain a two-dimensional space structure diagram, wherein the model training process comprises a plurality of rounds, each round comprises a plurality of time steps, and each time step generates corresponding state data and experience data;
similarity analysis module: for selecting a number of past empirical data for a time step in each round and performing a similarity analysis on the empirical data, comprising: in the spatial dimension, analyzing similarity in several empirical data in a single time step; in the time dimension, analyzing the similarity of experience data between adjacent time steps;
and an evaluation module: the method comprises the steps of realizing evaluation of training quality of the reinforcement learning model according to analysis results of similarity of experience data, and if the training quality of the reinforcement learning model meets an expected value, continuing training the reinforcement learning model according to a current strategy; if the training quality of the reinforcement learning model does not accord with the expected value, optimizing and recommending the reinforcement learning model training;
visual analysis module: and the method is used for respectively and visually displaying the result of recommending the training optimization of the reinforcement learning model and the two-dimensional space structure diagram.
Example III
The present embodiment provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the visual analysis method of the reinforcement learning model of embodiment one when the computer program is executed.
Example IV
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the visual analysis method of the reinforcement learning model of the embodiment.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiment of the application can be realized by adopting various computer languages, such as object-oriented programming language Java, an transliteration script language JavaScript and the like.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It is apparent that the above examples are given by way of illustration only and are not limiting of the embodiments. Other variations and modifications of the present application will be apparent to those of ordinary skill in the art in light of the foregoing description. It is not necessary here nor is it exhaustive of all embodiments. And obvious variations or modifications thereof are contemplated as falling within the scope of the present application.

Claims (10)

1. A visual analysis method of a reinforcement learning model, characterized by: comprising the following steps:
acquiring state data trained by a reinforcement learning model, aggregating the state data, and performing space mapping on the aggregated data to obtain a two-dimensional space structure diagram, wherein the model training process comprises a plurality of rounds, each round comprises a plurality of time steps, and each time step generates corresponding state data and experience data;
selecting a plurality of past experience data for the time steps in each round and performing similarity analysis on the experience data, wherein the method comprises the following steps: in the spatial dimension, analyzing similarity in several empirical data in a single time step; in the time dimension, analyzing the similarity of experience data between adjacent time steps;
according to the analysis result of the similarity of the experience data, the training quality of the reinforcement learning model is evaluated, and if the training quality of the reinforcement learning model meets the expected value, the reinforcement learning model is continuously trained according to the current strategy; if the training quality of the reinforcement learning model does not accord with the expected value, optimizing and recommending the reinforcement learning model training;
and respectively visually displaying the result of recommending the training optimization of the reinforcement learning model and the two-dimensional space structure diagram.
2. The visual analysis method of reinforcement learning model according to claim 1, characterized in that: the method for obtaining the state data trained by the reinforcement learning model, aggregating the state data, and performing space mapping on the aggregated data to obtain a two-dimensional space structure diagram comprises the following steps:
acquiring state data trained by the reinforcement learning model;
aggregating state data corresponding to time steps in each round in a time dimension;
carrying out feature importance analysis on the aggregated state data based on a random forest regression model to screen out preset number of features;
and mapping the data subjected to feature screening from the high-dimensional feature space to the two-dimensional visual space based on the t-SNE algorithm to obtain a two-dimensional space structure diagram.
3. The visual analysis method of reinforcement learning model according to claim 2, characterized in that: the aggregation is carried out on the state data corresponding to the time steps in each round in the time dimension, and the method comprises the following steps:
detecting abnormal points in the time steps by using a k-nearest neighbor algorithm in each round to locate the time steps in which the rewards suddenly change, wherein the rewards are generated when an agent performs actions in the training of the reinforcement learning model;
and clustering the remaining time steps, other than the time steps in which the reward suddenly changes, using a k-Means algorithm, and integrating the time steps clustered by the k-Means algorithm with the time steps in which the reward suddenly changes to complete the aggregation of the state data.
4. The visual analysis method of reinforcement learning model according to claim 2, characterized in that: the method for screening out the preset number of features based on the random forest regression model performs feature importance analysis on the aggregated state data, and comprises the following steps:
for round i, the corresponding aggregated state dataset is S_i = {s_i1, s_i2, s_i3, ..., s_iT}, where s_ij denotes the state corresponding to the j-th time step in round i, with 1 ≤ j ≤ T;
s_ij is represented by an h-dimensional feature vector;
a random forest regression model is fitted on the state dataset S_i, importance scores are computed and ranked according to each feature's contribution to the model prediction, and, according to the scoring and ranking results, the preset number of features are screened out of the h-dimensional feature vector corresponding to s_ij.
5. The visual analysis method of reinforcement learning model according to claim 1, characterized in that: the method for analyzing the similarity in a plurality of experience data in a single time step in the space dimension comprises the following steps:
in the spatial dimension, for a single time step, a plurality of past experience data are sampled and a node graph is constructed from the sampled experience data, wherein each experience corresponds to one node in the node graph, nodes n_i and n_j in the node graph are connected by an edge e_ij, and the weight w(e_ij) of edge e_ij is the similarity value between nodes n_i and n_j;
the edges e_ij between nodes in the node graph are pruned by a maximum spanning tree algorithm to obtain an experience association graph, thereby reducing the edge density of the node graph.
6. The visual analysis method of reinforcement learning model according to claim 5, characterized in that: the method for analyzing the similarity of the experience data between the adjacent time steps in the time dimension comprises the following steps:
in the time dimension, a dynamic network G = {G_0, G_1, ..., G_T} is constructed, where G_i is the experience association graph at time step i constructed by the maximum spanning tree algorithm;
each experience association graph G_i in the dynamic network G = {G_0, G_1, ..., G_T} is partitioned into communities by the Louvain algorithm, where a community is a set of nodes in the experience association graph G_i;
and judging whether similar experience data are repeatedly selected in the adjacent time steps according to the correlation judgment of communities in the adjacent time steps.
7. The visual analysis method of reinforcement learning model according to claim 5, characterized in that: the formula of the similarity value between nodes n_i and n_j is:
where d(s_i, s_j) denotes the normalized Euclidean distance between current states s_i and s_j, d(s_i', s_j') denotes the normalized Euclidean distance between next states s_i' and s_j', d(a_i, a_j) denotes the normalized Euclidean distance between actions a_i and a_j, and d(r_i, r_j) denotes the normalized Euclidean distance between rewards r_i and r_j.
8. The visual analysis method of reinforcement learning model according to claim 6, characterized in that: and judging the correlation according to communities in adjacent time steps, wherein the formula is as follows:
where r(C_i, C_j) denotes the correlation between community C_i and community C_j, and S(m_i, m_j) denotes the similarity between node m_i in community C_i and node m_j in community C_j.
9. The visual analysis method of reinforcement learning model according to claim 1, characterized in that: if the training quality of the reinforcement learning model does not accord with the expected value, the reinforcement learning model training is optimized and recommended, and the method comprises the following steps:
if the training quality of the reinforcement learning model does not accord with the expected value, optimizing and recommending the reinforcement learning model training according to the relation between experience diversity and model prediction effect, wherein the experience diversity is influenced by three factors of the number of replay buffers, the number of sampling experiences and a method for sampling experiences, and the replay buffers are used for storing experience data;
the optimization recommendation comprises combinations of different values of the three factors, each combination being measured by R²; the higher R² is, the better the prediction effect of the model under the current combination; the formula of R² is:
R² = 1 - Σ_{i=1}^{n} (y_i - ŷ_i)² / Σ_{i=1}^{n} (y_i - ȳ)²
where n denotes the number of samples, y_i denotes the true value of the i-th sample, ŷ_i denotes the predicted value of the i-th sample, and ȳ denotes the mean of the true values.
10. A visual analysis system for reinforcement learning models, characterized by: comprising the following steps:
aggregation and mapping module: the method comprises the steps of acquiring state data for reinforcement learning model training, aggregating the state data, and performing space mapping on the aggregated data to obtain a two-dimensional space structure diagram, wherein the model training process comprises a plurality of rounds, each round comprises a plurality of time steps, and each time step generates corresponding state data and experience data;
similarity analysis module: for selecting a number of past empirical data for a time step in each round and performing a similarity analysis on the empirical data, comprising: in the spatial dimension, analyzing similarity in several empirical data in a single time step; in the time dimension, analyzing the similarity of experience data between adjacent time steps;
and an evaluation module: the method comprises the steps of realizing evaluation of training quality of the reinforcement learning model according to analysis results of similarity of experience data, and if the training quality of the reinforcement learning model meets an expected value, continuing training the reinforcement learning model according to a current strategy; if the training quality of the reinforcement learning model does not accord with the expected value, optimizing and recommending the reinforcement learning model training;
visual analysis module: and the method is used for respectively and visually displaying the result of recommending the training optimization of the reinforcement learning model and the two-dimensional space structure diagram.
CN202311011010.0A 2023-08-11 2023-08-11 Visual analysis method and system for reinforcement learning model Pending CN117077034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311011010.0A CN117077034A (en) 2023-08-11 2023-08-11 Visual analysis method and system for reinforcement learning model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311011010.0A CN117077034A (en) 2023-08-11 2023-08-11 Visual analysis method and system for reinforcement learning model

Publications (1)

Publication Number Publication Date
CN117077034A true CN117077034A (en) 2023-11-17

Family

ID=88707210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311011010.0A Pending CN117077034A (en) 2023-08-11 2023-08-11 Visual analysis method and system for reinforcement learning model

Country Status (1)

Country Link
CN (1) CN117077034A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination