CN114358247A - Intelligent agent behavior interpretation method based on causal relationship inference - Google Patents


Info

Publication number
CN114358247A
CN114358247A
Authority
CN
China
Prior art keywords
behavior
agent
intelligent agent
data
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111625582.9A
Other languages
Chinese (zh)
Inventor
王汉
朴海音
陈永红
陶晓洋
于津
郝一行
彭宣淇
韩玥
杨晟琦
叶超
樊松源
孙阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Aircraft Design Institute Yangzhou Collaborative Innovation Research Institute Co ltd
Original Assignee
Shenyang Aircraft Design Institute Yangzhou Collaborative Innovation Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Aircraft Design Institute Yangzhou Collaborative Innovation Research Institute Co ltd filed Critical Shenyang Aircraft Design Institute Yangzhou Collaborative Innovation Research Institute Co ltd
Priority to CN202111625582.9A priority Critical patent/CN114358247A/en
Publication of CN114358247A publication Critical patent/CN114358247A/en
Pending legal-status Critical Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an agent behavior interpretation method based on causal relationship inference, belonging to the technical fields of decision support and causal inference. Training data are collected from an agent trained by reinforcement learning; the data comprise the environmental states, the actions taken, and the reward information produced during training. The data are trained off-line through causal relationship discovery and data regression fitting, yielding a reasonable behavior causal relationship model; this model is then used to explain the agent's behavior on line. The invention achieves a good behavior interpretation effect.

Description

Intelligent agent behavior interpretation method based on causal relationship inference
Technical Field
The invention belongs to the technical fields of decision support and causal inference, and particularly relates to a behavior interpretation method for reinforcement-learning agents.
Background
In recent years, the application value of unmanned aerial vehicles (UAVs) has been highlighted in several local conflicts, attracting the attention of more and more researchers. A UAV's air-combat game decisions directly reflect its level of intelligence and influence its combat capability when confronting manned or unmanned aircraft. At present, the mainstream approach to the UAV air-combat decision problem is reinforcement learning, for example policy-gradient and actor-critic methods. Reinforcement learning requires neither a huge training data set nor extensive prior knowledge: the agent can learn from scratch, continuously improving through interaction with the environment, adjusting its strategy, and selecting the optimal behavior. However, deep reinforcement learning lacks interpretability, so its results are difficult to exploit fully in actual combat. The agent obtained by reinforcement learning behaves as a black-box model, which creates obstacles to human trust in its behavior. Analyzing the agent's behavior model with its training data is therefore an important means of explanation: by modeling the behavior, humans can predict the agent's next action, explain why it performs a certain behavior, and verify that the action taken is optimal in the current environmental state.
Therefore, explaining the behavior of the agent has important guiding significance for humans to trust the agent and for optimizing the way it is trained.
Current agent behavior interpretation basically relies either on the interpretability of the model itself or on correlations in the training data. For example, the behavior logic of a rule-based agent is written entirely by humans: in any specific situation it is clear what decision the agent will take next and what behavior it will perform, with no ambiguous choice. Such agent models are highly interpretable, but they depend excessively on hand-written behavior logic, so machine intelligence is hard to embody and they perform poorly on complex tasks. Another way to generate agents is through massive data and neural-network training; here the basis of the agent's behavior can be interpreted by finding correlations between certain behaviors and certain quantities in the training data. Although this exploits the machine's superiority in processing large amounts of data and the generalization ability of the resulting model, such interpretation has a significant drawback: explanations obtained from mere correlation between data need not conform to human causal logic.
Disclosure of Invention
The object of the present invention is to solve or alleviate the above drawback, namely that the behavior model of an agent is difficult to interpret. To this end, the invention provides an agent behavior interpretation method based on causal relationship inference. The causal behavior structure chart obtained by the method can predict the agent's next action and explain the basis on which the agent performs a certain behavior, which is of great significance for improving the interpretability of agent behavior and for subsequently optimizing the agent's training.
The technical scheme of the invention is as follows:
An agent behavior interpretation method based on causal relationship inference proceeds as follows. First, training data are collected from an agent trained by reinforcement learning; the data mainly comprise the environmental states, the actions taken and the reward information during training, and a certain proportion of the aircraft-agent training data is selected as the data set. Then, causal relationships among the data are discovered from the relations within the data set, combined with prior knowledge, and regression fitting of the training data generates a reasonable behavior causal structure model. Finally, real-time observation data are input into the behavior causal structure model to predict the actions the agent is likely to take and to explain its behavior. The method comprises the following steps:
a) performing off-line acquisition on sample data in the process of the reinforcement learning training agent;
Sample data are obtained during the interaction between the agent and the environment while the agent is trained by reinforcement learning. The sample data consist of three parts: the state of the environment, the actions of the agent, and the rewards earned by the agent. The whole time sequence of one task exploration by the reinforcement-learning agent is taken as one group of sample data. The environmental state, the agent's actions and the rewards earned are closely tied to the reinforcement-learning training process; the collected data are shown in Table 1.
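The off-line collection in step a) can be sketched as logging one episode as a sequence of (state, action, reward) triples. The environment and policy interfaces below (`ToyEnv`, `collect_episode`) are illustrative assumptions, not part of the patent text:

```python
# Sketch of step a): log one reinforcement-learning episode as a group
# of (state, action, reward) samples.
import random

class ToyEnv:
    """Minimal stand-in environment with a 2-dimensional state."""
    def reset(self):
        self.t = 0
        return [0.0, 0.0]

    def step(self, action):
        self.t += 1
        state = [random.random(), random.random()]   # next environment state
        reward = 1.0 if action == 1 else 0.0          # reward signal
        done = self.t >= 5                            # episode ends after 5 steps
        return state, reward, done

def collect_episode(env, policy):
    """Return one episode as a list of (state, action, reward) samples."""
    samples = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        samples.append((state, action, reward))
        state = next_state
    return samples

random.seed(0)
episode = collect_episode(ToyEnv(), policy=lambda s: 1)
print(len(episode))   # 5 samples in this toy episode
```

Each such episode corresponds to one row group of Table 1.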
b) Combining causality among data and human experience to obtain a behavioral cause-effect structure chart;
the construction process of the behavioral cause and effect structure chart comprises the following steps:
and performing behavior interpretation modeling on the intelligent agent for reinforcement learning training in a mode of constructing a behavior cause and effect structure diagram. The causality analysis is carried out on data generated in the process of training the intelligent agent, and a reasonable behavior cause-effect structure diagram can be constructed by combining human experience knowledge, which is the basis of behavior interpretation modeling. The goal is to express the relationship among the environment state, the action and the reward, and to discover the relationship among the data causally by considering the time sequence of the data generated by reinforcement learning.
(1) Discovering causality among data
There are two ways to discover causality among data:
One is based on independence testing. The sample correlation coefficient of two variables X and Y is

r_XY = Σ_i (X_i − X̄)(Y_i − Ȳ) / sqrt( Σ_i (X_i − X̄)² · Σ_i (Y_i − Ȳ)² )

where X_i and Y_i denote the observed values and X̄ and Ȳ the sample means. Independence is judged from this coefficient, and causality is then verified on the basis of independence in combination with the Markov assumption.
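The independence-test route above can be sketched as follows; the decision threshold is an illustrative assumption, since the patent only states that independence is judged from the coefficient:

```python
# Sketch of the first discovery route: compute the sample correlation
# coefficient r_XY of a variable pair and treat a small |r_XY| as
# (approximate) independence.
import math

def sample_correlation(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def looks_independent(xs, ys, threshold=0.1):
    """Hypothetical independence decision; threshold is an assumption."""
    return abs(sample_correlation(xs, ys)) < threshold

x = [1.0, 2.0, 3.0, 4.0]
print(round(sample_correlation(x, [2.0, 4.0, 6.0, 8.0]), 3))   # 1.0: fully dependent
```

Pairs judged dependent become candidate edges, whose direction is then settled with the Markov assumption and prior knowledge.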
The other is a method that adds noise to a model: Y = f(X, E), with X independent of the noise E. This family comprises the linear model Y = a·X + E and the nonlinear model Y = f2(f1(X) + E), where X and Y denote variables, a denotes a weight parameter, f1 and f2 denote function equations, and E denotes the data-noise model.
Both methods were verified. The first scheme is simple to implement but not very accurate on complex problems; the accuracy of the second is affected by several hyper-parameters and it is more complex to implement, but it performs excellently on complex reasoning problems.
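The two noise-based model families named above can be illustrated as data-generating processes. The concrete choices of f1, f2 and the noise distribution below are assumptions for illustration; in practice, causal direction is decided by testing whether the regression residuals are independent of the presumed cause:

```python
# Sketch of the two model families: the linear model Y = a*X + E and
# the nonlinear (post-nonlinear) model Y = f2(f1(X) + E).
import math
import random

random.seed(1)

def linear_anm(x, a=2.0, noise_scale=0.1):
    """Linear additive-noise mechanism Y = a*X + E."""
    e = random.gauss(0.0, noise_scale)
    return a * x + e

def post_nonlinear(x, noise_scale=0.1):
    """Nonlinear mechanism Y = f2(f1(X) + E) with illustrative f1, f2."""
    f1 = lambda v: v ** 2    # inner mechanism f1 (assumption)
    f2 = math.tanh           # outer distortion f2 (assumption)
    e = random.gauss(0.0, noise_scale)
    return f2(f1(x) + e)

xs = [i / 10 for i in range(10)]
ys_lin = [linear_anm(x) for x in xs]
ys_pnl = [post_nonlinear(x) for x in xs]
print(len(ys_lin), len(ys_pnl))   # 10 10
```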
(2) A suitable data causal model is selected according to the agent's air task, and a reasonable behavior cause-and-effect structure chart is constructed on the basis of causality discovery among the data, combined with human understanding and analysis of the agent's task, i.e. prior knowledge. The behavioral cause-and-effect structure graph is composed of nodes and directed edges connecting the nodes. The nodes represent random variables, the directed edges between nodes represent the relations among them, and conditional probability represents the strength of each relation. The random variables include the state of the environment, the actions of the agent and the rewards earned by the agent.
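The structure chart of step (2) can be represented as a weighted directed graph. The node names and edge weights below are hypothetical placeholders, not values from the patent:

```python
# Sketch of the behavior cause-and-effect structure chart: nodes are
# random variables (state, action, reward), weighted directed edges
# encode the strength of each relation.
from collections import defaultdict

class CausalGraph:
    def __init__(self):
        self.edges = defaultdict(dict)   # parent -> {child: weight}

    def add_edge(self, parent, child, weight):
        self.edges[parent][child] = weight

    def parents(self, node):
        """All nodes with a directed edge into `node`."""
        return [p for p, children in self.edges.items() if node in children]

g = CausalGraph()
g.add_edge("distance", "angle_of_attack", 0.7)       # hypothetical weights
g.add_edge("angle_of_attack", "pull_up_action", 0.9)
g.add_edge("pull_up_action", "reward", 0.5)
print(g.parents("pull_up_action"))   # ['angle_of_attack']
```

The weights here stand in for the relation strengths that step c) learns with the multilayer perceptron.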
c) Method for constructing cause and effect structure chart model by using behavior cause and effect structure chart
The collected sample data are input into a multilayer perceptron neural network for off-line training, which learns the transfer matrix of the nodes of the cause-and-effect structure graph model and yields the relation weights between nodes, the weights representing the strength of the relations. The obtained weights are added to the behavior cause-and-effect structure chart obtained in step b); the agent's next action is predicted from its current state and the parameterized behavior cause-and-effect structure chart, and the predicted result is compared with the agent's actual result to explain the agent behavior model.
The behavior cause-and-effect structure diagram supports qualitative analysis of the agent's actions, but qualitative analysis alone is far from enough to predict the agent's next action accurately; therefore a regression model of the quantitative relations between the variables is trained on the existing data.
The multi-layer perceptron neural network uses the ReLU function as its activation function; the ReLU (rectified linear unit) provides a very simple nonlinear transformation. Given an element x, the function is defined as ReLU(x) = max(x, 0).
More than one hidden layer is introduced into the multi-layer perceptron on the basis of a single-layer neural network; the hidden layer lies between the input layer and the output layer. Without an activation function, the network model of the multilayer perceptron is

O = (X·W_h + b_h)·W_o + b_o = X·W_h·W_o + b_h·W_o + b_o

where
O ∈ R^{N×q} is the output of the network;
X ∈ R^{N×D} is the input, with N samples and D features;
W_h ∈ R^{D×H} is the hidden-layer weight matrix, with H hidden units;
b_h ∈ R^{1×H} is the hidden-layer bias;
W_o ∈ R^{H×q} is the output-layer weight matrix;
b_o ∈ R^{1×q} is the output-layer bias.

Since the composition of two affine maps collapses to a single affine map, the ReLU activation above is applied to the hidden-layer output to obtain a nonlinear model.
FIG. 2 shows the neural network diagram of the multi-layer perceptron. As can be seen from the model diagram, the neurons in the hidden layer are fully connected to the inputs in the input layer, and the neurons in the output layer are likewise fully connected to the neurons in the hidden layer. Therefore, both the hidden layer and the output layer of the multi-layer perceptron neural network are fully connected layers.
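The fully connected network described above can be sketched as a forward pass in numpy; the layer sizes are illustrative assumptions:

```python
# Minimal numpy sketch of the one-hidden-layer perceptron: an affine
# map, a ReLU activation, then a second affine map.
import numpy as np

rng = np.random.default_rng(0)
N, D, H, Q = 4, 3, 5, 2            # samples, features, hidden units, outputs

X  = rng.normal(size=(N, D))
Wh = rng.normal(size=(D, H)); bh = np.zeros(H)   # hidden layer parameters
Wo = rng.normal(size=(H, Q)); bo = np.zeros(Q)   # output layer parameters

def relu(x):
    return np.maximum(x, 0)        # ReLU(x) = max(x, 0)

def forward(X):
    hidden = relu(X @ Wh + bh)     # fully connected hidden layer
    return hidden @ Wo + bo        # fully connected output layer

print(forward(X).shape)            # (4, 2)
```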
Further, 70% of the sample data are taken as the training set and 30% as the test set. Training uses the minibatch method with a batch size of 64; after all sample data have been trained through once, training stops and the final weight matrices and bias terms of the forward propagation are output. The overall training process is shown in the training portion of FIG. 1.
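The 70%/30% split and minibatching of size 64 described in this paragraph can be sketched as follows, here on synthetic sample indices:

```python
# Sketch of the training setup: shuffle, split 70/30, then iterate once
# over the training set in minibatches of 64 (one epoch).
import random

random.seed(0)
samples = list(range(1000))            # stand-in sample indices
random.shuffle(samples)

split = int(0.7 * len(samples))
train, test = samples[:split], samples[split:]

batch_size = 64
batches = [train[i:i + batch_size] for i in range(0, len(train), batch_size)]

print(len(train), len(test))           # 700 300
print(len(batches), len(batches[-1]))  # 11 batches, last one partial (60)
```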
d) And reasonably explaining the behavior of the agent by using the causal structure diagram model when the agent performs tasks.
Prediction and interpretation of the agent's actions: the trained model can be used for behavior interpretation of the agent. The environmental state at each moment of the reinforcement-learning task is substituted into the forward propagation together with the trained weight matrices and bias terms. For each action, the output is a two-dimensional vector giving the probability that the action occurs and the probability that it does not occur at the current moment; the larger of the two decides whether that action is executed. This is done once for every action, and the actions most likely to occur are found.
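The per-action decision rule of step d) can be sketched as below; the probability values are hand-written stand-ins for a real forward pass, and the action names are hypothetical:

```python
# Sketch of step d): each action gets a two-dimensional output
# [p(occur), p(not occur)]; the action is predicted to execute when the
# first entry is larger, and the most probable action is reported.
def predict_actions(action_probs):
    executed = {a: p[0] > p[1] for a, p in action_probs.items()}
    most_likely = max(action_probs, key=lambda a: action_probs[a][0])
    return executed, most_likely

probs = {
    "pull_angle_of_attack": (0.8, 0.2),   # hypothetical network outputs
    "level_turn":           (0.3, 0.7),
}
executed, best = predict_actions(probs)
print(best)   # pull_angle_of_attack
```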
In the above process, steps a), b) and c) are off-line and aim to obtain a trained prediction model; step d) is the on-line application of the model and aims to obtain a reasonable explanation of the agent's behavior.
By collecting data from the aircraft agent in its training environment, the invention can analyze the agent's behavior pattern and complete the explanation of its behavior. Data from the agent's reinforcement-learning training are used for off-line training, and the resulting prediction model is used for on-line prediction. Whether there is causality between the agent's behavior, the changes of the environmental state and the rewards it obtains can be discovered from the relations among the data combined with prior knowledge, and these causal relations can be discovered reasonably with the help of expert knowledge. On the basis of the causal graph, data fitting allows the agent's next action to be predicted more faithfully, and the agent's behavior decisions can be interpreted accordingly. The agent behavior causal model obtained from the training data describes the agent's behavioral characteristics intuitively and simply, and becomes a means of optimizing the agent's intelligence.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a diagram of a neural network of a multi-layered perceptron.
FIG. 3 is a behavior cause-and-effect structure diagram of the aircraft agent.
FIG. 4 is a flow chart of the method of the present invention.
Detailed Description
The technical solution of the present invention is further illustrated by the accompanying drawings and examples.
Causal structure learning is performed on the data retained from the aircraft agent's training process. The main data comprise, for the red and blue sides respectively: current time (time), relative distance (distance), closure rate (closure), current heading angle (psi_V), altitude (h), speed (velocity), climb rate (h_dot), angle of attack (alpha), sideslip angle (beta), overload (n_load), blood volume (blood) and remaining fuel (oil).
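The per-side telemetry listed above can be sketched as one sample record; the numeric values below are illustrative placeholders, not data from the patent:

```python
# One telemetry record for one side, flattened into a fixed-order
# feature vector for causal-structure learning.
FEATURES = ["time", "distance", "closure", "psi_V", "h", "velocity",
            "h_dot", "alpha", "beta", "n_load", "blood", "oil"]

red_sample = {
    "time": 12.5, "distance": 3200.0, "closure": -45.0, "psi_V": 87.0,
    "h": 5500.0, "velocity": 240.0, "h_dot": 12.0, "alpha": 6.5,
    "beta": 0.3, "n_load": 2.1, "blood": 80.0, "oil": 0.62,
}

def to_vector(record):
    """Fixed-order feature vector; raises KeyError on a missing field."""
    return [record[name] for name in FEATURES]

print(len(to_vector(red_sample)))   # 12 features per side
```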
Independence is judged by computing the sample correlation coefficient between every pair of variables with the independence-test method, and causality is verified on the basis of independence combined with the Markov assumption. For more complex data, such as the relative coordinates of an aircraft, the model-plus-noise method is adopted. A cause-and-effect structure diagram among the data is built, a multi-layer perceptron is used to learn the graph transition-matrix parameters, and a causal threshold is set, finally yielding the behavior cause-and-effect structure diagram of the aircraft shown in FIG. 3.
According to the aircraft's current attitude, blood volume and other states, the state at the next moment is computed in combination with FIG. 3. For example, if the angle of attack at the next moment is predicted to increase, the aircraft is judged to perform a pull-up (angle-of-attack) maneuver; if the actually simulated output is indeed an increased angle of attack, the causal structure model explains the aircraft agent's behavior well, achieving the effect of explaining the results of reinforcement-learning training.
The flow of the method of the present invention, combining steps a), b), c) and d), is shown in FIG. 4. The data acquisition and network training parts are off-line processes, and the behavior prediction is an on-line process.
For each reinforcement-learning sequence sample, every time point has a ground-truth selected action, i.e. a label, and for every time point the prediction model also gives a predicted action; the difference between the two serves as the standard for measuring the prediction effect. Accuracy, precision and recall are selected as the evaluation indexes. For the binary classification problem, samples fall into four cases according to the combination of true class and predicted class, as shown in Table 2.
The accuracy A, precision P and recall R are defined as

A = (TP + TN) / (TP + TN + FP + FN)
P = TP / (TP + FP)
R = TP / (TP + FN)

where TP, FP, TN and FN are the entries of the confusion matrix in Table 2.
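The three evaluation indexes can be computed directly from the confusion-matrix counts; the counts below are illustrative:

```python
# Accuracy, precision and recall from binary confusion-matrix counts.
def metrics(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    return accuracy, precision, recall

a, p, r = metrics(tp=40, fp=10, fn=10, tn=40)
print(a, p, r)   # 0.8 0.8 0.8
```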
The test results of the agent behavior interpretation method based on causal relationship inference show that the behavior predicted by the method is basically consistent with the behavior decisions of the agent trained by reinforcement learning.
In conclusion, the behavior predicted by the agent behavior interpretation method based on causal relationship inference is basically consistent with the agent's actual behavior, which means that the behavior of the flying agent can be interpreted with the behavior causal structure model. Secondly, through this interpretation mode, certain unintelligent characteristics of the agent can be discovered, and the agent's training mode can in turn be adjusted.
TABLE 1 Agent training data samples
Time T | Environment state Si | Action Ai | Reward ri
TABLE 2 Classification-result confusion matrix
Actual positive predicted positive: true positive (TP); actual positive predicted negative: false negative (FN); actual negative predicted positive: false positive (FP); actual negative predicted negative: true negative (TN).

Claims (1)

1. An agent behavior interpretation method based on causal relationship inference is characterized by comprising the following steps:
a) performing off-line acquisition on sample data in the process of the reinforcement learning training agent;
sample data is obtained in the interaction process of the intelligent agent and the environment in the process of training the intelligent agent through reinforcement learning; the sample data comprises three parts, namely the state of the environment, the action of the intelligent agent and the reward acquired by the intelligent agent; taking the whole time sequence in the task exploration of the primary reinforcement learning training agent as a group of sample data;
b) combining causality among data and human experience to obtain a behavioral cause-effect structure chart; the construction process of the behavioral cause and effect structure chart comprises the following steps:
(1) discovering causality among data
There are two ways to discover causality among data:
method based on independence testThe sample correlation coefficient by two variables X and Y is:
Figure FDA0003439878000000011
judging independence wherein XiAnd YiWhich represents the value of the variable(s),
Figure FDA0003439878000000012
and
Figure FDA0003439878000000013
represents the mean of the data; verifying causality by combining Markov hypothesis on the basis of independence;
the other is a method that adds noise to a model: Y = f(X, E), with X independent of the noise E; the model comprises the linear model Y = a·X + E and the nonlinear model Y = f2(f1(X) + E), wherein X and Y denote variables, a denotes a weight parameter, f1 and f2 denote function equations, and E denotes the data-noise model;
(2) selecting a proper data causal model according to an air task of the intelligent agent, and constructing a reasonable behavior causal structure chart by combining human understanding analysis, namely priori knowledge, of the task of the intelligent agent on the basis of causality discovery among data; the action cause and effect structure chart is composed of nodes and directed edges connecting the nodes; the nodes represent random variables, directed edges among the nodes represent the mutual relation among the nodes, and the conditional probability represents the strength of the relation among the nodes; the random variable comprises the state of the environment, the action of the agent and the reward acquired by the agent;
c) method for constructing cause and effect structure chart model by using behavior cause and effect structure chart
Inputting the collected sample data into a multilayer perceptron neural network for off-line training, learning out a transfer matrix of nodes of a cause and effect structure graph model, and obtaining the relation weight between the nodes, wherein the weight represents the strength of the relation between the nodes; adding the obtained weight into the behavior cause and effect structure chart obtained in the step b), predicting the behavior of the next action of the intelligent agent by using the current state of the intelligent agent and the behavior cause and effect structure chart with parameters, and comparing the prediction result with the actual result of the intelligent agent to explain an intelligent agent behavior model;
the multilayer perceptron neural network utilizes a ReLU function as an activation function, and the ReLU function provides a very simple nonlinear transformation; given the element x, the function is defined as: relu (x) max (x,0)
more than one hidden layer is introduced into the multilayer perceptron on the basis of a single-layer neural network; the hidden layer lies between the input layer and the output layer, and without an activation function the network model of the multilayer perceptron is
O = (X·W_h + b_h)·W_o + b_o = X·W_h·W_o + b_h·W_o + b_o
wherein O ∈ R^{N×q} is the output of the network; X ∈ R^{N×D} is the input, with N samples and D features; W_h ∈ R^{D×H} is the hidden-layer weight matrix, with H hidden units; b_h ∈ R^{1×H} is the hidden-layer bias; W_o ∈ R^{H×q} is the output-layer weight matrix; b_o ∈ R^{1×q} is the output-layer bias;
The neuron in the hidden layer of the multilayer perceptron neural network is completely connected with each input in the input layer, and the neuron in the output layer is also completely connected with each neuron in the hidden layer; therefore, the hidden layer and the output layer in the multi-layer perceptron neural network are all fully connected layers;
completely training all sample data once, stopping training, and outputting a final weight matrix and a final bias item in the forward propagation process;
d) reasonably explaining the behavior of the agent by using a causal structure diagram model when the agent performs a task;
substituting the environmental state at each moment of the reinforcement-learning task into the forward propagation together with the trained weight matrices and bias terms; for each action, the obtained output is a two-dimensional vector giving the probability that the action occurs and the probability that it does not occur at the current moment, the larger of the two deciding whether the action is executed at the current moment; this is done once for every action, and the actions most likely to occur are found.
CN202111625582.9A 2021-12-28 2021-12-28 Intelligent agent behavior interpretation method based on causal relationship inference Pending CN114358247A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111625582.9A CN114358247A (en) 2021-12-28 2021-12-28 Intelligent agent behavior interpretation method based on causal relationship inference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111625582.9A CN114358247A (en) 2021-12-28 2021-12-28 Intelligent agent behavior interpretation method based on causal relationship inference

Publications (1)

Publication Number Publication Date
CN114358247A true CN114358247A (en) 2022-04-15

Family

ID=81103831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111625582.9A Pending CN114358247A (en) 2021-12-28 2021-12-28 Intelligent agent behavior interpretation method based on causal relationship inference

Country Status (1)

Country Link
CN (1) CN114358247A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116204792A (en) * 2023-04-28 2023-06-02 北京航空航天大学 Training method for generating causal interpretation model
CN116204792B (en) * 2023-04-28 2023-07-14 北京航空航天大学 Training method for generating causal interpretation model

Similar Documents

Publication Publication Date Title
Abraham Rule‐Based expert systems
CN110991027A (en) Robot simulation learning method based on virtual scene training
Hinman Some computational approaches for situation assessment and impact assessment
Vlachos Neuro-fuzzy modeling in bankruptcy prediction
Yang et al. LSTM-based deep learning model for civil aircraft position and attitude prediction approach
Jin et al. A game-theoretic reinforcement learning approach for adaptive interaction at intersections
Liu et al. Reinforcement learning-based collision avoidance: Impact of reward function and knowledge transfer
Harandi et al. A new feature selection method based on task environments for controlling robots
CN114358247A (en) Intelligent agent behavior interpretation method based on causal relationship inference
Osanlou et al. Learning-based preference prediction for constrained multi-criteria path-planning
Alt et al. Correlation priors for reinforcement learning
Tavoosi et al. Optimized path planning of an unmanned vehicle in an unknown environment using the PSO algorithm
Khan et al. Fuzzy cognitive maps and intelligent decision support–a review
Seo et al. Development of an artificial intelligence system to design of structures using reinforcement learning: Proof of concept
CN114154612A (en) Intelligent agent behavior model construction method based on causal relationship inference
He et al. Explainable deep reinforcement learning for UAV autonomous navigation
Er et al. A novel framework for automatic generation of fuzzy neural networks
Sharma et al. Knowledge-oriented methodologies for causal inference relations using fuzzy cognitive maps: A systematic review
Dinesh et al. SOFT COMPUTING
Ali et al. Exploration of unknown environment using deep reinforcement learning
Nayak et al. GA based polynomial neural network for data classification
Bazinas et al. Time-Series of Distributions Forecasting in Agricultural Applications: An Intervals’ Numbers Approach. Eng. Proc. 2021, 5, 12
Cavus An Intelligent Time and Performance Efficient Algorithm for Aircraft Design Optimization
Alfa et al. Analysis of fuzzy and neural networks expert systems in forecasting stock prices
Al-Adhami et al. Obstacle avoidance in mobile robots in RGB-D images using deep neural network and semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination