CN112700011A - Intelligent agent decision information display method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112700011A
Authority
CN
China
Prior art keywords
agent
action step
action
selecting
control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011643879.3A
Other languages
Chinese (zh)
Inventor
王雨萱
徐昀
高�浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN202011643879.3A priority Critical patent/CN112700011A/en
Publication of CN112700011A publication Critical patent/CN112700011A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The embodiment of the application discloses a method and a device for displaying agent decision information, an electronic device, and a storage medium; it relates to the technical field of machine learning and aims to display agent decision information more intuitively. The agent decision information display method comprises the following steps: selecting an agent activity period in a displayed agent decision information display interface, and selecting an agent based on the selected agent activity period; providing, in the display interface, an action step selection control for the selected agent in the selected agent activity period; selecting an action step from the set of action steps of the agent based on the action step selection control; and displaying, based on the selected action step, decision information of the decision made by the agent when executing a specific action in the action step. The embodiment of the application is suitable for displaying agent decision information in machine learning.

Description

Intelligent agent decision information display method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a method and an apparatus for displaying agent decision information, an electronic device, and a storage medium.
Background
Reinforcement learning is one of the methodologies of machine learning. A typical Reinforcement Learning (RL) problem can be summarized as: learning an optimal Policy that allows an agent to take an Action in a specific Environment according to the current State so as to obtain the maximum Reward.
After the agent's learning is completed, some information from the learning process, such as decision information of the agent during learning, often needs to be provided in order to assist the user's decisions. Currently, mainstream reinforcement learning products are mainly oriented to technicians with a certain technical background, and results are usually presented in code form, which is not intuitive enough.
Disclosure of Invention
In view of this, embodiments of the present application provide an agent decision information display method, an apparatus, an electronic device, and a storage medium, which can display agent decision information more intuitively.
In a first aspect, an embodiment of the present application provides an agent decision information display method, including:
displaying an intelligent agent decision information display interface;
selecting an agent activity period in the display interface, and selecting an agent based on the selected agent activity period; wherein the agent is an agent that performs a particular action during the agent activity period;
providing an action step selection control for the selected agent in the selected agent activity cycle in the display interface;
selecting an action step from the set of action steps of the agent based on the action step selection control;
and displaying, based on the selected action step, decision information of the decision made by the agent when executing a specific action in the action step.
According to a specific implementation manner of the embodiment of the application, each agent activity period corresponds to a corresponding accumulated reward value; an agent activity period selection box is arranged in the display interface, and an expansion display operation button is arranged in the agent activity period selection box;
the selecting an agent activity period in the display interface includes:
displaying at least one selectable agent activity period and the corresponding accumulated reward value based on the operation of the expansion display operation button in the agent activity period selection box;
and selecting an agent activity period based on the accumulated reward value.
According to a specific implementation manner of the embodiment of the application, an intelligent agent selection frame is arranged in the display interface, and an expansion display operation button is arranged in the intelligent agent selection frame;
selecting an agent based on the selected agent activity period, comprising:
displaying at least one selectable agent based on an operation of an expansion display operation button in the agent selection box;
an agent is selected from the at least one displayed selectable agent.
According to a specific implementation manner of the embodiment of the application, the action step selection control comprises a coordinate axis control and/or a selection box control;
the selecting an action step from the set of action steps of the agent based on the action step selection control provided in the display interface includes:
selecting an action step from the set of action steps of the agent based on the coordinate axis control and/or the selection box control.
According to a specific implementation manner of the embodiment of the application, after selecting an action step in the set of action steps of the agent based on the coordinate axis control and/or the selection box control, the method further includes:
and displaying the reward value and the accumulated reward value of the currently selected action step in the display interface.
According to a specific implementation manner of the embodiment of the application, the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis;
the selecting an action step from the set of action steps of the agent based on the coordinate axis control includes: selecting the action step corresponding to the target position to which the slider is dragged, based on the dragging operation of the slider on the coordinate axis.
According to a specific implementation manner of the embodiment of the present application, the selecting an action step from the set of action steps of the agent based on the selection box control includes: selecting an action step based on an operation of an up or down button in the selection box control, or based on action step information entered in the display window of the selection box control.
According to a specific implementation manner of the embodiment of the application, the action step selection control comprises a coordinate axis control and a selection box control;
in the process of selecting the action step corresponding to the target position to which the slider is dragged, based on the dragging operation of the slider on the coordinate axis, the method further includes:
synchronously displaying, in the display window of the selection box control, the action step corresponding to the position of the slider on the coordinate axis.
According to a specific implementation manner of the embodiment of the application, the action step selection control comprises a coordinate axis control and a selection box control;
in the process of selecting an action step based on an operation of an up or down button in the selection box control, the method further includes: synchronously moving the slider on the coordinate axis to a preset position, wherein the action step corresponding to the preset position is consistent with the action step displayed in the display window of the selection box control;
alternatively,
in the process of selecting an action step based on the action step information entered in the display window of the selection box control, the method further includes:
automatically moving the slider on the coordinate axis to a preset position, wherein the action step corresponding to the preset position is consistent with the action step corresponding to the action step information entered in the display window of the selection box control.
According to a specific implementation manner of the embodiment of the application, the coordinate axes comprise an abscissa axis and an ordinate axis, the abscissa axis represents the action steps of the selected agent, and the ordinate axis represents the reward value of the selected agent at each action step;
when the action step selection control is provided in the display interface, the method further comprises:
connecting the reward values of the selected agent at each action step to form an action step reward-value trend curve of the selected agent and displaying the trend curve; alternatively,
displaying the reward value of the selected agent at each action step in the form of a histogram.
According to a specific implementation manner of the embodiment of the present application, the displaying, based on the selected action step, the decision information of the agent performing the specific action in the action step includes:
displaying, in a preset area of the display interface and based on the selected action step, observation result information, action distribution information, and/or action output information of the decision made by the agent when executing the specific action in the action step.
In a second aspect, an embodiment of the present application provides an agent decision information display device, including:
the interface display module is used for displaying an intelligent agent decision information display interface;
the period and agent selection module is used for selecting an agent activity period in the display interface and selecting an agent based on the selected agent activity period; wherein the agent is an agent that performs a particular action during the agent activity period;
the step selection control providing module is used for providing action step selection controls of the selected intelligent agent in the selected intelligent agent activity period in the display interface;
the action step selection module is used for selecting an action step from the action step set of the intelligent agent based on an action step selection control provided in the display interface;
and the decision information display module is used for displaying, based on the selected action step, decision information of the decision made by the agent when executing a specific action in the action step.
According to a specific implementation manner of the embodiment of the application, each agent activity period corresponds to a corresponding accumulated reward value; an agent activity period selection box is arranged in the display interface, and an expansion display operation button is arranged in the agent activity period selection box;
the period and agent selection module comprises:
the period selection submodule, which is used for displaying at least one selectable agent activity period and the corresponding accumulated reward value based on the operation of the expansion display operation button in the agent activity period selection box, and for selecting an agent activity period based on the accumulated reward value.
According to a specific implementation manner of the embodiment of the application, an intelligent agent selection frame is arranged in the display interface, and an expansion display operation button is arranged in the intelligent agent selection frame;
the period and agent selection module comprises: an agent selection submodule for: displaying at least one selectable agent based on an operation of an expansion display operation button in the agent selection box; an agent is selected from the presented at least one alternative agent.
According to a specific implementation manner of the embodiment of the application, the action step selection control comprises a coordinate axis control and/or a selection frame control;
the action step selection module is specifically configured to: and selecting an action step from the action step set of the intelligent agent based on the coordinate axis control and/or the selection box control.
According to a specific implementation manner of the embodiment of the present application, the intelligent agent decision information display apparatus further includes: and the reward value information display module is used for displaying the reward value and the accumulated reward value of the currently selected action step in the display interface.
According to a specific implementation manner of the embodiment of the application, the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis;
the action step selection module is specifically configured to: select the action step corresponding to the target position to which the slider is dragged, based on the dragging operation of the slider on the coordinate axis.
According to a specific implementation manner of the embodiment of the present application, the action step selection module is specifically configured to: select an action step based on an operation of an up or down button in the selection box control, or based on action step information entered in the display window of the selection box control.
According to a specific implementation manner of the embodiment of the application, the action step selection control comprises a coordinate axis control and a selection box control;
the action step selection module, in the process of selecting the action step corresponding to the target position to which the slider is dragged based on the dragging operation of the slider on the coordinate axis, is further configured to: synchronously display, in the display window of the selection box control, the action step corresponding to the position of the slider on the coordinate axis.
According to a specific implementation manner of the embodiment of the application, the action step selection control comprises a coordinate axis control and a selection box control;
the action step selection module is further configured to, in the process of selecting an action step based on an operation of an up or down button in the selection box control: synchronously move the slider on the coordinate axis to a preset position, wherein the action step corresponding to the preset position is consistent with the action step displayed in the display window of the selection box control;
alternatively,
the action step selection module, in the process of selecting an action step based on the action step information entered in the display window of the selection box control, is further configured to:
automatically move the slider on the coordinate axis to a preset position, wherein the action step corresponding to the preset position is consistent with the action step corresponding to the action step information entered in the display window of the selection box control.
According to a specific implementation manner of the embodiment of the application, the coordinate axes comprise an abscissa axis and an ordinate axis, the abscissa axis represents the action steps of the selected agent, and the ordinate axis represents the reward value of the selected agent at each action step;
the interface display module, when providing the action step selection control in the display interface, is further configured to:
connect the reward values of the selected agent at each action step to form an action step reward-value trend curve of the selected agent and display the trend curve; alternatively,
display the reward value of the selected agent at each action step in the form of a histogram.
According to a specific implementation manner of the embodiment of the present application, the decision information display module is specifically configured to:
display, in a preset area of the display interface and based on the selected action step, observation result information, action distribution information, and/or action output information of the decision made by the agent when executing the specific action in the action step.
In a third aspect, an embodiment of the present application provides an electronic device, including: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, and is used for executing the intelligent agent decision information display method described in any one of the foregoing implementation manners.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where one or more programs are stored, and the one or more programs are executable by one or more processors to implement the agent decision information presentation method according to any one of the foregoing implementation manners.
In the embodiment of the application, an agent activity period can be selected in the displayed agent decision information display interface, and an agent can be selected based on the selected agent activity period; an action step can be selected from the set of action steps of the agent based on the action step selection control provided in the display interface; and, based on the selected action step, decision information of the decision made by the agent in that action step can be displayed. In this way, the agent decision information is displayed more intuitively, and the user can conveniently obtain the decision information of the decision made by the agent for a specific action.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flow chart of an agent decision information display method according to an embodiment of the present application;
fig. 2a and 2b are schematic diagrams of a reinforcement learning decision information interface displayed by an agent decision information display method according to an embodiment of the present application;
FIG. 3 is a block diagram of an agent decision information presentation device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be understood that the embodiments described are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reinforcement learning is one of the methodologies of machine learning. A typical Reinforcement Learning (RL) problem can be summarized as: learning an optimal Policy that allows an agent to take an Action in a specific Environment according to the current State so as to obtain the maximum Reward.
First, some existing terms in the embodiments of the present application are explained to facilitate understanding by those skilled in the art.
Agent: the learning subject in the algorithm. It can perceive the state of the external environment to make decisions, act on the environment, and adjust its decisions through feedback from the environment.
Environment: the collection of everything outside the agent. Its state changes as a result of the agent's actions, and such changes may be fully or partially perceived by the agent. After each decision, the environment may feed back a corresponding reward to the agent.
State: a description of the environment; it changes after the agent takes an action.
Action space: an action is a description of the agent's behavior and is the result of the agent's decision. The set of all possible actions is the action space, which may be discrete or continuous.
Reward: the feedback that the environment gives to the agent after an action. It is a scalar function of the current state, the action, and the next state.
Return: the accumulation of rewards over time steps. After the concept of a trajectory is introduced, the return is also the sum of all rewards along the trajectory.
The process of reinforcement learning is illustrated, using the terms above, by the character-model training in the human-machine battle mode of Honor of Kings, a game of the Multiplayer Online Battle Arena (MOBA) type. In the game, the character to be trained is the agent, and elements present in the game, such as the red buff, the blue buff (where a buff is a gain state), and the towers, form the environment of the agent. At each step the agent can perform various actions, such as moving (with four options: up, down, left, and right), casting a skill (such as skill 1 or skill 2), or flashing. The environment gives a different reward for each different action, reflected in the increase and decrease of health. The agent's death or a successful tower push ends one round; each action the agent performs is one step, each step has a reward value as feedback, and the reward values of the steps accumulate into the reward value of the round.
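For illustration only, the bookkeeping just described — each executed action is one step with its own reward value, and the step rewards accumulate into the reward value of the round — can be sketched in Python. The `Step` and `Episode` names, fields, and sample values are hypothetical, not part of the patent:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    action: str    # e.g. "move_up", "skill_1", "flash"
    reward: float  # per-step reward fed back by the environment

@dataclass
class Episode:
    """One round: ends when the agent dies or pushes a tower."""
    steps: List[Step] = field(default_factory=list)

    def cumulative_reward(self) -> float:
        # The reward value of the round is the accumulation of the
        # reward values of all its steps.
        return sum(s.reward for s in self.steps)

episode = Episode()
episode.steps.append(Step("move_up", 0.5))   # small positional gain
episode.steps.append(Step("skill_1", 2.0))   # landed a skill
episode.steps.append(Step("flash", -1.0))    # wasted the flash
print(episode.cumulative_reward())  # 1.5
```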
The technical solutions provided by the embodiments of the present application are explained in detail below with reference to the terms above.
Fig. 1 is a schematic flow chart of an agent decision information display method according to an embodiment of the present application, and referring to fig. 1, the agent decision information display method according to the embodiment of the present application includes the steps of:
and S100, displaying an intelligent agent decision information display interface.
The decision information is the decision situation of the agent in a certain activity period (for example, a certain game round) during the learning process, and at a certain step within that activity period. The display interface may be a display interface shown on the display screen of an electronic device such as a server, a desktop computer, a tablet computer, or a smartphone.
S102, selecting an agent activity period in the display interface, and selecting an agent based on the selected agent activity period.
A period selection control for the agent activity period and an agent selection control for selecting each agent within the agent activity period are provided in the display interface. The period selection control and the agent selection control may be selection controls such as selection windows with pull-down buttons.
There may be one or more agents in a selected agent activity period. In the case of multiple agents, any one of the agents may be selected to present decision information for the selected agent. Wherein the selected agent is an agent that performs a particular action during the agent activity cycle.
And S104, providing an action step selection control of the selected intelligent agent in the selected intelligent agent activity period in the display interface.
In the display interface, an action step selection control of the selected agent in the activity period of the selected agent can be provided in advance. And after an agent is selected in the selected agent activity period, updating the content information displayed in the action step selection control provided in the display interface.
In this step, the action step selection control after the content information is updated is displayed in the display interface.
S106, selecting an action step from the set of action steps of the agent based on the action step selection control.
An agent's set of action steps refers to the set of action steps of the agent during an agent activity period. Each action step corresponds to one action: a specific action that the agent chooses to perform from the action space.
And S108, displaying, based on the selected action step, decision information of the decision made by the agent when executing a specific action in the action step.
In this step, based on the selected action step, the decision information of the decision made by the agent when executing the specific action in that action step is displayed in the display interface. The decision information can be displayed in the display interface as, for example, the observation value of the step (observation result), the executed action of the step (action distribution), and the specific action value of the step (action output).
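As a sketch of what such a per-step record might hold — the three display items named above (observation result, action distribution, action output) — the following Python structure is purely illustrative; the field and function names are assumptions, not an interface defined by the patent:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DecisionInfo:
    observation: List[float]               # observed state at this step
    action_distribution: Dict[str, float]  # probability per candidate action
    action_output: str                     # the concrete action taken

def render(info: DecisionInfo) -> str:
    # A display interface would show each field in its own panel;
    # here we simply format them as one line of text.
    top = max(info.action_distribution, key=info.action_distribution.get)
    return (f"observation={info.observation} "
            f"top_action={top} output={info.action_output}")

info = DecisionInfo([0.1, 0.9], {"move_up": 0.7, "skill_1": 0.3}, "move_up")
print(render(info))
```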
In the embodiment of the application, an agent activity period can be selected in the displayed agent decision information display interface, and an agent can be selected based on the selected agent activity period; an action step can be selected from the set of action steps of the agent based on the action step selection control provided in the display interface; and, based on the selected action step, decision information of the decision made by the agent in that action step can be displayed. In this way, the agent decision information is displayed more intuitively, and the user can conveniently obtain the decision information of the decision made by the agent for a specific action.
In some embodiments, each agent activity period may correspond to a respective cumulative prize value. The cumulative prize value may be a cumulative sum of the prize values for each action step of each agent during an agent activity period.
An agent activity period selection box is arranged in the display interface, and an expansion display operation button is arranged in the agent activity period selection box. The selecting an agent activity period in the display interface may include: displaying at least one selectable agent activity period and the corresponding accumulated reward value based on the operation of the expansion display operation button in the agent activity period selection box; and selecting an agent activity period based on the accumulated reward value.
After the accumulated reward values of the agent activity periods are obtained, an agent activity period can be selected based on those values: for example, the activity period with the highest accumulated reward value can be selected, or an activity period of interest can be chosen according to the ordering of the accumulated reward values.
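A minimal Python sketch of the ranking just described; the `(period_id, cumulative_reward)` tuple layout and the sample values are assumptions made only for illustration:

```python
# Selectable agent activity periods with their accumulated reward values.
periods = [("period_1", 12.5), ("period_2", 30.0), ("period_3", 7.25)]

def rank_by_cumulative_reward(items):
    # Order periods so the one with the highest accumulated reward
    # appears first, e.g. at the top of a drop-down list.
    return sorted(items, key=lambda item: item[1], reverse=True)

ranked = rank_by_cumulative_reward(periods)
print(ranked[0])  # ('period_2', 30.0)
```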
In some embodiments, a smart agent selection box is disposed in the display interface, and an expansion display operation button is disposed in the smart agent selection box.
The selecting an agent based on the selected agent activity period may include: displaying at least one selectable agent based on an operation of the expansion display operation button in the agent selection box, and selecting an agent from the at least one displayed selectable agent.
In one example, after clicking on the expansion display operation button in the agent selection box, at least one of the agents from which to select may be presented in the form of a drop-down list to facilitate the user in selecting an agent of interest therefrom.
In some embodiments, the action step selection control provided in the display interface may include a coordinate axis control; the selecting an action step from the set of action steps of the agent based on the action step selection control may include: selecting an action step from the set of action steps of the agent based on the coordinate axis control.
In one example, the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis;
the selecting an action step from the set of action steps of the agent based on the coordinate axis control may include: and selecting an action step corresponding to the sliding block dragged to the target position based on the dragging operation of the sliding block on the coordinate shaft.
In some embodiments, the action step selection control provided in the display interface may include a selection box control. The selecting an action step from the set of action steps of the agent based on the action step selection control includes: selecting an action step from the set of action steps of the agent based on the selection box control.
In one example, the selecting an action step in the set of action steps of the agent based on the selection box control includes: selecting an action step based on an operation of a flip up or flip down button in a selection box control or based on action step information input in a display window of the selection box control.
In some embodiments, the action step selection controls provided in the display interface may include a coordinate axis control and a selection box control; the coordinate axis control comprises a coordinate axis and a sliding block arranged on the coordinate axis.
In one example, selecting an action step from the set of action steps of the agent based on the action step selection control includes: selecting, based on a drag operation of the slider along the coordinate axis, the action step corresponding to the target position to which the slider is dragged.
While the slider is being dragged to the target position on the coordinate axis, the method may further include:
synchronously displaying, in the display window of the selection box control, the action step corresponding to the slider's current position on the coordinate axis.
In another example, selecting an action step from the set of action steps of the agent based on the action step selection control includes: selecting an action step based on operation of the flip-up or flip-down button in the selection box control, or based on action step information entered in the display window of the selection box control.
When an action step is selected via the flip-up or flip-down button, the method may further include: synchronously moving the slider to a preset position on the coordinate axis, where the action step corresponding to the preset position matches the action step displayed in the display window of the selection box control. Alternatively, when an action step is selected via action step information entered in the display window, the method may further include: automatically moving the slider to a preset position on the coordinate axis, where the action step corresponding to the preset position matches the action step corresponding to the entered information.
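The two-way linkage between the slider and the selection box (stepper) described above can be sketched as follows; the class and method names are hypothetical, and a real UI framework would wire these handlers to widget events.

```python
# Hypothetical sketch of the two-way linkage: whichever control changes
# (slider drag, stepper button, or typed input) updates one shared
# "current step" and pushes the new value to the other control.
class StepSelector:
    def __init__(self, num_steps):
        self.num_steps = num_steps
        self.current = 0            # currently selected action step
        self.slider_pos = 0         # slider position, kept in sync
        self.stepper_text = "0"     # stepper display window, kept in sync

    def _sync(self, step):
        step = min(max(step, 0), self.num_steps - 1)  # clamp to valid range
        self.current = step
        self.slider_pos = step          # slider jumps to the matching position
        self.stepper_text = str(step)   # stepper window shows the matching text

    def drag_slider(self, step):        # user drags the slider
        self._sync(step)

    def click_stepper(self, delta):     # user clicks flip-up (+1) / flip-down (-1)
        self._sync(self.current + delta)

    def type_step(self, text):          # user types a step number in the window
        self._sync(int(text))
```

Because every path funnels through `_sync`, the slider and stepper can never disagree about the selected step, which is the consistency property the embodiment describes.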
In some embodiments, the coordinate axes comprise an abscissa axis and an ordinate axis; the axis of abscissas represents action steps for a selected agent and the axis of ordinates represents the reward value for the selected agent at each action step.
In the display interface, the reward value of the selected agent at each action step may be graphically presented.
In one example, the reward values of the selected agent at each action step may be joined by connecting lines to form a trend curve of the selected agent's action step reward values.
In another example, the reward value for the selected agent at each action step may be displayed in a histogram to form a histogram of action step reward values for the selected agent.
Through the action step reward value trend curve or the action step reward value histogram, the user can intuitively grasp how the selected agent's reward values evolve across action steps.
To let the user read off the reward value of a single action step more precisely, in some embodiments the reward value for the action step at the mouse pointer's hover position may be popped up when the pointer hovers over the coordinate axis control.
Where the coordinate axis control comprises a coordinate axis and a slider disposed on the axis, when the mouse pointer hovers over the coordinate axis for a preset time (such as 0.5 seconds or 1 second), the reward value corresponding to that action step can be popped up in a window, bubble, or similar element.
In one example, when the reward value for an action step is popped up, the cumulative reward value for that action step can also be popped up and displayed. The cumulative reward value for an action step is the sum of the reward values of all action steps from the first action step of the selected agent's activity period through the currently selected action step.
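The cumulative reward value defined above is a running sum over the episode's per-step rewards, which can be sketched as:

```python
# Sketch of the definition above: the cumulative reward at a step is the sum
# of rewards from the first step through that step (inclusive).
from itertools import accumulate

def cumulative_rewards(step_rewards):
    """step_rewards: per-step reward values for the selected agent's episode."""
    return list(accumulate(step_rewards))

rewards = [1.0, -0.5, 2.0, 0.5]
cum = cumulative_rewards(rewards)   # running totals: 1.0, 0.5, 2.5, 3.0
```

The tooltip for step i would then show `rewards[i]` alongside `cum[i]`.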
In addition to popping up the reward value and cumulative reward value of an action step via mouse-pointer hovering, in some embodiments, after an action step is selected from the agent's action step set based on the coordinate axis control and/or the selection box control, the reward value and cumulative reward value of the currently selected action step may also be displayed numerically in the display interface, for example beside the coordinate axis control or the selection box control (in the figure, to the right of the selection box control).
In some embodiments, the presenting decision information of the agent performing the specific action at the action step based on the selected action step may include: and displaying observation result information, action distribution information and/or action output information of a decision made by the intelligent agent in the action step when the intelligent agent executes a specific action in a preset area of a display interface based on the selected action step.
Here, "observation result information" is the feedback the environment gives at the current step (i.e., the observation of the environment after this step is performed); "action distribution information" lists the specific actions contained in an action type the agent can perform (for example, the type "move" includes up, down, left, and right) together with the score of each action; "action output information" is the action the agent actually outputs.
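The three kinds of decision information named above might be stored per action step as in this hypothetical sketch (the record type and field names are illustrative, not from the patent):

```python
# Hypothetical per-step record of the three pieces of decision information
# described above, as they might be stored for display in the lower panel.
from dataclasses import dataclass

@dataclass
class DecisionRecord:
    observation: list          # environment feedback after this step
    action_distribution: dict  # score per concrete action within an action type
    action_output: str         # the action the agent actually emitted

record = DecisionRecord(
    observation=[0.1, 0.7, -0.2],
    action_distribution={"up": 0.1, "down": 0.2, "left": 0.1, "right": 0.6},
    action_output="right",
)
# The output action is typically the highest-scoring entry of the distribution:
best = max(record.action_distribution, key=record.action_distribution.get)
```

Displaying all three fields side by side is what lets the user compare what the agent saw, what it considered, and what it finally did.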
After the user selects an action step from the selected agent's action step set, the observation result information, action distribution information, and/or action output information shown in the display interface can be refreshed and displayed in real time.
The following describes an intelligent agent decision information presentation method according to an embodiment of the present application, with reference to a specific example.
Fig. 2a and 2b are schematic diagrams of a reinforcement learning decision information interface displayed by an agent decision information display method according to an embodiment of the present application. Referring to Figs. 2a and 2b, the method of this embodiment can show the user the agent's decision situation in a specific round (activity period) of the learning process, or at a specific step within that round.
The user first selects the round to view through the "current round number" drop-down box, and then selects which step of that round to view through the round's step selection component (if there are multiple agents in the environment, the agent to view is also chosen through "select agent").
The agent's decision at that step is presented through the action step selection control and through contents such as the step's observed value (observation result), the step's executed action (action distribution), and the step's specific action value (action output). The action step selection control can display the distribution and comparison of reward values within the round, as well as the reward value and cumulative reward value of each step of the round.
The interface layout is specified as follows:
in the title column, information such as a title of decision details, a choice of the number of rounds, etc. can be included.
Upper left of the interface: multi-agent switching (agent selection is performed here if there are multiple agents in the current environment).
Upper middle: step selection for the current round (including the selection control, a stepper control supporting input, information prompts, etc.).
Lower layer: display of the observation result, action distribution, and action output at the currently selected step number.
Referring to Fig. 2, the agent decision information presentation process may include:
The user first determines the round number and the specific agent to observe;
once these two items are determined, the round's step selection control (also called the step-number selector) displays the corresponding information, such as reward values, in the step selection control area according to the currently selected round number and agent.
The step selection controls may include a coordinate axis control and a selection box control. The coordinate axes may include an abscissa axis and an ordinate axis; the axis of abscissas represents action steps for a selected agent and the axis of ordinates represents the reward value for the selected agent at each action step. If the reward value is negative, it is displayed on the negative axis. The positive and negative regions may be differentiated by different colors.
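Differentiating the positive and negative reward regions by color, as mentioned above, reduces to a simple per-value choice; this tiny sketch uses hypothetical color names:

```python
# Hypothetical sketch: pick a curve/bar color per step so positive and
# negative reward regions are visually distinguished on the axis control.
def reward_color(value, positive="green", negative="red"):
    """Return the display color for one step's reward value."""
    return positive if value >= 0 else negative

colors = [reward_color(r) for r in [1.0, -0.5, 2.0]]
```

A rendering layer would then draw each segment of the trend curve, or each histogram bar, in the color returned for its reward value.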
In the display interface, the reward values of the selected agent at each action step are joined by connecting lines to form a trend curve of the selected agent's action step reward values.
Hovering the mouse pointer over the coordinate axis control displays the corresponding reward values, such as the reward value for this step and the cumulative reward value up to this step.
When the user drags the slider, the current specific step value is displayed in real time in the selection box control (also called the stepper) on the right.
The stepper supports user input; when input is received, the slider automatically jumps to the position corresponding to that step number.
The user can also step through the content with the stepper's up and down buttons, and the slider on the left moves in linkage.
The reward value information for the slider's current step number is displayed at the far right of the control.
When the user changes the round number or step number, the lower contents (observation result, action distribution, action output, etc.) are refreshed in real time.
Fig. 3 is a block diagram of an agent decision information display apparatus according to an embodiment of the present application, and referring to fig. 3, the agent decision information display apparatus according to the embodiment includes: the system comprises an interface display module 10, a period and agent selection module 20, a step selection control providing module 30, an action step selection module 40 and a decision information display module 50.
And the interface display module 10 is used for displaying the intelligent agent decision information display interface.
The decision information describes the agent's decision situation in a certain activity period of the learning process (such as one game round) and at a certain step within that period. The display interface may be a display interface shown on the display screen of an electronic device such as a server, a desktop computer, a tablet computer, or a smartphone.
A period and agent selection module 20 for selecting an agent activity period in the display interface and selecting an agent based on the selected agent activity period; wherein the agent is an agent that performs a specific action during the agent activity cycle.
A period selection control for the agent activity period, and an agent selection control for selecting among the agents in that activity period, are provided in the display interface. Both may be selection controls such as selection windows with pull-down buttons.
There may be one or more agents in a selected agent activity period. In the case of multiple agents, any one of the agents may be selected to present decision information for the selected agent. Wherein the selected agent is an agent that performs a particular action during the agent activity cycle.
A step selection control providing module 30, configured to provide an action step selection control of the selected agent in the selected agent activity cycle in the display interface.
In the display interface, an action step selection control of the selected agent in the selected agent's activity period can be provided in advance. After an agent is selected within the selected activity period, the content information displayed in that action step selection control is updated.
In this embodiment, the step selection control providing module 30 may display the action step selection control after the content information is updated in the display interface.
And the action step selection module 40 is used for selecting an action step from the action step set of the intelligent agent based on the action step selection control.
An agent's action step set refers to the set of action steps of the agent during an agent activity period. Each action step corresponds to a specific action that the agent chooses to perform from the action space.
An action step selection control of the selected agent in its activity period is provided in the display interface, and an action step can be selected from the agent's action step set through this control.
And a decision information display module 50, configured to display, based on the selected action step, decision information of a decision made by the agent in executing a specific action in the action step.
The decision information can be displayed in a display interface through the observation value (observation result) of the step, the execution action (action distribution) of the step, the specific action value (action output) of the step and the like.
In the embodiment of the application, an agent activity period can be selected from a displayed agent decision information display interface, and an agent is selected based on the selected agent activity period; and selecting a control based on the action steps provided in the display interface, selecting an action step from the action step set of the intelligent agent, and displaying decision information of a decision made by the intelligent agent in the action step based on the selected action step, so that the decision information of the intelligent agent can be displayed more intuitively, and a user can conveniently obtain the decision information of the decision made by the intelligent agent in the specific action.
In some embodiments, each agent activity period corresponds to a respective cumulative reward value; an intelligent agent activity cycle selection frame is arranged in the display interface, and an expansion display operation button is arranged in the intelligent agent activity cycle selection frame.
The period and agent selection module 20 may include: a period selection submodule, configured to display at least one selectable agent activity period and the corresponding cumulative reward value based on operation of the expansion display operation button in the agent activity period selection box, and to select an agent activity period based on the cumulative reward value.
The cumulative reward values of the agent activity periods are obtained, and an agent activity period is selected based on those values. For example, the activity period with the highest cumulative reward value may be selected according to the ordering of cumulative reward values, or the user may select any activity period of interest from the periods so ordered.
In some embodiments, an agent selection box is disposed in the display interface, and an expansion display operation button is disposed in the agent selection box.
The period and agent selection module 20 may include: an agent selection submodule for: displaying at least one selectable agent based on an operation of an expansion display operation button in the agent selection box; an agent is selected from the presented at least one alternative agent.
In one example, after clicking on the expansion display operation button in the agent selection box, at least one of the agents from which to select may be presented in the form of a drop-down list to facilitate the user in selecting an agent of interest therefrom.
In some embodiments, the action step selection control comprises a coordinate axis control; the action step selection module 40 is specifically configured to: and selecting an action step from the action step set of the intelligent agent based on the coordinate axis control.
In one example, the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis;
the action step selection module 40 is specifically configured to: select, based on a drag operation of the slider along the coordinate axis, the action step corresponding to the target position to which the slider is dragged.
In some embodiments, the action step selection control comprises a selection box control.
The action step selection module 40 is specifically configured to: selecting an action step based on an operation of a flip up or flip down button in a selection box control or based on action step information input in a display window of the selection box control.
In some embodiments, the action step selection controls include a coordinate axis control and a selection box control; the action step selection module 40 is specifically configured to: and selecting an action step from the action step set of the intelligent agent based on the coordinate axis control and the selection frame control.
In one example, the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis;
the action step selection module 40 is specifically configured to: select, based on a drag operation of the slider along the coordinate axis, the action step corresponding to the target position to which the slider is dragged.
The action step selection module 40 is further configured to, while the action step corresponding to the slider's target position is being selected via the drag operation: synchronously display, in the display window of the selection box control, the action step corresponding to the slider's position on the coordinate axis.
The action step selection module 40, when selecting an action step based on operation of the flip-up or flip-down button in the selection box control, is further configured to: synchronously move the slider to a preset position on the coordinate axis, where the action step corresponding to the preset position matches the action step displayed in the display window of the selection box control.
Alternatively, when selecting an action step based on action step information entered in the display window of the selection box control, the action step selection module 40 is further configured to: automatically move the slider to a preset position on the coordinate axis, where the action step corresponding to the preset position matches the action step corresponding to the entered action step information.
In some embodiments, the agent decision information presentation apparatus further includes: and the reward value information display module is used for displaying the reward value and the accumulated reward value of the currently selected action step in the display interface.
In some embodiments, the axes include an abscissa axis representing an action step of a selected agent and an ordinate axis representing a reward value for the selected agent at each action step;
the interface display module 10, when providing the action step selection control in the display interface, is further configured to:
connect the reward values of the selected agent at each action step with connecting lines to form and display an action step reward value trend curve for the selected agent; or,
display the reward value of the selected agent at each action step as a histogram.
In some embodiments, the decision information presentation module 50 is specifically configured to: and displaying observation result information, action distribution information and/or action output information of a decision made by the intelligent agent in the action step when the intelligent agent executes a specific action in a preset area of a display interface based on the selected action step.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and referring to fig. 4, the electronic device according to the embodiment includes: the device comprises a shell 41, a processor 42, a memory 43, a circuit board 44 and a power circuit 45, wherein the circuit board 44 is arranged inside a space enclosed by the shell 41, and the processor 42 and the memory 43 are arranged on the circuit board 44; a power supply circuit 45 for supplying power to each circuit or device of the electronic apparatus; the memory 43 is used for storing executable program code; the processor 42 executes a program corresponding to the executable program code by reading the executable program code stored in the memory 43, so as to execute the intelligent agent decision information presentation method described in any one of the foregoing embodiments.
The specific execution process of the above steps by the processor 42 and the steps further executed by the processor 42 by running the executable program code may refer to the description of the embodiment shown in fig. 1 of the present invention, and are not described herein again.
The present application further provides a computer-readable storage medium, where one or more programs are stored, and the one or more programs are executable by one or more processors to implement the intelligent agent decision information presentation method described in any of the foregoing embodiments.
According to the method, the device, the electronic equipment and the storage medium for displaying the intelligent agent decision information, an intelligent agent activity period can be selected from a displayed intelligent agent decision information display interface, and an intelligent agent is selected based on the selected intelligent agent activity period; and selecting a control based on the action steps provided in the display interface, selecting an action step from the action step set of the intelligent agent, and displaying decision information of a decision made by the intelligent agent in the action step based on the selected action step, so that the decision information of the intelligent agent can be displayed more intuitively, and a user can conveniently obtain the decision information of the decision made by the intelligent agent in the specific action.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An agent decision information presentation method, comprising:
displaying an intelligent agent decision information display interface;
selecting an agent activity period in the display interface, and selecting an agent based on the selected agent activity period; wherein the agent is an agent that performs a particular action during the agent activity period;
providing an action step selection control for the selected agent in the selected agent activity cycle in the display interface;
selecting an action step from the action step set of the agent based on the action step selection control;
and displaying decision information of a decision made by the intelligent agent in the action step when the intelligent agent executes a specific action based on the selected action step.
2. The agent decision information presentation method according to claim 1, wherein the action step selection control comprises a coordinate axis control and/or a selection box control;
selecting a control based on the action steps, and selecting an action step in the action step set of the agent, wherein the action step selection control comprises the following steps:
and selecting an action step from the action step set of the intelligent agent based on the coordinate axis control and/or the selection box control.
3. The agent decision information presentation method according to claim 2, wherein after selecting an action step in the set of action steps of the agent based on the coordinate axis control and/or the selection box control, the method further comprises:
and displaying the reward value and the accumulated reward value of the currently selected action step in the display interface.
4. The agent decision information presentation method of claim 2, wherein the coordinate axis control comprises a coordinate axis and a slider disposed on the coordinate axis;
selecting an action step from the set of action steps of the agent based on the coordinate axis control, wherein the selecting an action step comprises:
and selecting an action step corresponding to the sliding block dragged to the target position based on the dragging operation of the sliding block on the coordinate shaft.
5. The agent decision information presentation method of claim 4, wherein the action step selection controls comprise a coordinate axis control and a selection box control;
in the process of selecting the action step corresponding to the slider dragged to the target position based on the dragging operation of the slider on the coordinate axis, the method further comprises the following steps:
and synchronously displaying the action corresponding to the position of the slider on the coordinate axis in a display window of the selection frame control.
6. The agent decision information presentation method of claim 4, wherein the axes comprise an abscissa axis and an ordinate axis, the abscissa axis representing an action step of a selected agent, the ordinate axis representing a reward value for the selected agent at each action step;
when the action step selection control is provided in the display interface, the method further comprises:
connecting the reward values of the selected agent at each action step with connecting lines to form and display an action step reward value trend curve for the selected agent; or,
displaying the reward value of the selected agent at each action step as a histogram.
7. The agent decision information presentation method of claim 6, further comprising: when the mouse pointer hovers over the abscissa axis, popping up the reward value corresponding to the action step at the pointer's hover position.
8. An agent decision information presentation device, comprising:
the interface display module is used for displaying an intelligent agent decision information display interface;
the period and agent selection module is used for selecting an agent activity period in the display interface and selecting an agent based on the selected agent activity period; wherein the agent is an agent that performs a particular action during the agent activity period;
the step selection control providing module is used for providing action step selection controls of the selected intelligent agent in the selected intelligent agent activity period in the display interface;
an action step selection module for selecting an action step from the action step set of the agent based on the control selected in the action step;
and the decision information display module is used for displaying decision information of a decision made by the intelligent agent in the action step when the intelligent agent executes a specific action based on the selected action step.
9. An electronic device, characterized in that the electronic device comprises: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space enclosed by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, and is used for executing the intelligent agent decision information presentation method of any one of the preceding claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs which are executable by one or more processors to implement the agent decision information presentation method of any one of the preceding claims 1-7.
CN202011643879.3A 2020-12-31 2020-12-31 Intelligent agent decision information display method and device, electronic equipment and storage medium Pending CN112700011A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011643879.3A CN112700011A (en) 2020-12-31 2020-12-31 Intelligent agent decision information display method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011643879.3A CN112700011A (en) 2020-12-31 2020-12-31 Intelligent agent decision information display method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112700011A true CN112700011A (en) 2021-04-23

Family

ID=75514215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011643879.3A Pending CN112700011A (en) 2020-12-31 2020-12-31 Intelligent agent decision information display method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112700011A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180336476A1 (en) * 2015-11-30 2018-11-22 Nec Corporation Information processing system, information processing method, and information processing program
CN109542422A (en) * 2018-11-21 2019-03-29 成都聚维合科技有限公司 A method of realizing visual patternization programming
CN110141867A (en) * 2019-04-23 2019-08-20 广州多益网络股份有限公司 A kind of game intelligence body training method and device
CN111259064A (en) * 2020-01-10 2020-06-09 同方知网(北京)技术有限公司 Visual natural language analysis mining system and modeling method thereof
US20200379609A1 (en) * 2019-06-01 2020-12-03 Sap Se Guided drilldown framework for computer-implemented task definition
CN112060082A (en) * 2020-08-19 2020-12-11 大连理工大学 Online stable control humanoid robot based on bionic reinforcement learning type cerebellum model
CN112083922A (en) * 2020-09-21 2020-12-15 深圳市金玺智控技术有限公司 Visual programming method, device, equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUA Bolin; LI Guangjian: "Architecture Design and Key Technologies of Intelligent Information Analysis Systems", Library & Information (图书与情报), no. 06 *
DUAN Yong; CUI Baoxia; XU Xinhe: "Multi-agent reinforcement learning and its application to role assignment in soccer robots", Control Theory & Applications (控制理论与应用), no. 04 *

Similar Documents

Publication Publication Date Title
US10864442B2 (en) System and method for controlling technical processes
CN107648848B (en) Information processing method and device, storage medium, electronic equipment
US10552183B2 (en) Tailoring user interface presentations based on user state
US20140038715A1 (en) Device for providing a game
KR20160009689A (en) Electronic game machine, electronic game processing method, and electronic game program
JP2019208952A (en) Program, recording medium, and control method
CN113900570B (en) Game control method, device, equipment and storage medium
CN112700011A (en) Intelligent agent decision information display method and device, electronic equipment and storage medium
CN112180841A (en) Man-machine interaction method, device, equipment and storage medium
WO2021213234A1 (en) Method and apparatus for providing machine learning application, electronic device, and storage medium
CN115089960A (en) Display control method and device in game and electronic equipment
CN115328354A (en) Interactive processing method and device in game, electronic equipment and storage medium
WO2023062362A1 (en) Visual programming environment for developing interactive media programs
WO2022256797A1 (en) Visualizations and predictions of one or more metrics associated with financial health
CN116483233A (en) Roulette configuration method, device, equipment and storage medium in game
CN112221123A (en) Virtual object switching method and device, computer equipment and storage medium
Mehler et al. A Methodology for Evaluating Multiple Aspects of Learnability: Testing an Early Prototype

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination