CN112700011B - Agent decision information display method and device, electronic equipment and storage medium
- Publication number: CN112700011B
- Application number: CN202011643879.3A
- Authority: CN (China)
- Legal status: Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The embodiments of the present application disclose an agent decision information display method and device, an electronic device, and a storage medium, which relate to the technical field of machine learning and aim to display the decision information of an agent more intuitively. The agent decision information display method comprises the following steps: selecting an agent activity period in a displayed agent decision information display interface, and selecting an agent based on the selected agent activity period; providing, in the display interface, an action step selection control for the selected agent in the selected agent activity period; selecting an action step from the action step set of the agent based on the action step selection control; and based on the selected action step, displaying decision information of the decision made by the agent when executing a specific action in the action step. The embodiments of the present application are suitable for displaying the decision information of an agent in machine learning.
Description
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to an agent decision information display method, an agent decision information display device, an electronic device, and a storage medium.
Background
Reinforcement learning is one of the methodologies of machine learning, and a typical Reinforcement Learning (RL) problem can be summarized as: learning an optimal strategy (Policy) that allows an Agent (Agent) to act (Action) in a particular Environment (Environment) based on a current State (State) to obtain a maximum return (Reward).
After the agent has finished learning, some information about the learning process, such as decision information of the agent during learning, often needs to be provided to assist user decisions. Current mainstream reinforcement learning products are mainly aimed at technical users with a certain background, and results are usually presented in the form of code, which is not intuitive.
Disclosure of Invention
In view of this, the embodiments of the present application provide a method, an apparatus, an electronic device, and a storage medium for displaying decision information of an agent, which can display decision information of an agent more intuitively.
In a first aspect, an embodiment of the present application provides a method for displaying decision information of an agent, including:
displaying an agent decision information display interface;
Selecting an agent activity period in the display interface, and selecting an agent based on the selected agent activity period; wherein the agent is an agent that performs a specific action during the agent activity period;
Providing an action step selection control of the selected agent in the selected agent activity period in the display interface;
selecting an action step from the action step set of the agent based on the action step selection control;
and based on the selected action step, displaying decision information of the decision made by the agent when executing a specific action in the action step.
According to a specific implementation manner of the embodiment of the present application, each agent activity period corresponds to a cumulative reward value; an agent activity period selection box is arranged in the display interface, and an expand display button is arranged in the agent activity period selection box;
the selecting an agent activity period in the display interface includes:
displaying at least one alternative agent activity period and the corresponding cumulative reward value based on operation of the expand display button in the agent activity period selection box;
and selecting an agent activity period based on the cumulative reward value.
According to a specific implementation manner of the embodiment of the present application, an agent selection box is arranged in the display interface, and an expand display button is arranged in the agent selection box;
the selecting an agent based on the selected agent activity period comprises:
displaying at least one alternative agent based on operation of the expand display button in the agent selection box;
and selecting an agent from the displayed at least one alternative agent.
According to a specific implementation manner of the embodiment of the present application, the action step selection control comprises a coordinate axis control and/or a selection box control;
the selecting an action step from the action step set of the agent based on the action step selection control provided in the display interface includes:
selecting an action step from the action step set of the agent based on the coordinate axis control and/or the selection box control.
According to a specific implementation manner of the embodiment of the present application, after selecting an action step from the action step set of the agent based on the coordinate axis control and/or the selection box control, the method further includes:
displaying the reward value and the cumulative reward value of the currently selected action step in the display interface.
According to a specific implementation manner of the embodiment of the present application, the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis;
the selecting an action step from the action step set of the agent based on the coordinate axis control includes: selecting the action step corresponding to the target position to which the slider is dragged, based on a drag operation of the slider on the coordinate axis.
According to a specific implementation manner of the embodiment of the present application, the selecting an action step from the action step set of the agent based on the selection box control includes: selecting an action step based on operation of an up or down button in the selection box control, or based on action step information entered in a display window of the selection box control.
According to a specific implementation manner of the embodiment of the present application, the action step selection control comprises a coordinate axis control and a selection box control;
during the process of selecting the action step corresponding to the target position to which the slider is dragged based on the drag operation of the slider on the coordinate axis, the method further comprises:
synchronously displaying, in the display window of the selection box control, the action step corresponding to the position of the slider on the coordinate axis.
According to a specific implementation manner of the embodiment of the present application, the action step selection control comprises a coordinate axis control and a selection box control;
during the process of selecting an action step based on operation of the up or down button in the selection box control, the method further comprises: synchronously moving the slider on the coordinate axis to a preset position on the coordinate axis, wherein the action step corresponding to the preset position is consistent with the action step displayed in the display window of the selection box control;
or
during the process of selecting an action step based on action step information entered in the display window of the selection box control, the method further comprises:
automatically moving the slider on the coordinate axis to a preset position on the coordinate axis, wherein the action step corresponding to the preset position is consistent with the action step corresponding to the action step information entered in the display window of the selection box control.
According to a specific implementation manner of the embodiment of the present application, the coordinate axes include an abscissa axis and an ordinate axis, the abscissa axis represents an action step of a selected agent, and the ordinate axis represents a reward value for the selected agent at each action step;
When the action step selection control is provided in the display interface, the method further comprises:
connecting the reward values of the selected agent at each action step together by a connecting line to form an action step reward value trend curve of the selected agent, and displaying the action step reward value trend curve; or
displaying the reward value of the selected agent at each action step in the form of a bar chart.
According to a specific implementation manner of the embodiment of the present application, the displaying, based on the selected action step, decision information of the decision made by the agent when executing a specific action in the action step includes:
based on the selected action step, displaying, in a preset area of the display interface, observation result information, action distribution information and/or action output information of the decision made by the agent when executing a specific action in the action step.
In a second aspect, an embodiment of the present application provides an agent decision information display apparatus, including:
an interface display module, configured to display an agent decision information display interface;
a period and agent selection module, configured to select an agent activity period in the display interface and select an agent based on the selected agent activity period; wherein the agent is an agent that performs a specific action during the agent activity period;
a step selection control providing module, configured to provide, in the display interface, an action step selection control for the selected agent in the selected agent activity period;
an action step selection module, configured to select an action step from the action step set of the agent based on the action step selection control provided in the display interface;
and a decision information display module, configured to display, based on the selected action step, decision information of the decision made by the agent when executing a specific action in the action step.
According to a specific implementation manner of the embodiment of the present application, each agent activity period corresponds to a cumulative reward value; an agent activity period selection box is arranged in the display interface, and an expand display button is arranged in the agent activity period selection box;
the period and agent selection module includes:
a period selection sub-module, configured to display at least one alternative agent activity period and the corresponding cumulative reward value based on operation of the expand display button in the agent activity period selection box, and to select an agent activity period based on the cumulative reward value.
According to a specific implementation manner of the embodiment of the present application, an agent selection box is arranged in the display interface, and an expand display button is arranged in the agent selection box;
the period and agent selection module includes: an agent selection sub-module, configured to: display at least one alternative agent based on operation of the expand display button in the agent selection box; and select an agent from the displayed at least one alternative agent.
According to a specific implementation manner of the embodiment of the present application, the action step selection control comprises a coordinate axis control and/or a selection box control;
the action step selection module is specifically configured to: select an action step from the action step set of the agent based on the coordinate axis control and/or the selection box control.
According to a specific implementation manner of the embodiment of the present application, the agent decision information display device further includes: a reward value information display module, configured to display the reward value and the cumulative reward value of the currently selected action step in the display interface.
According to a specific implementation manner of the embodiment of the present application, the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis;
the action step selection module is specifically configured to: select the action step corresponding to the target position to which the slider is dragged, based on a drag operation of the slider on the coordinate axis.
According to a specific implementation manner of the embodiment of the present application, the action step selection module is specifically configured to: select an action step based on operation of an up or down button in the selection box control, or based on action step information entered in a display window of the selection box control.
According to a specific implementation manner of the embodiment of the present application, the action step selection control comprises a coordinate axis control and a selection box control;
the action step selection module is further configured to, during the process of selecting the action step corresponding to the target position to which the slider is dragged based on the drag operation of the slider on the coordinate axis: synchronously display, in the display window of the selection box control, the action step corresponding to the position of the slider on the coordinate axis.
According to a specific implementation manner of the embodiment of the present application, the action step selection control comprises a coordinate axis control and a selection box control;
the action step selection module is further configured to, during the process of selecting an action step based on operation of the up or down button in the selection box control: synchronously move the slider on the coordinate axis to a preset position on the coordinate axis, wherein the action step corresponding to the preset position is consistent with the action step displayed in the display window of the selection box control;
or
the action step selection module is further configured to, during the process of selecting an action step based on action step information entered in the display window of the selection box control:
automatically move the slider on the coordinate axis to a preset position on the coordinate axis, wherein the action step corresponding to the preset position is consistent with the action step corresponding to the action step information entered in the display window of the selection box control.
According to a specific implementation manner of the embodiment of the present application, the coordinate axes include an abscissa axis and an ordinate axis, the abscissa axis represents an action step of a selected agent, and the ordinate axis represents a reward value for the selected agent at each action step;
the interface display module is further configured to, when the action step selection control is provided in the display interface:
connect the reward values of the selected agent at each action step together by a connecting line to form an action step reward value trend curve of the selected agent, and display the action step reward value trend curve; or
display the reward value of the selected agent at each action step in the form of a bar chart.
According to a specific implementation manner of the embodiment of the present application, the decision information display module is specifically configured to:
based on the selected action step, display, in a preset area of the display interface, observation result information, action distribution information and/or action output information of the decision made by the agent when executing a specific action in the action step.
In a third aspect, an embodiment of the present application provides an electronic device, including: the device comprises a shell, a processor, a memory, a circuit board and a power circuit, wherein the circuit board is arranged in a space surrounded by the shell, and the processor and the memory are arranged on the circuit board; a power supply circuit for supplying power to each circuit or device of the electronic apparatus; the memory is used for storing executable program codes; the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory, and is configured to execute the agent decision information display method according to any one of the foregoing implementations.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium storing one or more programs, where the one or more programs are executable by one or more processors to implement the agent decision information display method according to any one of the foregoing implementations.
In the embodiments of the present application, an agent activity period can be selected in the displayed agent decision information display interface, and an agent can be selected based on the selected agent activity period; an action step can then be selected from the action step set of the agent based on the action step selection control provided in the display interface, and, based on the selected action step, decision information of the decision made by the agent when executing a specific action in the action step can be displayed. In this way, the decision information of the agent can be displayed more intuitively, making it convenient for the user to obtain the decision information of the decisions made by the agent when executing specific actions.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of an agent decision information display method according to an embodiment of the present application;
FIGS. 2a and 2b are schematic diagrams of reinforcement learning decision information interfaces according to an embodiment of the present application;
FIG. 3 is a block diagram of an agent decision information display device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be understood that the described embodiments are merely some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Reinforcement learning is one of the methodologies of machine learning. A typical reinforcement learning (Reinforcement Learning, RL) problem can be summarized as: learning an optimal strategy (Policy) that allows an Agent (Agent) to act (Action) in a particular Environment (Environment) based on the current State (State) so as to obtain the maximum return (Reward).
First, some terms used in the embodiments of the present application are explained for ease of understanding by those skilled in the art.
Agent: the subject that performs machine learning in the algorithm. It can perceive the state of the external environment to make decisions, act on the environment, and adjust its decisions based on the feedback from the environment.
Environment: the collection of everything outside the agent. Its state changes under the influence of the agent's actions, and such changes can be fully or partially perceived by the agent. The environment may feed back a corresponding reward to the agent after each decision.
State: a description of the environment, which changes after the agent acts.
Action space: an action is a description of the agent's behavior and the result of the agent's decision. The set of all possible actions is the action space. The action space may be discrete or continuous.
Reward: the feedback given by the environment to the agent after the agent acts. It is a scalar function of the current state, the action, and the next state.
Return: the return is the accumulation of rewards over time steps; after introducing the concept of a trajectory, the return is also the sum of all rewards along the trajectory.
Take character model training in the human-machine combat mode of a Multiplayer Online Battle Arena (MOBA) game as an example to illustrate the reinforcement learning process in terms of the above technical terms. In the game, the character to be trained is the agent; elements such as the red buff and blue buff in the game (where a buff is a gain state) and the towers constitute the agent's environment. At each step the agent can perform one of several actions, such as moving (with four options: up, down, left, and right), casting a skill (e.g., skill 1 or skill 2), or flashing. The environment gives different rewards for the actions the agent performs, for example in the form of increased or decreased health. When the agent dies or successfully pushes a tower, one game ends. Each action executed by the agent is one step, each step has a reward value as feedback, and the reward values of all steps are accumulated to obtain the cumulative reward value of the game.
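As a sketch of how the data behind such an example could be organized for later display, the following TypeScript declarations define hypothetical record types for an activity period (episode), the agents acting in it, and their action steps; all type and field names are illustrative assumptions rather than anything defined by this application.

```typescript
// Hypothetical record types for one agent activity period (episode), the agents
// acting in it, and their action steps. All names are illustrative assumptions.
interface StepRecord {
  stepIndex: number;                           // position of the action step within the episode
  observation: number[];                       // environment feedback observed after the step
  actionDistribution: Record<string, number>;  // scores of candidate actions, e.g. { up: 0.1, skill1: 0.7 }
  actionOutput: string;                        // the specific action actually executed
  reward: number;                              // reward value fed back for this step
}

interface AgentTrace {
  agentId: string;      // which agent this trace belongs to
  steps: StepRecord[];  // the agent's action step set within the episode
}

interface EpisodeRecord {
  episodeId: number;    // game / activity-period number
  agents: AgentTrace[]; // one or more agents acting in this episode
}

// Cumulative reward value of an activity period: the sum of the reward values
// of each action step of each agent in that period.
function cumulativeReward(episode: EpisodeRecord): number {
  return episode.agents.reduce(
    (total, agent) => total + agent.steps.reduce((sum, step) => sum + step.reward, 0),
    0,
  );
}
```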
The technical solutions provided by the embodiments of the present application are explained in detail below in conjunction with the above description of the prior art.
Fig. 1 is a flow chart of an agent decision information display method according to an embodiment of the present application. Referring to fig. 1, the agent decision information display method of the embodiment of the present application includes the following steps:
S100, displaying an agent decision information display interface.
Decision information refers to the decisions made by an agent during a certain activity period (e.g., a particular round of a game) in the learning process, and at a certain step within that activity period. The display interface may be a display interface shown on a display screen of an electronic device such as a server, a desktop computer, a tablet computer, or a smart phone.
S102, selecting an agent activity period in the display interface, and selecting an agent based on the selected agent activity period.
A period selection control for the agent activity periods and an agent selection control for selecting an agent within an agent activity period are provided in the display interface. The period selection control and the agent selection control may each be a selection control such as a selection window with a drop-down button.
There may be one or more agents in a selected agent activity period. In the case of multiple agents, any one of the agents may be selected to present decision information for the selected agent. Wherein the selected agent is an agent that performs a specific action during the agent activity period.
S104, providing action step selection controls of the selected agent in the selected agent activity period in the display interface.
In the display interface, the action step selection control for the selected agent in the selected agent activity period may be provided in advance. After an agent activity period is selected and an agent is selected within that activity period, the content information displayed in the action step selection control provided in the display interface is updated.
In this step, the action step selection control with its updated content information is displayed in the display interface.
S106, selecting an action step from the action step set of the agent based on the action step selection control.
The action step set of an agent refers to the set of individual action steps of the agent within an agent activity period. Each action step corresponds to a specific action that the agent selects to perform from the corresponding action space.
S108, based on the selected action step, displaying decision information of the decision made by the agent when executing a specific action in the action step.
In this step, based on the selected action step, decision information of the decision made by the agent when executing a specific action in the action step is displayed on the display interface. The decision information can be displayed in the display interface through contents such as the observation value of the step (observation result), the action executed at the step (action distribution), and the specific action value of the step (action output).
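A minimal sketch of how step S108 could render the three kinds of decision information into a preset area of the display interface, assuming the hypothetical AgentTrace/StepRecord types sketched earlier and a DOM-based interface; the element ID and function name are assumptions for illustration.

```typescript
// Render the decision information of the selected action step into a preset area
// of the display interface (sketch; the element ID is an illustrative assumption).
function showDecisionInfo(trace: AgentTrace, stepIndex: number): void {
  const step = trace.steps[stepIndex];
  const area = document.getElementById("decision-info-area");
  if (!step || !area) return;

  area.innerHTML = `
    <section><h4>Observation result</h4><pre>${JSON.stringify(step.observation)}</pre></section>
    <section><h4>Action distribution</h4><pre>${JSON.stringify(step.actionDistribution, null, 2)}</pre></section>
    <section><h4>Action output</h4><p>${step.actionOutput}</p></section>
  `;
}
```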
In the embodiments of the present application, an agent activity period can be selected in the displayed agent decision information display interface, and an agent can be selected based on the selected agent activity period; an action step can then be selected from the action step set of the agent based on the action step selection control provided in the display interface, and, based on the selected action step, decision information of the decision made by the agent when executing a specific action in the action step can be displayed. In this way, the decision information of the agent can be displayed more intuitively, making it convenient for the user to obtain the decision information of the decisions made by the agent when executing specific actions.
In some embodiments, each agent activity period may correspond to a respective cumulative reward value. The cumulative reward value may be the cumulative sum of the reward values of each action step of each agent over the agent activity period.
An agent activity period selection box is arranged in the display interface, and an expand display button is arranged in the agent activity period selection box. The selecting an agent activity period in the display interface may include: displaying at least one alternative agent activity period and the corresponding cumulative reward value based on operation of the expand display button in the agent activity period selection box; and selecting an agent activity period based on the cumulative reward value.
When selecting an agent activity period based on the cumulative reward value, the activity period with the highest cumulative reward value may be selected, or an activity period of interest may be selected according to the ranking of the cumulative reward values.
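One possible way to populate the activity period drop-down with cumulative reward values and rank the periods, reusing the hypothetical EpisodeRecord type and cumulativeReward helper sketched earlier:

```typescript
// Build the candidate list for the agent activity period selection box,
// sorted by cumulative reward value in descending order (illustrative sketch).
interface EpisodeOption {
  episodeId: number;
  cumulativeReward: number;
}

function buildEpisodeOptions(episodes: EpisodeRecord[]): EpisodeOption[] {
  return episodes
    .map((e) => ({ episodeId: e.episodeId, cumulativeReward: cumulativeReward(e) }))
    .sort((a, b) => b.cumulativeReward - a.cumulativeReward);
}
```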
In some embodiments, an agent selection box is arranged in the display interface, and an expand display button is arranged in the agent selection box.
The selecting an agent based on the selected agent activity period may include: displaying at least one alternative agent based on operation of the expand display button in the agent selection box; and selecting an agent from the displayed at least one alternative agent.
In one example, after the expand display button in the agent selection box is clicked, at least one alternative agent may be displayed in the form of a drop-down list, making it convenient for the user to select an agent of interest.
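Populating the agent drop-down list for the selected activity period might look like the following sketch, again using the hypothetical EpisodeRecord type from above:

```typescript
// List the agents that acted in the selected activity period, for the agent
// selection box's drop-down list (illustrative sketch).
function buildAgentOptions(episode: EpisodeRecord): string[] {
  return episode.agents.map((a) => a.agentId);
}
```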
In some embodiments, the action step selection control provided in the display interface may include a coordinate axis control. The selecting an action step from the action step set of the agent based on the action step selection control may include: selecting an action step from the action step set of the agent based on the coordinate axis control.
In one example, the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis;
the selecting an action step from the action step set of the agent based on the coordinate axis control may include: selecting the action step corresponding to the target position to which the slider is dragged, based on a drag operation of the slider on the coordinate axis.
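A minimal sketch of mapping the slider's position on the coordinate axis to an action step index; the assumption that the slider value is normalized to [0, 1] is an illustrative choice, not something specified by this application.

```typescript
// Map the slider's position on the coordinate axis to the nearest action step index.
// normalizedPosition is assumed to lie in [0, 1]; totalSteps is the size of the
// agent's action step set for the selected activity period.
function sliderPositionToStep(normalizedPosition: number, totalSteps: number): number {
  const index = Math.round(normalizedPosition * (totalSteps - 1));
  return Math.min(Math.max(index, 0), totalSteps - 1);
}

// For example, dragging the slider to 40% of the axis in a 200-step episode
// selects action step index 80.
```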
In some embodiments, the action step selection control provided in the display interface may include a selection box control. The selecting an action step from the action step set of the agent based on the action step selection control includes: selecting an action step from the action step set of the agent based on the selection box control.
In one example, the selecting an action step from the action step set of the agent based on the selection box control includes: selecting an action step based on operation of an up or down button in the selection box control, or based on action step information entered in a display window of the selection box control.
In some embodiments, the action step selection controls provided in the display interface may include a coordinate axis control and a selection box control; the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis.
In one example, the selecting an action step from the action step set of the agent based on the action step selection control includes: selecting the action step corresponding to the target position to which the slider is dragged, based on a drag operation of the slider on the coordinate axis.
During the process of selecting the action step corresponding to the target position to which the slider is dragged based on the drag operation of the slider on the coordinate axis, the method further comprises:
synchronously displaying, in the display window of the selection box control, the action step corresponding to the position of the slider on the coordinate axis.
In another example, the selecting an action step from the action step set of the agent based on the action step selection control includes: selecting an action step based on operation of an up or down button in the selection box control, or based on action step information entered in a display window of the selection box control.
When selecting an action step based on operation of the up or down button in the selection box control, the method may further comprise: synchronously moving the slider on the coordinate axis to a preset position on the coordinate axis, wherein the action step corresponding to the preset position is consistent with the action step displayed in the display window of the selection box control. Alternatively, when selecting an action step based on action step information entered in the display window of the selection box control, the method may further include: automatically moving the slider on the coordinate axis to a preset position on the coordinate axis, wherein the action step corresponding to the preset position is consistent with the action step corresponding to the action step information entered in the display window of the selection box control.
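The two-way linkage between the coordinate axis control (slider) and the selection box control (stepper) could be wired along the following lines; the controller class and callback names are assumptions, not an API defined by this application.

```typescript
// Illustrative two-way linkage between the slider on the coordinate axis and the
// selection box control (stepper); names and callback shapes are assumptions.
class StepSelectionController {
  private currentStep = 0;

  constructor(
    private totalSteps: number,
    private onStepSelected: (step: number) => void,          // e.g. refresh the decision info area
    private setSliderPosition: (normalized: number) => void, // move the slider to a [0, 1] position
    private setStepperValue: (step: number) => void,         // show the step number in the stepper window
  ) {}

  /** Currently selected action step (0-based). */
  get selectedStep(): number {
    return this.currentStep;
  }

  // Dragging the slider: select the matching step and mirror it in the stepper window.
  onSliderDragged(normalizedPosition: number): void {
    const step = this.clamp(Math.round(normalizedPosition * (this.totalSteps - 1)));
    this.select(step);
    this.setStepperValue(step);
  }

  // Stepper up/down button pressed or a step number typed: select that step and
  // move the slider to the corresponding preset position on the coordinate axis.
  onStepperChanged(step: number): void {
    const clamped = this.clamp(step);
    this.select(clamped);
    this.setStepperValue(clamped);
    this.setSliderPosition(this.totalSteps > 1 ? clamped / (this.totalSteps - 1) : 0);
  }

  private clamp(step: number): number {
    return Math.min(Math.max(step, 0), this.totalSteps - 1);
  }

  private select(step: number): void {
    this.currentStep = step;
    this.onStepSelected(step);
  }
}
```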
In some embodiments, the coordinate axes include an abscissa axis and an ordinate axis; the abscissa axis represents the action steps of the selected agent, and the ordinate axis represents the reward value of the selected agent at each action step.
In the display interface, the reward value of the selected agent at each action step may be displayed graphically.
In one example, the reward values of the selected agent at each action step may be connected together by a connecting line to form an action step reward value trend curve for the selected agent.
In another example, the reward value of the selected agent at each action step may be displayed as a bar chart, forming an action step reward value bar chart for the selected agent.
Through the action step reward value trend curve or the action step reward value bar chart, the user can intuitively grasp the trend of the selected agent's reward values over the action steps.
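A short sketch of assembling the per-step reward series that a generic charting component could draw as either the trend curve or the bar chart; the point format is an assumption, and the hypothetical AgentTrace type from earlier is reused.

```typescript
// Build the (step, reward) series behind the reward trend curve or reward bar chart
// for the selected agent (illustrative sketch; negative rewards plot below the axis).
interface RewardPoint {
  step: number;   // abscissa: action step index
  reward: number; // ordinate: reward value of that step (may be negative)
}

function buildRewardSeries(trace: AgentTrace): RewardPoint[] {
  return trace.steps.map((s) => ({ step: s.stepIndex, reward: s.reward }));
}
```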
To enable the user to accurately obtain the reward value of a particular action step of the selected agent, in some embodiments, hovering the mouse pointer over the coordinate axis control can pop up the reward value of the action step corresponding to the hover position.
In the case where the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis, when the mouse pointer hovers over the coordinate axis for a preset time (e.g., 0.5 seconds or 1 second), the reward value corresponding to the action step can be popped up in a window, a bubble, or the like.
In one example, when the reward value corresponding to the action step is popped up, the cumulative reward value corresponding to the action step may also be popped up and displayed together. The cumulative reward value corresponding to an action step is the sum of the reward values of all action steps from the first action step up to the currently selected action step within the selected agent's activity period.
In addition to displaying the reward value and the cumulative reward value of the selected action step in a pop-up triggered by hovering the mouse pointer, in some embodiments, after an action step is selected from the action step set of the agent based on the coordinate axis control and/or the selection box control, the reward value and the cumulative reward value of the currently selected action step may also be displayed numerically in the display interface, for example, at one side of the coordinate axis control or the selection box control (on the right side of the selection box control in the drawing).
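The reward value and running cumulative reward value shown in the pop-up (and next to the control) can be computed as follows; a minimal sketch reusing the hypothetical AgentTrace type from earlier.

```typescript
// Reward value of the hovered/selected action step, together with the cumulative
// reward from the first step up to and including that step (illustrative sketch).
function stepRewardSummary(trace: AgentTrace, stepIndex: number): { reward: number; cumulative: number } {
  const reward = trace.steps[stepIndex]?.reward ?? 0;
  const cumulative = trace.steps
    .slice(0, stepIndex + 1)
    .reduce((sum, s) => sum + s.reward, 0);
  return { reward, cumulative };
}
```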
In some embodiments, the displaying, based on the selected action step, decision information of the agent executing a specific action in the action step may include: based on the selected action step, displaying, in a preset area of the display interface, observation result information, action distribution information and/or action output information of the decision made by the agent when executing a specific action in the action step.
Here, the "observation result information" represents the feedback given by the environment at the current step (i.e., the observation of the environment after the step is executed); the "action distribution information" represents the types of actions the agent can execute (e.g., "move" includes the up, down, left, and right actions) and the scores of the different specific actions contained in each type; and the "action output information" represents the action actually output by the agent.
After the user selects an action step from the action step set of the selected agent, the observation result information, action distribution information and/or action output information displayed in the display interface can be refreshed and displayed in real time.
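Tying the pieces together, the real-time refresh could be achieved by passing the decision information renderer as the step selection callback; this sketch reuses the hypothetical helpers defined in the earlier snippets.

```typescript
// Illustrative wiring: whichever way the step changes (slider drag, stepper up/down
// button, or typed step number), the decision information area refreshes in real time.
function attachDecisionPanel(
  trace: AgentTrace,
  setSliderPosition: (normalized: number) => void,
  setStepperValue: (step: number) => void,
): StepSelectionController {
  return new StepSelectionController(
    trace.steps.length,
    (step) => showDecisionInfo(trace, step), // refresh observation / distribution / output
    setSliderPosition,
    setStepperValue,
  );
}
```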
The method for displaying decision information of an agent according to the embodiment of the present application is described below with reference to a specific example.
Fig. 2a and fig. 2b are schematic views of reinforcement learning decision information interfaces displayed by the agent decision information display method according to an embodiment of the present application. Referring to fig. 2a and fig. 2b, the method of the embodiment of the present application can display, for the user, a specific game in the learning process or the decision situation at a certain step in that game.
The user first needs to determine which game to view through the "current game number" drop-down box, and then determine which step in that game to view through the step number selection component (and, if there are multiple agents in the environment, which agent to view also needs to be selected through "select agent").
The decision made by the agent at that step is presented through contents such as the action step selection control, the observation value of the step (observation result), the action executed at the step (action distribution), and the specific action value of the step (action output). The action step selection control can be used to display the distribution and comparison of rewards in the game, the reward of each step in the game, the cumulative reward, and the like.
The interface layout is specifically described as follows:
The title bar may include information such as decision details and game number selection.
Left side of the upper layer of the interface: multi-agent switching (agent selection when multiple agents exist in the current environment).
Upper middle of the interface: step number selection for the current game (including the selection control, a stepper control supporting input, information prompts, etc.).
Lower layer of the interface: display of the observation result, action distribution, and action output of the currently selected step.
Referring to fig. 2a and fig. 2b, the agent decision information display process may include:
The user determines the game number and the specific agent to be observed;
After these two pieces of information are determined, the step selection control (also called the step number selector) for the game displays corresponding information, such as reward values, in the step selection control area according to the currently selected game number and agent.
The step selection controls may include a coordinate axis control and a selection box control. The coordinate axes may include an abscissa axis and an ordinate axis; the abscissa axis represents the action steps of the selected agent, and the ordinate axis represents the reward value of the selected agent at each action step. If a reward value is negative, it is displayed on the negative half of the axis. The positive and negative regions may be distinguished by different colors.
In the display interface, the reward values of the selected agent at each action step are connected together by a connecting line to form an action step reward value trend curve for the selected agent.
When the user's mouse pointer hovers over the coordinate axis control, the corresponding reward values, such as the step reward value and the cumulative reward value up to that step, are displayed.
When the user drags the slider, the current step value is displayed in real time in the selection box control on the right (also called a stepper).
The stepper supports user input, and when user input is received, the slider automatically jumps to the corresponding step number position.
The user can browse the content step by step through the up and down buttons of the stepper, and the slider on the left is linked with these buttons.
Reward value information for the step number at the current slider position is displayed at the rightmost side of the control.
When the user changes the game number/step number, the content such as the observation result, action distribution, and action output is refreshed in real time.
Fig. 3 is a block diagram of an agent decision information display device according to an embodiment of the present application. Referring to fig. 3, the agent decision information display device of this embodiment includes: an interface display module 10, a period and agent selection module 20, a step selection control providing module 30, an action step selection module 40, and a decision information display module 50.
The interface display module 10 is used for displaying an agent decision information display interface.
Decision information refers to the decisions made by an agent during a certain activity period (e.g., a particular round of a game) in the learning process, and at a certain step within that activity period. The display interface may be a display interface shown on a display screen of an electronic device such as a server, a desktop computer, a tablet computer, or a smart phone.
The period and agent selection module 20 is configured to select an agent activity period in the display interface, and select an agent based on the selected agent activity period; wherein the agent is an agent that performs a specific action during the agent activity period.
A period selection control for the agent activity periods and an agent selection control for selecting an agent within an agent activity period are provided in the display interface. The period selection control and the agent selection control may each be a selection control such as a selection window with a drop-down button.
There may be one or more agents in a selected agent activity period. In the case of multiple agents, any one of the agents may be selected to present decision information for the selected agent. Wherein the selected agent is an agent that performs a specific action during the agent activity period.
The step selection control providing module 30 is configured to provide, in the display interface, an action step selection control for the selected agent in the selected agent activity period.
In the display interface, the action step selection control for the selected agent in the selected agent activity period may be provided in advance. After an agent activity period is selected and an agent is selected within that activity period, the content information displayed in the action step selection control provided in the display interface is updated.
In this embodiment, the step selection control providing module 30 may display, in the display interface, the action step selection control with its updated content information.
The action step selection module 40 is configured to select an action step from the action step set of the agent based on the action step selection control.
The action step set of an agent refers to the set of individual action steps of the agent within an agent activity period. Each action step corresponds to a specific action that the agent selects to perform from the corresponding action space.
An action step selection control for the selected agent in the selected agent activity period is provided in the display interface, and an action step can be selected from the action step set of the agent through this control.
The decision information display module 50 is configured to display decision information of a decision made by the agent when the agent performs a specific action in the action step based on the selected action step.
The decision information can be displayed in the display interface through contents such as the observation value of the step (observation result), the action executed at the step (action distribution), and the specific action value of the step (action output).
In the embodiments of the present application, an agent activity period can be selected in the displayed agent decision information display interface, and an agent can be selected based on the selected agent activity period; an action step can then be selected from the action step set of the agent based on the action step selection control provided in the display interface, and, based on the selected action step, decision information of the decision made by the agent when executing a specific action in the action step can be displayed. In this way, the decision information of the agent can be displayed more intuitively, making it convenient for the user to obtain the decision information of the decisions made by the agent when executing specific actions.
In some embodiments, each agent activity period corresponds to a respective cumulative reward value; an agent activity period selection box is arranged in the display interface, and an expand display button is arranged in the agent activity period selection box.
The period and agent selection module 20 may include: a period selection sub-module, configured to display at least one alternative agent activity period and the corresponding cumulative reward value based on operation of the expand display button in the agent activity period selection box, and to select an agent activity period based on the cumulative reward value.
When selecting an agent activity period based on the cumulative reward value, the activity period with the highest cumulative reward value may be selected, or an activity period of interest may be selected according to the ranking of the cumulative reward values.
In some embodiments, an agent selection box is arranged in the display interface, and an expand display button is arranged in the agent selection box.
The period and agent selection module 20 may include: an agent selection sub-module, configured to display at least one alternative agent based on operation of the expand display button in the agent selection box, and to select an agent from the displayed at least one alternative agent.
In one example, after the expand display button in the agent selection box is clicked, at least one alternative agent may be displayed in the form of a drop-down list, making it convenient for the user to select an agent of interest.
In some embodiments, the action step selection control comprises a coordinate axis control; the action step selection module 40 is specifically configured to: select an action step from the action step set of the agent based on the coordinate axis control.
In one example, the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis;
the action step selection module 40 is specifically configured to: select the action step corresponding to the target position to which the slider is dragged, based on a drag operation of the slider on the coordinate axis.
In some embodiments, the action step selection control comprises a selection box control.
The action step selection module 40 is specifically configured to: select an action step based on operation of an up or down button in the selection box control, or based on action step information entered in a display window of the selection box control.
In some embodiments, the action step selection control includes a coordinate axis control and a selection box control; the action step selection module 40 is specifically configured to: select an action step from the action step set of the agent based on the coordinate axis control and the selection box control.
In one example, the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis;
the action step selection module 40 is specifically configured to: select the action step corresponding to the target position to which the slider is dragged, based on a drag operation of the slider on the coordinate axis.
The action step selection module 40 is further configured to, during the process of selecting the action step corresponding to the target position to which the slider is dragged based on the drag operation of the slider on the coordinate axis: synchronously display, in the display window of the selection box control, the action step corresponding to the position of the slider on the coordinate axis.
The action step selection module 40 is further configured to, during the process of selecting an action step based on operation of the up or down button in the selection box control: synchronously move the slider on the coordinate axis to a preset position on the coordinate axis, wherein the action step corresponding to the preset position is consistent with the action step displayed in the display window of the selection box control;
or
the action step selection module 40 is further configured to, during the process of selecting an action step based on action step information entered in the display window of the selection box control:
automatically move the slider on the coordinate axis to a preset position on the coordinate axis, wherein the action step corresponding to the preset position is consistent with the action step corresponding to the action step information entered in the display window of the selection box control.
In some embodiments, the agent decision information display device further includes: a reward value information display module, configured to display the reward value and the cumulative reward value of the currently selected action step in the display interface.
In some embodiments, the coordinate axes include an abscissa axis and an ordinate axis, the abscissa axis representing the action steps of the selected agent and the ordinate axis representing the reward value of the selected agent at each action step;
The interface display module 10 is further configured to, when the action step selection control is provided in the display interface:
connect the reward values of the selected agent at each action step together by a connecting line to form an action step reward value trend curve of the selected agent, and display the action step reward value trend curve; or
display the reward value of the selected agent at each action step in the form of a bar chart.
In some embodiments, the decision information display module 50 is specifically configured to: based on the selected action step, display, in a preset area of the display interface, observation result information, action distribution information and/or action output information of the decision made by the agent when executing a specific action in the action step.
The device of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and its implementation principle and technical effects are similar, and are not described here again.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application, referring to fig. 4, the electronic device according to the embodiment includes: the device comprises a shell 41, a processor 42, a memory 43, a circuit board 44 and a power circuit 45, wherein the circuit board 44 is arranged in a space surrounded by the shell 41, and the processor 42 and the memory 43 are arranged on the circuit board 44; a power supply circuit 45 for supplying power to the respective circuits or devices of the above-described electronic apparatus; the memory 43 is for storing executable program code; the processor 42 executes a program corresponding to the executable program code by reading the executable program code stored in the memory 43, for executing the agent decision information presentation method according to any of the foregoing embodiments.
The specific implementation of the above steps by the processor 42, and the further implementation of the steps by the processor 42 through the execution of the executable program code, may be referred to in the description of the embodiment of fig. 1 of the present application, and is not repeated herein.
The embodiment of the application also provides a computer readable storage medium, which stores one or more programs, and the one or more programs can be executed by one or more processors to implement the agent decision information display method according to any of the foregoing embodiments.
According to the agent decision information display method and device, the electronic device, and the storage medium of the embodiments of the present application, an agent activity period can be selected in the displayed agent decision information display interface, and an agent can be selected based on the selected agent activity period; an action step can then be selected from the action step set of the agent based on the action step selection control provided in the display interface, and, based on the selected action step, decision information of the decision made by the agent when executing a specific action in the action step can be displayed. In this way, the decision information of the agent can be displayed more intuitively, making it convenient for the user to obtain the decision information of the decisions made by the agent when executing specific actions.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises an element.
In this specification, the embodiments are described in a related manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments.
The present application is not limited to the above embodiments; any changes or substitutions that would readily occur to those skilled in the art within the technical scope disclosed by the present application are intended to fall within its scope. Therefore, the protection scope of the application is subject to the protection scope of the claims.
Claims (15)
1. An agent decision information display method, characterized by comprising:
displaying an agent decision information display interface;
selecting an agent activity period in the display interface, and selecting an agent based on the selected agent activity period, wherein the agent is an agent that performs a specific action during the agent activity period;
providing, in the display interface, an action step selection control of the selected agent in the selected agent activity period;
selecting an action step from the action step set of the agent based on the action step selection control; and
displaying, based on the selected action step, decision information of the decision made by the agent when performing the specific action in the action step;
wherein the action step selection control comprises a coordinate axis control and/or a selection box control; the selecting an action step from the action step set of the agent based on the action step selection control comprises: selecting an action step from the action step set of the agent based on the coordinate axis control and/or the selection box control; after the action step is selected from the action step set of the agent based on the coordinate axis control and/or the selection box control, the method further comprises: displaying, in the display interface, the reward value and the cumulative reward value of the currently selected action step; the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis; and the selecting an action step from the action step set of the agent based on the coordinate axis control comprises: selecting, based on a drag operation of the slider on the coordinate axis, the action step corresponding to the target position to which the slider is dragged;
the displaying, based on the selected action step, decision information of the decision made by the agent when performing the specific action in the action step comprises: displaying, based on the selected action step and in a preset area of the display interface, observation result information, action distribution information and/or action output information of the decision made by the agent when performing the specific action in the action step;
the coordinate axis comprises an abscissa axis and an ordinate axis, the abscissa axis representing the action steps of the selected agent and the ordinate axis representing the reward value of the selected agent at each action step; and when the action step selection control is provided in the display interface, the method further comprises: joining the reward values of the selected agent at each action step with a connecting line to form an action step reward value trend curve of the selected agent, and displaying the trend curve; or displaying the reward value of the selected agent at each action step in the form of a bar chart.
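A minimal sketch of the coordinate axis control described in claim 1, assuming matplotlib as the rendering backend (the patent does not name a toolkit): the abscissa is the action step, the ordinate is the reward value, the per-step rewards are joined into a trend curve, and a slider selects the current action step while its reward value and cumulative reward value are displayed. All data and variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import Slider

rewards = np.array([1.0, 0.5, -0.2, 2.0, 1.5, 0.0, 0.8])   # demo reward values
steps = np.arange(len(rewards))

fig, ax = plt.subplots()
plt.subplots_adjust(bottom=0.25)
ax.plot(steps, rewards, marker="o")        # trend curve (connecting-line form)
# ax.bar(steps, rewards)                   # alternative: bar-chart form
ax.set_xlabel("action step")
ax.set_ylabel("reward value")

title = ax.set_title(
    f"step 0  reward={rewards[0]:.2f}  cumulative={rewards[:1].sum():.2f}")

slider_ax = fig.add_axes([0.15, 0.1, 0.7, 0.04])   # slider below the axis
step_slider = Slider(slider_ax, "step", 0, len(rewards) - 1, valinit=0, valstep=1)


def on_step_selected(value):
    """Update the reward / cumulative reward readout for the dragged-to step."""
    i = int(value)
    title.set_text(
        f"step {i}  reward={rewards[i]:.2f}  cumulative={rewards[:i + 1].sum():.2f}")
    fig.canvas.draw_idle()


step_slider.on_changed(on_step_selected)
plt.show()
```

Swapping the `ax.plot` call for the commented-out `ax.bar` call gives the bar-chart form mentioned at the end of the claim.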
2. The agent decision information display method of claim 1, wherein each agent activity period corresponds to a respective cumulative reward value; an agent activity period selection box is arranged in the display interface, and an expand display operation button is arranged in the agent activity period selection box;
the selecting an agent activity period in the display interface comprises:
displaying at least one candidate agent activity period and its corresponding cumulative reward value based on an operation of the expand display operation button in the agent activity period selection box; and
selecting an agent activity period based on the cumulative reward value.
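For illustration only, the following sketch models the expand-and-select behaviour of the activity period selection box: expanding lists each candidate period with its cumulative reward value, and a period is then chosen according to that value (here, simply the largest). The data layout and the "largest reward" rule are assumptions made for this example.

```python
episodes = {
    "episode_01": {"cumulative_reward": 12.5},
    "episode_02": {"cumulative_reward": 30.0},
    "episode_03": {"cumulative_reward": 21.7},
}


def expand_periods(episodes):
    """What the expanded selection box would list: period id + cumulative reward."""
    return [(name, info["cumulative_reward"]) for name, info in episodes.items()]


def select_period_by_reward(episodes):
    """Pick the activity period with the highest cumulative reward value."""
    return max(episodes, key=lambda name: episodes[name]["cumulative_reward"])


for name, reward in expand_periods(episodes):
    print(f"{name}: cumulative reward {reward}")
print("selected:", select_period_by_reward(episodes))
```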
3. The agent decision information display method of claim 1, wherein an agent selection box is arranged in the display interface, and an expand display operation button is arranged in the agent selection box;
the selecting an agent based on the selected agent activity period comprises:
displaying at least one candidate agent based on an operation of the expand display operation button in the agent selection box; and
selecting an agent from the displayed at least one candidate agent.
4. The agent decision information display method of claim 1, wherein the selecting an action step from the action step set of the agent based on the selection box control comprises:
selecting an action step based on an operation of an up button or a down button in the selection box control, or based on action step information entered in a display window of the selection box control.
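A hypothetical sketch of the selection box behaviour in claim 4: the current action step can be stepped with up/down buttons or set by typing a step number into the display window, with out-of-range input clamped to the valid step set. The class and method names are illustrative, not taken from the patent.

```python
class StepSelectionBox:
    """Toy model of a spinner-style action step selection box."""

    def __init__(self, num_steps: int):
        self.num_steps = num_steps
        self.current = 0

    def press_up(self) -> int:
        self.current = min(self.current + 1, self.num_steps - 1)
        return self.current

    def press_down(self) -> int:
        self.current = max(self.current - 1, 0)
        return self.current

    def enter_step(self, text: str) -> int:
        """Typed action step information in the display window (numeric input assumed)."""
        self.current = max(0, min(int(text), self.num_steps - 1))
        return self.current


box = StepSelectionBox(num_steps=100)
print(box.press_up())        # 1
print(box.enter_step("42"))  # 42
print(box.press_down())      # 41
```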
5. The agent decision information display method of claim 1, wherein the action step selection control comprises a coordinate axis control and a selection box control;
wherein, in the process of selecting the action step corresponding to the target position to which the slider is dragged based on the drag operation of the slider on the coordinate axis, the method further comprises:
synchronously displaying, in the display window of the selection box control, the action step corresponding to the position of the slider on the coordinate axis.
6. The agent decision information display method according to claim 4, wherein
the action step selection control comprises a coordinate axis control and a selection box control;
in the process of selecting an action step based on an operation of the up or down button in the selection box control, the method further comprises: synchronously moving the slider on the coordinate axis to a preset position on the coordinate axis, wherein the action step corresponding to the preset position is consistent with the action step displayed in the display window of the selection box control;
or
in the process of selecting an action step based on the action step information entered in the display window of the selection box control, the method further comprises:
automatically moving the slider on the coordinate axis to a preset position on the coordinate axis, wherein the action step corresponding to the preset position is consistent with the action step corresponding to the action step information entered in the display window of the selection box control.
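The two-way synchronization in claims 5 and 6 can be sketched as follows, purely for illustration: whichever control changes the action step pushes the new value to the other, so the slider position on the coordinate axis and the selection box display window always show the same step. This wiring is one possible design, not necessarily the patented implementation.

```python
class SyncedStepControls:
    """Toy model keeping the coordinate-axis slider and selection box in sync."""

    def __init__(self, num_steps: int):
        self.num_steps = num_steps
        self.slider_step = 0   # step implied by the slider position
        self.box_step = 0      # step shown in the selection box display window

    def drag_slider_to(self, step: int):
        step = max(0, min(step, self.num_steps - 1))
        self.slider_step = step
        self.box_step = step   # selection box window follows the slider

    def type_into_box(self, step: int):
        step = max(0, min(step, self.num_steps - 1))
        self.box_step = step
        self.slider_step = step  # slider moves to the matching axis position


controls = SyncedStepControls(num_steps=200)
controls.drag_slider_to(37)
assert controls.box_step == 37
controls.type_into_box(120)
assert controls.slider_step == 120
print(controls.slider_step, controls.box_step)
```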
7. The agent decision information display method of claim 1, further comprising: when a mouse pointer hovers over the abscissa axis, popping up the reward value of the action step corresponding to the hover position of the mouse pointer.
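A small sketch of the hover behaviour in claim 7, under the assumption that the plotting toolkit already converts the pointer position into data coordinates: the x coordinate on the abscissa axis is mapped to the nearest action step, and the pop-up shows that step's reward value. The function name and data are illustrative.

```python
rewards = [1.0, 0.5, -0.2, 2.0, 1.5]


def tooltip_for_hover(x_data: float) -> str:
    """Pop-up text for a pointer hovering at x_data on the action step axis."""
    step = min(range(len(rewards)), key=lambda i: abs(i - x_data))
    return f"step {step}: reward {rewards[step]:+.2f}"


print(tooltip_for_hover(2.3))   # nearest step is 2 -> "step 2: reward -0.20"
```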
8. An agent decision information display device, characterized by comprising:
an interface display module, configured to display an agent decision information display interface;
a period and agent selection module, configured to select an agent activity period in the display interface and select an agent based on the selected agent activity period, wherein the agent is an agent that performs a specific action during the agent activity period;
a step selection control providing module, configured to provide, in the display interface, an action step selection control of the selected agent in the selected agent activity period;
an action step selection module, configured to select an action step from the action step set of the agent based on the action step selection control, wherein the action step selection control comprises a coordinate axis control and/or a selection box control, and the coordinate axis control comprises a coordinate axis and a slider arranged on the coordinate axis; the action step selection module is specifically configured to: select an action step from the action step set of the agent based on the coordinate axis control and/or the selection box control; and select, based on a drag operation of the slider on the coordinate axis, the action step corresponding to the target position to which the slider is dragged;
a reward value information display module, configured to display, in the display interface, the reward value and the cumulative reward value of the currently selected action step; and
a decision information display module, configured to display, based on the selected action step, decision information of the decision made by the agent when performing the specific action in the action step; the decision information display module is specifically configured to: display, based on the selected action step and in a preset area of the display interface, observation result information, action distribution information and/or action output information of the decision made by the agent when performing the specific action in the action step;
wherein the coordinate axis comprises an abscissa axis and an ordinate axis, the abscissa axis representing the action steps of the selected agent and the ordinate axis representing the reward value of the selected agent at each action step; and the interface display module is further configured to, when the action step selection control is provided in the display interface: join the reward values of the selected agent at each action step with a connecting line to form an action step reward value trend curve of the selected agent and display the trend curve; or display the reward value of the selected agent at each action step in the form of a bar chart.
9. The agent decision information display device of claim 8, wherein each agent activity period corresponds to a respective cumulative reward value; an agent activity period selection box is arranged in the display interface, and an expand display operation button is arranged in the agent activity period selection box;
the period and agent selection module comprises:
a period selection sub-module, configured to display at least one candidate agent activity period and its corresponding cumulative reward value based on an operation of the expand display operation button in the agent activity period selection box, and to select an agent activity period based on the cumulative reward value.
10. The agent decision information display device of claim 8, wherein an agent selection box is arranged in the display interface, and an expand display operation button is arranged in the agent selection box;
the period and agent selection module comprises an agent selection sub-module, configured to: display at least one candidate agent based on an operation of the expand display operation button in the agent selection box; and select an agent from the displayed at least one candidate agent.
11. The agent decision information display device of claim 8, wherein the action step selection module is specifically configured to: select an action step based on an operation of an up button or a down button in the selection box control, or based on action step information entered in a display window of the selection box control.
12. The agent decision information display device of claim 8, wherein the action step selection control comprises a coordinate axis control and a selection box control;
the action step selection module is further configured to, in the process of selecting the action step corresponding to the target position to which the slider is dragged based on the drag operation of the slider on the coordinate axis: synchronously display, in the display window of the selection box control, the action step corresponding to the position of the slider on the coordinate axis.
13. The agent decision information display device of claim 11, wherein
the action step selection control comprises a coordinate axis control and a selection box control;
the action step selection module is further configured to, in the process of selecting an action step based on an operation of the up or down button in the selection box control: synchronously move the slider on the coordinate axis to a preset position on the coordinate axis, wherein the action step corresponding to the preset position is consistent with the action step displayed in the display window of the selection box control;
or
the action step selection module is further configured to, in the process of selecting an action step based on the action step information entered in the display window of the selection box control:
automatically move the slider on the coordinate axis to a preset position on the coordinate axis, wherein the action step corresponding to the preset position is consistent with the action step corresponding to the action step information entered in the display window of the selection box control.
14. An electronic device, comprising: a housing, a processor, a memory, a circuit board and a power supply circuit, wherein the circuit board is arranged in a space enclosed by the housing, and the processor and the memory are arranged on the circuit board; the power supply circuit is configured to supply power to the circuits or devices of the electronic device; the memory is configured to store executable program code; and the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to perform the agent decision information display method of any one of claims 1-7.
15. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the agent decision information display method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011643879.3A CN112700011B (en) | 2020-12-31 | 2020-12-31 | Agent decision information display method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011643879.3A CN112700011B (en) | 2020-12-31 | 2020-12-31 | Agent decision information display method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112700011A CN112700011A (en) | 2021-04-23 |
CN112700011B true CN112700011B (en) | 2024-05-31 |
Family
ID=75514215
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011643879.3A Active CN112700011B (en) | 2020-12-31 | 2020-12-31 | Agent decision information display method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112700011B (en) |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6819607B2 (en) * | 2015-11-30 | 2021-01-27 | 日本電気株式会社 | Information processing system, information processing method and information processing program |
US11182049B2 (en) * | 2019-06-01 | 2021-11-23 | Sap Se | Guided drilldown framework for computer-implemented task definition |
- 2020-12-31: application CN202011643879.3A filed in CN (patent CN112700011B, status: Active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109542422A (en) * | 2018-11-21 | 2019-03-29 | 成都聚维合科技有限公司 | A method of realizing visual patternization programming |
CN110141867A (en) * | 2019-04-23 | 2019-08-20 | 广州多益网络股份有限公司 | A kind of game intelligence body training method and device |
CN111259064A (en) * | 2020-01-10 | 2020-06-09 | 同方知网(北京)技术有限公司 | Visual natural language analysis mining system and modeling method thereof |
CN112060082A (en) * | 2020-08-19 | 2020-12-11 | 大连理工大学 | Online stable control humanoid robot based on bionic reinforcement learning type cerebellum model |
CN112083922A (en) * | 2020-09-21 | 2020-12-15 | 深圳市金玺智控技术有限公司 | Visual programming method, device, equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
Multi-agent reinforcement learning and its application in role assignment for soccer robots; 段勇, 崔宝侠, 徐心和; Control Theory & Applications, Issue 04; full text *
Research on the architecture design and key technologies of an intelligent intelligence analysis system; 化柏林, 李广建; Library & Information, Issue 06; full text *
Also Published As
Publication number | Publication date |
---|---|
CN112700011A (en) | 2021-04-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10864442B2 (en) | System and method for controlling technical processes | |
US10552183B2 (en) | Tailoring user interface presentations based on user state | |
CN111494935B (en) | Method and device for controlling virtual object in game | |
CN113262476A (en) | Position adjusting method and device of operation control, terminal and storage medium | |
US11341705B1 (en) | Animated transitions in data visualizations according to characteristics of the data visualizations | |
CN112700011B (en) | Agent decision information display method and device, electronic equipment and storage medium | |
CN113680047B (en) | Terminal operation method, device, electronic equipment and storage medium | |
WO2021213234A1 (en) | Method and apparatus for providing machine learning application, electronic device, and storage medium | |
EP4416586A1 (en) | Visual programming environment for developing interactive media programs | |
Boaventura et al. | A Feature-Based Approach to Develop Digital Board Games | |
US20200348834A1 (en) | Navigation System on the Infinite Scroll | |
CN116966554A (en) | Interactive processing method and device for virtual scene, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |