CN112906881A - Man-machine confrontation knowledge data hybrid driving type decision-making method and device and electronic equipment

Man-machine confrontation knowledge data hybrid driving type decision-making method and device and electronic equipment

Info

Publication number
CN112906881A
CN112906881A (application number CN202110489056.8A)
Authority
CN
China
Prior art keywords
action
task
child node
monte carlo
tactics
Prior art date
Legal status
Granted
Application number
CN202110489056.8A
Other languages
Chinese (zh)
Other versions
CN112906881B (en)
Inventor
赵美静
黄凯奇
尹奇跃
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110489056.8A
Publication of CN112906881A
Application granted
Publication of CN112906881B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/065 Analogue means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 Computing arrangements based on specific mathematical models
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks


Abstract

The invention relates to the field of artificial intelligence, in particular to a man-machine confrontation knowledge data hybrid-driven decision-making method and device, an electronic device, and a storage medium. The method comprises the following steps: at each decision time node, first searching a decision rule base for the action tasks corresponding to the action units under the current man-machine confrontation situation, and, when the decision rule base contains no action tasks for the action units under the current situation, making the decision online based on Monte Carlo tree search. The invention is suitable for producing confrontation decisions in a man-machine confrontation environment.

Description

Man-machine confrontation knowledge data hybrid driving type decision-making method and device and electronic equipment
Technical Field
The invention relates to the field of artificial intelligence, in particular to a man-machine confrontation knowledge data hybrid driving type decision-making method and device, electronic equipment and a storage medium.
Background
Man-machine confrontation is a leading direction of artificial intelligence research and has become a research hotspot in the intelligence field at home and abroad; it provides an effective test environment and path for exploring the internal growth mechanism of machine intelligence and for verifying key technologies. At present, facing the demands of intelligent cognition and decision-making in complex, dynamic, and adversarial environments, the assistance and support of machine-borne artificial intelligence technology are urgently needed.
With the rapid development of artificial intelligence technology, more and more real-world application systems will be confronted. Among the technical routes for man-machine confrontation decision-making, knowledge-driven decision methods have, on the one hand, the advantage of interpretability, but their decision performance is constrained by the knowledge bottleneck; on the other hand, data-driven decision methods can learn autonomously, but because of their 'black box' mechanism their decision results are hard to explain. In real-world application scenarios, how to make full use of the advantages of both knowledge-driven and data-driven decision methods, so that the man-machine confrontation decision process is both interpretable and learnable, is of great significance for improving the autonomy and intelligence of man-machine confrontation decision-making.
Disclosure of Invention
Based on this, the embodiment of the invention provides a man-machine confrontation knowledge data hybrid drive type decision method, a device, an electronic device and a storage medium, which can make full use of the advantages of both the knowledge drive type decision method and the data drive type decision method, so that the man-machine confrontation decision process can be interpreted and learned.
In a first aspect, an embodiment of the present invention provides a human-machine confrontation knowledge data hybrid-driven decision method, where the method includes: acquiring the current man-machine confrontation situation at each decision time node; searching action tasks corresponding to the action units under the current man-machine confrontation situation in a decision rule base, wherein the decision rule base stores the corresponding relation between the action units and the action tasks under various man-machine confrontation situations; if the action tasks corresponding to the action units under the current man-machine confrontation situation are not found in the decision rule base, the action tasks corresponding to the action units under the current man-machine confrontation situation are determined based on Monte Carlo tree searching; and sending action tasks corresponding to the action units respectively in the current man-machine confrontation situation to the corresponding action units so that the action units execute the action tasks.
Optionally, the method further includes: and if action tasks corresponding to the action units under the current man-machine confrontation situation are found in the decision rule base, sending the found action tasks to the corresponding action units so that the action units execute the action tasks.
Optionally, the determining, based on the monte carlo tree search, action tasks respectively corresponding to each action unit in the current human-computer confrontation situation includes: respectively matching an action task for each action unit according to the matching strategy to generate a first tactic; expanding the first tactics through an expansion strategy to generate at least one second tactics, wherein the action task of at least one action unit in the second tactics is different from the action task of the action unit in the first tactics; constructing a Monte Carlo tree by taking the first tactics as a root node of the Monte Carlo tree and taking the second tactics as a first-level child node of the Monte Carlo tree; continuing to expand the Monte Carlo tree according to the expansion strategy until the Monte Carlo tree reaches the design depth; searching the Monte Carlo tree for the optimal tactics under the current man-machine confrontation situation; and taking the action task corresponding to each action unit in the optimal tactics as the action task corresponding to each action unit in the current man-machine confrontation situation.
Optionally, matching an action task for each action unit according to the matching strategy to generate the first tactic includes: randomly matching an action task for each action unit to generate the first tactic.
Optionally, the action task includes at least one of the following task elements: a task object, a task target point, a task key point, a task end time node, and a task action. Expanding the first tactic through an expansion strategy to generate at least one second tactic includes: adjusting a task element of the action task of at least one action unit in the first tactic to generate the at least one second tactic.
Optionally, continuing to expand the Monte Carlo tree according to the expansion strategy until the Monte Carlo tree reaches the design depth includes: selecting an expansion child node from the first-level child nodes according to an upper confidence bound algorithm formula; expanding the second tactic corresponding to the expansion child node according to the expansion strategy to generate at least one third tactic; taking each third tactic as a second-level child node of the Monte Carlo tree, where each second-level child node is a child node of the expansion child node; and continuing to select an expansion child node from the second-level child nodes according to the upper confidence bound algorithm formula and to expand the third tactic corresponding to that expansion child node according to the expansion strategy, until the Monte Carlo tree reaches the design depth.
Optionally, the searching for the optimal tactics under the current human-computer confrontation situation from the monte carlo tree includes: selecting one child node from the last-level child nodes of the Monte Carlo tree as a simulation child node; simulating the tactics corresponding to the simulated child nodes under the current man-machine confrontation situation according to a simulation strategy to obtain a simulation result; recording the simulation result of the simulation child node and adding 1 to the access times corresponding to the simulation child node; backtracking the simulation result of the simulation child node and the access times corresponding to the simulation child node to all levels of father nodes of the simulation child node so that all levels of father nodes of the simulation child node record the simulation result of the simulation child node and the access times corresponding to the simulation child node; searching the leaf node with the most access times from the Monte Carlo tree, and taking the tactics corresponding to the leaf node as the optimal tactics under the current man-machine confrontation situation.
In a second aspect, an embodiment of the present invention provides a man-machine confrontation knowledge data hybrid-driven decision device, including: an acquisition unit, used for acquiring the current man-machine confrontation situation at each decision time node; a searching unit, used for searching a decision rule base for the action tasks corresponding to the action units under the current man-machine confrontation situation, where the decision rule base stores the correspondence between action units and action tasks under various man-machine confrontation situations; a determining unit, used for determining, based on Monte Carlo tree search, the action tasks corresponding to the action units under the current man-machine confrontation situation if those action tasks are not found in the decision rule base; and a first sending unit, used for sending the action tasks corresponding to the action units under the current man-machine confrontation situation to the corresponding action units so that each action unit executes its action task.
Optionally, the apparatus further comprises: and the second sending unit is used for sending each found action task to the corresponding action unit if the action task corresponding to each action unit under the current man-machine confrontation situation is found in the decision rule base, so that each action unit executes the action task.
Optionally, the determining unit includes: a matching subunit, used for matching an action task for each action unit according to the matching strategy to generate a first tactic; an expansion subunit, used for expanding the first tactic through an expansion strategy to generate at least one second tactic, where the action task of at least one action unit in the second tactic differs from that action unit's task in the first tactic; a construction subunit, used for constructing a Monte Carlo tree by taking the first tactic as the root node of the Monte Carlo tree and taking the second tactics as first-level child nodes of the Monte Carlo tree, and for continuing to expand the Monte Carlo tree according to the expansion strategy until the Monte Carlo tree reaches the design depth; a searching subunit, used for searching the Monte Carlo tree for the optimal tactic under the current man-machine confrontation situation; and a determining subunit, used for taking the action task corresponding to each action unit in the optimal tactic as the action task corresponding to that action unit under the current man-machine confrontation situation.
Optionally, the matching subunit is specifically configured to randomly match an action task for each action unit to generate the first tactic.
Optionally, the action task includes at least one of the following task elements: a task object, a task target point, a task key point, a task end time node, and a task action; the expansion subunit is specifically configured to adjust a task element of the action task of at least one action unit in the first tactic to generate the at least one second tactic.
Optionally, the construction subunit is specifically configured to: select an expansion child node from the first-level child nodes according to an upper confidence bound algorithm formula; expand the second tactic corresponding to the expansion child node according to the expansion strategy to generate at least one third tactic; take each third tactic as a second-level child node of the Monte Carlo tree, where each second-level child node is a child node of the expansion child node; and continue to select an expansion child node from the second-level child nodes according to the upper confidence bound algorithm formula and to expand the third tactic corresponding to that expansion child node according to the expansion strategy, until the Monte Carlo tree reaches the design depth.
Optionally, the search subunit is specifically configured to: selecting one child node from the last-level child nodes of the Monte Carlo tree as a simulation child node; simulating the tactics corresponding to the simulated child nodes under the current man-machine confrontation situation according to a simulation strategy to obtain a simulation result; recording the simulation result of the simulation child node and adding 1 to the access times corresponding to the simulation child node; backtracking the simulation result of the simulation child node and the access times corresponding to the simulation child node to all levels of father nodes of the simulation child node so that all levels of father nodes of the simulation child node record the simulation result of the simulation child node and the access times corresponding to the simulation child node; searching the leaf node with the most access times from the Monte Carlo tree, and taking the tactics corresponding to the leaf node as the optimal tactics under the current man-machine confrontation situation.
In a third aspect, an embodiment of the present invention provides an electronic device, including: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus; the memory for storing a computer program; the processor is configured to execute the program stored in the memory to implement the human-machine confrontation knowledge data hybrid driving type decision method of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the method for human-machine confrontation knowledge data hybrid-driven decision making according to the first aspect is implemented.
Compared with the prior art, the technical scheme provided by the embodiment of the invention has the following advantages:
According to the man-machine confrontation knowledge data hybrid-driven decision method, device, electronic device, and storage medium, the current man-machine confrontation situation is acquired at each decision time node; a decision rule base, which stores the correspondence between action units and action tasks under various man-machine confrontation situations, is searched for the action tasks corresponding to the action units under the current situation; if those action tasks are not found in the decision rule base, they are determined based on Monte Carlo tree search; and the action tasks corresponding to the action units under the current man-machine confrontation situation are sent to the corresponding action units so that each action unit executes its action task. Thus, at each decision time node the decision rule base is searched first, and when the rule base contains no action tasks for the action units under the current situation, an online decision is made based on Monte Carlo tree search. This fully exploits the two technical routes of knowledge rules and data-driven learning in real-time man-machine confrontation decision-making, makes the decision process interpretable and learnable, and effectively improves the autonomy and intelligence of man-machine confrontation decisions.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flow chart of a human-machine confrontation knowledge data hybrid-driven decision method provided in an embodiment of the present invention;
FIG. 2 is a first partial flowchart of a human-machine confrontation knowledge data hybrid-driven decision method according to an embodiment of the present invention;
FIG. 3 is a second partial flowchart of a human-machine confrontation knowledge data hybrid-driven decision method according to an embodiment of the present invention;
FIG. 4 is a third flowchart illustrating a human-machine confrontation knowledge data hybrid-driven decision method according to an embodiment of the present invention;
fig. 5 is a fourth flowchart of a human-machine confrontation knowledge data hybrid-driven decision method according to an embodiment of the present invention;
fig. 6 is a fifth partial flowchart of a human-machine confrontation knowledge data hybrid driving type decision method provided in an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a human-machine confrontation knowledge data hybrid driving type decision device provided in an embodiment of the present invention;
fig. 8 is a schematic structural connection diagram of an electronic device provided in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In a first aspect, the man-machine confrontation knowledge data hybrid driving type decision method provided by the embodiment of the invention can fully utilize the advantages of both the knowledge driving type decision method and the data driving type decision method, so that the man-machine confrontation decision process can be explained and learned.
As shown in fig. 1, an embodiment of the present invention provides a human-machine confrontation knowledge data hybrid driving type decision method, including:
s101, acquiring a current man-machine confrontation situation at each decision time node;
in this step, the decision may be made at regular intervals, or may be triggered according to other conditions, for example, according to one or some environmental characteristics of the man-machine confrontation environment.
The man-machine confrontation situation can be obtained by analyzing environmental characteristic information in the current man-machine confrontation environment; such information can include own-side force information, own-side force position information, enemy force information, enemy force position information, terrain information, and the like. The man-machine confrontation situation can include scores, threats, intents, win probability, and the like.
S102, action tasks corresponding to the action units under the current man-machine confrontation situation are searched in a decision rule base, wherein the decision rule base stores the corresponding relations between the action units and the action tasks under the various man-machine confrontation situations;
in this step, the decision rule base may be constructed offline, and the decision rule base is used to provide action tasks that should be selected by the action unit of one party under different human-computer confrontation situations, where the action tasks may include aggregation, reconnaissance, attack, avoidance, shelter, support, and the like.
Each action task may be represented as an eight-tuple: task name, task subject, task object, task target point, task key point, task end time node, task action, and task state. The task name is the specific name of the action task; the task subject is the own-side action unit that executes the action task; the task object is the action unit that the action task acts upon; the task target point is the destination of the action task; the task key point is an important position passed through during execution of the action task; the task end time node is the time node at which execution of the action task ends; the task action is the concrete action that drives the task subject to execute the action task; and the task state is the current state of the action task, which may specifically include "waiting", "in progress", "completed", and the like.
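For concreteness, this eight-tuple can be held in a small record type. The sketch below is illustrative only; the field names, types, and the `TaskState` values are assumptions chosen to mirror the description above, not definitions from the patent.

```python
from dataclasses import dataclass
from enum import Enum

class TaskState(Enum):          # assumed encoding of the task states named above
    WAITING = "waiting"
    IN_PROGRESS = "in progress"
    COMPLETED = "completed"

@dataclass
class ActionTask:
    """Eight-tuple describing one action task (all field names are assumptions)."""
    name: str                             # task name, e.g. "attack"
    subject: str                          # own-side action unit executing the task
    object: str                           # action unit the task acts upon
    target_point: tuple[int, int]         # destination of the task
    key_point: tuple[int, int]            # important position passed during execution
    end_time: int                         # task end time node, e.g. seconds
    action: str                           # concrete action driving the subject
    state: TaskState = TaskState.WAITING  # current task state
```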
The decision rule base can be searched online for the action tasks corresponding to the action units under the current man-machine confrontation situation. During this search, once a match succeeds, matching can stop, and the matched correspondence is used as the decision result.
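A minimal sketch of this first-match lookup, assuming the rule base is keyed by a discretized situation descriptor; the key format, unit names, and task names are illustrative assumptions:

```python
from typing import Optional

# Assumed layout: situation key -> {action unit -> action task name}
rule_base: dict[str, dict[str, str]] = {
    "enemy_near|score_ahead": {"tank": "attack", "infantry": "shield"},
    "enemy_far|score_behind": {"tank": "reconnaissance", "infantry": "rendezvous"},
}

def lookup_tasks(situation_key: str) -> Optional[dict[str, str]]:
    """Return the unit -> task mapping for the situation, or None to trigger MCTS."""
    return rule_base.get(situation_key)  # stop at the first successful match

tasks = lookup_tasks("enemy_near|score_ahead")
if tasks is None:
    pass  # fall back to the Monte Carlo tree search of step S103
```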
S103, if action tasks corresponding to the action units under the current man-machine confrontation situation are not found in the decision rule base, the action tasks corresponding to the action units under the current man-machine confrontation situation are determined based on Monte Carlo tree searching;
in this step, the action tasks of the action units can be generated through Monte Carlo tree search, and the optimal action tasks corresponding to the action units are selected from the nodes of the Monte Carlo tree and used as the action tasks corresponding to the action units under the current man-machine confrontation situation.
And S104, sending the action tasks corresponding to the action units under the current man-machine confrontation situation to the corresponding action units so that the action units execute the action tasks.
Each action task is sent to its corresponding action unit according to the correspondence between action tasks and action units; after acquiring its action task, each action unit enters the execution routine of that task and executes it.
According to the man-machine confrontation knowledge data hybrid-driven decision method, device, electronic device, and storage medium, the current man-machine confrontation situation is acquired at each decision time node; the decision rule base, which stores the correspondence between action units and action tasks under various man-machine confrontation situations, is searched for the action tasks corresponding to the action units under the current situation; if those action tasks are not found in the decision rule base, they are determined based on Monte Carlo tree search; and the action tasks corresponding to the action units under the current man-machine confrontation situation are sent to the corresponding action units so that each action unit executes its action task. Thus, at each decision time node the decision rule base is searched first, and when the rule base contains no action tasks for the action units under the current situation, an online decision is made based on Monte Carlo tree search. This fully exploits the two technical routes of knowledge rules and data-driven learning in real-time man-machine confrontation decision-making, makes the decision process interpretable and learnable, and effectively improves the autonomy and intelligence of man-machine confrontation decisions.
Optionally, if action tasks corresponding to the action units under the current man-machine confrontation situation are found in the decision rule base, sending each found action task to the corresponding action unit, so that each action unit executes the action task.
In this embodiment, when action tasks respectively corresponding to each action unit under the current man-machine confrontation situation exist in the decision rule base, the action tasks in the decision rule base can be directly sent to the corresponding action units, so that the application of knowledge rules in man-machine confrontation real-time decision is fully exerted, and the man-machine confrontation decision process can be interpreted and learned.
As shown in fig. 2, optionally, the determining, based on the monte carlo tree search, action tasks respectively corresponding to each action unit in the current human-computer confrontation situation includes:
s1031, respectively matching an action task for each action unit according to the matching strategy to generate a first tactic;
in this step, when the action tasks of the action units under the current man-machine confrontation situation do not exist in the decision rule base, that is, when the matching between the current man-machine confrontation situation and the man-machine confrontation situation in the decision rule base fails, the decision is made based on the monte carlo tree search.
Optionally, matching an action task for each action unit according to the matching strategy to generate the first tactic includes: randomly matching an action task for each action unit to generate the first tactic.
For example, n action tasks (n a natural number greater than 1) can be numbered b1, b2, ..., bn. Concretely, for a man-machine confrontation decision in a land war-chess game, assume that 6 action tasks are constructed, named "seize control", "attack", "shield", "reconnaissance", "evasion", and "rendezvous", and numbered b1, b2, b3, b4, b5, b6, respectively. Assuming there are three types of action units (tank, chariot, and infantry), an action task is randomly chosen for each action unit to form the tactic sequence Q0 = {b2, b3, b1}.
Of course, it is also possible to match each action unit with an action task it commonly performs or is good at, so that each action unit plays to its strengths, for example matching an "attack" task to a tank.
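A minimal sketch of this matching step, using the task numbering of the example above; the unit list and the seedable random generator are illustrative assumptions:

```python
import random

TASKS = ["seize control", "attack", "shield", "reconnaissance", "evasion", "rendezvous"]
UNITS = ["tank", "chariot", "infantry"]

def match_first_tactic(units: list[str], rng: random.Random) -> list[str]:
    """Randomly match one action task per action unit to form the first tactic Q0."""
    return [rng.choice(TASKS) for _ in units]

q0 = match_first_tactic(UNITS, random.Random(0))  # one task name per action unit
```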
S1032, expanding the first tactics through an expansion strategy to generate at least one second tactic, wherein the action task of at least one action unit in the second tactic is different from the action task of the action unit in the first tactic;
in this step, the action tasks of the action units in the first tactic can be finely adjusted through the extension strategy, so as to generate a plurality of second tactics.
Optionally, the action task may include at least one of the following task elements: a task object, a task target point, a task key point, a task end time node, and a task action. Expanding the first tactic through an expansion strategy to generate at least one second tactic includes: adjusting a task element of the action task of at least one action unit in the first tactic to generate the at least one second tactic.
In this embodiment, a task element of the action task of at least one action unit in the first tactic can be randomly fine-tuned: changing the task target point, task key point, task end time node, or task action yields an extended tactic. For example, in the first tactic Q0 = {b2, b3, b1}, the task end time node of the "attack" task numbered b2 is 800 seconds; applying a random adjustment of -50 drawn from the interval [-100, 100] gives a new end time node of 750 seconds, and the adjusted task is numbered b2^1. The task target point of the "shield" task numbered b3 has coordinates 2236; applying random adjustments of -2 and -2, drawn from the interval [-5, 5] for the horizontal and vertical axes respectively, gives new target point coordinates 2034, and the adjusted task is numbered b3^1. The task key point of the "seize control" task numbered b1 has coordinates 6633; applying random adjustments of -1 and +1 from the same interval gives new key point coordinates 6534, and the adjusted task is numbered b1^1. The expansion thus yields the second tactic Q01 = {b2^1, b3^1, b1^1}.
Repeating the above expansion step several times yields second tactics Q01 = {b2^1, b3^1, b1^1}, Q02 = {b2^2, b3^2, b1^2}, Q03 = {b2^3, b3^3, b1^3}, and so on.
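A sketch of this element-perturbation expansion over the `ActionTask` record sketched earlier; the choice of uniform integer adjustments follows the example above, and all names remain assumptions:

```python
import copy
import random

def expand_tactic(tactic: list["ActionTask"], rng: random.Random) -> list["ActionTask"]:
    """Perturb one task element of at least one action unit's task (second tactic)."""
    new_tactic = copy.deepcopy(tactic)
    task = rng.choice(new_tactic)            # at least one action unit is adjusted
    element = rng.randrange(3)
    if element == 0:    # adjust the task end time node within [-100, 100]
        task.end_time += rng.randint(-100, 100)
    elif element == 1:  # adjust the task target point within [-5, 5] per axis
        x, y = task.target_point
        task.target_point = (x + rng.randint(-5, 5), y + rng.randint(-5, 5))
    else:               # adjust the task key point within [-5, 5] per axis
        x, y = task.key_point
        task.key_point = (x + rng.randint(-5, 5), y + rng.randint(-5, 5))
    return new_tactic

# second_tactics = [expand_tactic(q0_tasks, random.Random(i)) for i in range(3)]  # Q01..Q03
```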
S1033, constructing a Monte Carlo tree by taking the first tactics as a root node of the Monte Carlo tree and taking the second tactics as a first-level child node of the Monte Carlo tree;
In this step, continuing the example above, the first tactic Q0 = {b2, b3, b1} is taken as the root node of the Monte Carlo tree, and the second tactics Q01 = {b2^1, b3^1, b1^1}, Q02 = {b2^2, b3^2, b1^2}, Q03 = {b2^3, b3^3, b1^3}, etc., are taken as the first-level child nodes of the Monte Carlo tree.
S1034, continuing to expand the Monte Carlo tree according to the expansion strategy until the Monte Carlo tree reaches the design depth;
In this step, following the same principle as in step S1032, the first-level child nodes of the Monte Carlo tree can be expanded according to the expansion strategy. Specifically, the first-level child node Q01 = {b2^1, b3^1, b1^1} is expanded according to the expansion strategy to obtain its child nodes Q011 = {b2^11, b3^11, b1^11}, Q012 = {b2^12, b3^12, b1^12}, Q013 = {b2^13, b3^13, b1^13}, and so on by analogy, until the Monte Carlo tree reaches the design depth. For example, if the design depth of the Monte Carlo tree is 10, the tree is expanded according to the expansion strategy until its level-10 child nodes are obtained.
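One way to hold this structure is a simple node record that also stores the per-node statistics used later by the selection and backtracking steps; the sketch below builds on the `expand_tactic` sketch above, and every name is an assumption:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tactic: list                     # one action task per action unit at this node
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0                  # access count n_j
    wins: float = 0.0                # accumulated simulation result w_j

def add_children(node: Node, k: int, rng: "random.Random") -> None:
    """Expand a node's tactic k times to create k child nodes."""
    for _ in range(k):
        node.children.append(Node(expand_tactic(node.tactic, rng), parent=node))
```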
S1035, searching the optimal tactics under the current man-machine confrontation situation from the Monte Carlo tree;
in this step, each node of the monte carlo tree may be traversed, and a tactic corresponding to a node with the highest winning rate, or the highest score, or a node with the highest access frequency is used as the optimal tactic in the current human-machine confrontation situation.
And S1036, taking the action tasks corresponding to the action units in the optimal tactics as the action tasks corresponding to the action units under the current man-machine confrontation situation.
As shown in fig. 3, optionally, in any of the above embodiments, the continuing to expand the monte carlo tree according to the expansion policy until the monte carlo tree reaches the design depth includes:
s10341, selecting an expansion sub-node from each first-level sub-node according to an upper confidence bound algorithm formula;
In this step, a value is calculated for each first-level child node according to an upper confidence bound algorithm formula (the UCB formula), and the first-level child node with the maximum value is taken as the expansion child node; that is, the expansion child node is selected according to the following formula:
UCB_j = w_j / n_j + C * sqrt( ln(N) / n_j )
where w_j is the number of simulated wins recorded by the j-th first-level child node, n_j is the number of accesses to the j-th first-level child node, N is the number of accesses to the parent node of the j-th first-level child node, and C is a constant that can be chosen empirically, e.g., 1.44.
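A sketch of this selection rule over the `Node` record above; treating unvisited children as having infinite value is a common convention assumed here rather than stated in the text:

```python
import math

def ucb_select(parent: "Node", c: float = 1.44) -> "Node":
    """Pick the child maximizing w_j / n_j + c * sqrt(ln(N) / n_j)."""
    def ucb(child: "Node") -> float:
        if child.visits == 0:
            return math.inf          # assumed: explore unvisited children first
        return (child.wins / child.visits
                + c * math.sqrt(math.log(parent.visits) / child.visits))
    return max(parent.children, key=ucb)
```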
S10342, expanding the second tactics corresponding to the expansion sub-nodes according to the expansion strategy to generate at least one third tactic;
s10343, taking each third tactic as each second-level child node of the Monte Carlo tree, wherein each second-level child node is a child node of the expansion child node;
S10344, continuing to select an expansion child node from the second-level child nodes according to the upper confidence bound algorithm formula, and expanding the third tactic corresponding to that expansion child node according to the expansion strategy, until the Monte Carlo tree reaches the design depth.
In this embodiment, during expansion of the Monte Carlo tree, each time a new level of child nodes is expanded, one child node must be selected from the current leaf nodes as the expansion child node; selecting it according to the UCB formula is more conducive to expanding higher-quality tactics.
As shown in fig. 4, optionally, the searching for the optimal tactics under the current anti-man situation from the monte carlo tree includes:
s10351, selecting one child node from the last-level child nodes of the Monte Carlo tree as a simulation child node;
in this step, a child node may be randomly selected from the last-stage child nodes as the simulation child node.
S10352, simulating tactics corresponding to the simulated child nodes under the current man-machine confrontation situation according to a simulation strategy to obtain a simulation result;
In this step, the simulation strategy may be implemented as follows: introduce a demo opponent whose action tasks are randomly selected for each of its action units at each decision time node, and execute the tactic corresponding to the simulation child node against it in a rehearsal (preview) mode to obtain the simulation result. The simulation result can be expressed as a win or a loss, or as the score accumulated from the current decision time node to the end of the confrontation.
S10353, recording the simulation result of the simulation child node and adding 1 to the access times corresponding to the simulation child node;
In this step, each node of the Monte Carlo tree records two values: the number of simulations run through this node and its descendants, and the aggregate simulation result (which can be the number of wins, the score, or a weight computed from the wins/scores). For example, 4 wins out of 10 simulations is recorded as 4/10.
S10354, tracing the current simulation result of the simulation child node and the access times corresponding to the simulation child node back to all levels of father nodes of the simulation child node, so that all levels of father nodes of the simulation child node record the current simulation result of the simulation child node and the access times corresponding to the simulation child node;
s10355, searching the leaf node with the most access times from the Monte Carlo tree, and taking the tactics corresponding to the leaf node as the optimal tactics under the current man-machine confrontation situation.
In this embodiment, a leaf node of the Monte Carlo tree is a node with no child nodes below it, i.e., an end of the Monte Carlo tree. The leaf node with the most accesses is the one whose corresponding tactic is most likely to be the optimal tactic in the Monte Carlo tree.
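Steps S10351 to S10355 can be sketched as follows over the `Node` record above; the `simulate` placeholder stands in for the rehearsal against a random demo opponent, and its return convention (1 for a win, 0 otherwise), like all names here, is an assumption:

```python
import random

def simulate(tactic) -> int:
    """Placeholder for rehearsing a tactic against a random demo opponent."""
    return random.randint(0, 1)      # assumed: 1 = win, 0 = loss

def backpropagate(node: "Node", result: int) -> None:
    """Record the result and visit count on the node and all its ancestors."""
    while node is not None:
        node.visits += 1
        node.wins += result
        node = node.parent

def best_leaf(root: "Node") -> "Node":
    """Return the most-visited leaf; its tactic is taken as the optimal tactic."""
    leaves, stack = [], [root]
    while stack:
        n = stack.pop()
        if n.children:
            stack.extend(n.children)
        else:
            leaves.append(n)
    return max(leaves, key=lambda n: n.visits)

# One search iteration (root and last_level_nodes assumed built as above):
#   leaf = random.choice(last_level_nodes)
#   backpropagate(leaf, simulate(leaf.tactic))
#   optimal = best_leaf(root)
```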
As shown in fig. 5, optionally, in the above embodiment, the taking the action task corresponding to each action unit in the optimal tactics as the action task corresponding to each action unit in the current human-machine confrontation situation includes:
s10361, judging whether the optimal tactics meet design requirements or not according to a judgment strategy;
in this step, the decision strategy may be: if the simulated winning rate of the optimal tactics is greater than a first preset value, determining that the optimal tactics meets the design requirement; or
If the simulation score of the optimal tactics is larger than a second preset value, determining that the optimal tactics meets the design requirement; or
And if the weight of the optimal tactics is greater than a third preset value, determining that the optimal tactics meets the design requirement, wherein the weight of the optimal tactics can be obtained by calculation according to the simulation winning rate or the simulation score of the optimal tactics.
S10362, if the optimal tactics meet design requirements, taking action tasks corresponding to the action units in the optimal tactics as action tasks corresponding to the action units under the current man-machine confrontation situation.
In this embodiment, a threshold is set in advance for the optimal tactic obtained from the Monte Carlo tree search: the optimal tactic is adopted only when it satisfies the preset condition. This improves the probability of winning when the optimal tactic is applied to the current man-machine confrontation situation.
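A minimal sketch of this acceptance check using the win-rate criterion; the threshold value is an assumed stand-in for the first preset value, and the score- and weight-based variants of S10361 follow the same pattern:

```python
def meets_design_requirement(node: "Node", min_win_rate: float = 0.6) -> bool:
    """Accept the optimal tactic only if its simulated win rate clears the threshold."""
    return node.visits > 0 and node.wins / node.visits > min_win_rate
```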
As shown in fig. 6, optionally, in the above embodiment, after determining whether the optimal tactics meets the design requirement according to the decision strategy, the method may further include:
s10363, if the optimal tactics do not meet design requirements, continuing to expand the Monte Carlo tree according to the expansion strategy until the Monte Carlo tree reaches the design depth;
s10364, searching the optimal tactics under the current man-machine confrontation situation again from the Monte Carlo tree;
and S10365, taking the action tasks corresponding to the action units in the optimal tactics as the action tasks corresponding to the action units under the current man-machine confrontation situation.
In this embodiment, when the optimal tactic searched from the current Monte Carlo tree does not meet the design requirements, the Monte Carlo tree continues to be expanded from its root node according to the expansion strategy, until the optimal tactic searched from the tree meets the design requirements or the number of times the Monte Carlo tree has been constructed reaches a preset number (for example, 100); the optimal tactic found in the last search is then used as the final optimal tactic.
The specific method for expanding the monte carlo tree according to the expansion strategy, which starts from the root node of the monte carlo tree, may refer to the foregoing contents, and is not described herein again.
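The retry loop of S10363 to S10365 can be sketched as follows; the iteration cap of 100 follows the example above, and `expand_tree` stands in for the expansion-to-design-depth procedure already described (all names are assumptions):

```python
def decide_by_mcts(root: "Node", rng: "random.Random", max_builds: int = 100) -> list:
    """Re-expand and re-search until the optimal tactic meets the design requirement,
    or the (assumed) build limit is reached; return the last optimal tactic found."""
    optimal = None
    for _ in range(max_builds):
        expand_tree(root, rng)                 # extend to the design depth (S10363)
        optimal = best_leaf(root)              # re-search the optimal tactic (S10364)
        if meets_design_requirement(optimal):
            break
    return optimal.tactic                      # one action task per action unit (S10365)
```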
In a second aspect, the embodiment of the invention provides a man-machine confrontation knowledge data hybrid drive type decision device, which can make full use of the advantages of both the knowledge drive type decision method and the data drive type decision method, so that the man-machine confrontation decision process can be explained and learned.
As shown in fig. 7, an embodiment of the present invention provides a man-machine confrontation knowledge data hybrid driving type decision apparatus, including:
an obtaining unit 21, configured to obtain, at each decision time node, a current human-machine confrontation situation; the searching unit 22 is configured to search, in a decision rule base, action tasks corresponding to the action units in the current human-computer confrontation situation, where the decision rule base stores corresponding relationships between the action units and the action tasks in the various human-computer confrontation situations;
the determining unit 23 is configured to determine, based on monte carlo tree search, action tasks corresponding to the action units in the current human-computer confrontation situation if the action tasks corresponding to the action units in the current human-computer confrontation situation are not found in the decision rule base;
the first sending unit 24 is configured to send the action tasks corresponding to the action units in the current human-computer confrontation situation to the corresponding action units, so that each action unit executes the action task.
Optionally, the apparatus further comprises: and the second sending unit is used for sending each found action task to the corresponding action unit if the action task corresponding to each action unit under the current man-machine confrontation situation is found in the decision rule base, so that each action unit executes the action task.
Optionally, the determining unit includes: a matching subunit, used for matching an action task for each action unit according to the matching strategy to generate a first tactic; an expansion subunit, used for expanding the first tactic through an expansion strategy to generate at least one second tactic, where the action task of at least one action unit in the second tactic differs from that action unit's task in the first tactic; a construction subunit, used for constructing a Monte Carlo tree by taking the first tactic as the root node of the Monte Carlo tree and taking the second tactics as first-level child nodes of the Monte Carlo tree, and for continuing to expand the Monte Carlo tree according to the expansion strategy until the Monte Carlo tree reaches the design depth; a searching subunit, used for searching the Monte Carlo tree for the optimal tactic under the current man-machine confrontation situation; and a determining subunit, used for taking the action task corresponding to each action unit in the optimal tactic as the action task corresponding to that action unit under the current man-machine confrontation situation.
Optionally, the matching subunit is specifically configured to randomly match an action task for each action unit to generate the first tactic.
Optionally, the action task includes at least one of the following task elements: a task object, a task target point, a task key point, a task end time node, and a task action; the expansion subunit is specifically configured to adjust a task element of the action task of at least one action unit in the first tactic to generate the at least one second tactic.
Optionally, the construction subunit is specifically configured to: select an expansion child node from the first-level child nodes according to an upper confidence bound algorithm formula; expand the second tactic corresponding to the expansion child node according to the expansion strategy to generate at least one third tactic; take each third tactic as a second-level child node of the Monte Carlo tree, where each second-level child node is a child node of the expansion child node; and continue to select an expansion child node from the second-level child nodes according to the upper confidence bound algorithm formula and to expand the third tactic corresponding to that expansion child node according to the expansion strategy, until the Monte Carlo tree reaches the design depth.
Optionally, the search subunit is specifically configured to: selecting one child node from the last-level child nodes of the Monte Carlo tree as a simulation child node; simulating the tactics corresponding to the simulated child nodes under the current man-machine confrontation situation according to a simulation strategy to obtain a simulation result; recording the simulation result of the simulation child node and adding 1 to the access times corresponding to the simulation child node; backtracking the simulation result of the simulation child node and the access times corresponding to the simulation child node to all levels of father nodes of the simulation child node so that all levels of father nodes of the simulation child node record the simulation result of the simulation child node and the access times corresponding to the simulation child node; searching the leaf node with the most access times from the Monte Carlo tree, and taking the tactics corresponding to the leaf node as the optimal tactics under the current man-machine confrontation situation.
Based on the same concept, an embodiment of the present application further provides an electronic device, as shown in fig. 8, the electronic device mainly includes: the system comprises a processor 201, a communication interface 202, a memory 203 and a communication bus 204, wherein the processor 201, the communication interface 202 and the memory 203 are communicated with each other through the communication bus 204. Wherein, the memory 203 stores programs that can be executed by the processor 201, and the processor 201 executes the programs stored in the memory 203, implementing the following steps: acquiring the current man-machine confrontation situation at each decision time node; searching action tasks corresponding to the action units under the current man-machine confrontation situation in a decision rule base, wherein the decision rule base stores the corresponding relation between the action units and the action tasks under various man-machine confrontation situations; sending each searched action task to a corresponding action unit so that each action unit executes the action task; if the action tasks corresponding to the action units under the current man-machine confrontation situation do not exist in the decision rule base, executing the following operations: respectively matching an action task for each action unit according to the matching strategy to generate a first tactic; expanding the first tactics through an expansion strategy to generate at least one second tactics, wherein the action task of at least one action unit in the second tactics is different from the action task of the action unit in the first tactics; constructing a Monte Carlo tree by taking the first tactics as a root node of the Monte Carlo tree and taking the second tactics as a first-level child node of the Monte Carlo tree; continuing to expand the Monte Carlo tree according to the expansion strategy until the Monte Carlo tree reaches the design depth; searching the Monte Carlo tree for the optimal tactics under the current man-machine confrontation situation; and respectively sending the action task corresponding to each action unit in the optimal tactics to the corresponding action unit so as to enable each action unit to execute the action task.
The communication bus 204 mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 204 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
The communication interface 202 is used for communication between the above-described electronic apparatus and other apparatuses.
The Memory 203 may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Alternatively, the memory may be at least one memory device located remotely from the processor 201.
The Processor 201 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc., and may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic devices, discrete gates or transistor logic devices, and discrete hardware components.
In yet another embodiment of the present application, there is also provided a computer-readable storage medium having stored therein a computer program which, when run on a computer, causes the computer to execute the human-machine confrontation knowledge data hybrid-driven decision method described in the above embodiment, the method mainly comprising: acquiring the current man-machine confrontation situation at each decision time node; searching action tasks corresponding to the action units under the current man-machine confrontation situation in a decision rule base, wherein the decision rule base stores the corresponding relation between the action units and the action tasks under various man-machine confrontation situations; sending each searched action task to a corresponding action unit so that each action unit executes the action task; if the action tasks corresponding to the action units under the current man-machine confrontation situation do not exist in the decision rule base, executing the following operations: respectively matching an action task for each action unit according to the matching strategy to generate a first tactic; expanding the first tactics through an expansion strategy to generate at least one second tactics, wherein the action task of at least one action unit in the second tactics is different from the action task of the action unit in the first tactics; constructing a Monte Carlo tree by taking the first tactics as a root node of the Monte Carlo tree and taking the second tactics as a first-level child node of the Monte Carlo tree; continuing to expand the Monte Carlo tree according to the expansion strategy until the Monte Carlo tree reaches the design depth; searching the Monte Carlo tree for the optimal tactics under the current man-machine confrontation situation; and respectively sending the action task corresponding to each action unit in the optimal tactics to the corresponding action unit so as to enable each action unit to execute the action task.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The available media may be magnetic media (e.g., floppy disks, hard disks, tapes, etc.), optical media (e.g., DVDs), or semiconductor media (e.g., solid state drives), among others.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing describes merely exemplary embodiments of the present invention, enabling those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

1. A man-machine confrontation knowledge data hybrid-driven decision-making method, characterized in that the method comprises:
acquiring the current man-machine confrontation situation at each decision time node;
searching a decision rule base for the action tasks corresponding to the action units under the current man-machine confrontation situation, wherein the decision rule base stores correspondences between action units and action tasks under various man-machine confrontation situations;
if no action tasks corresponding to the action units under the current man-machine confrontation situation are found in the decision rule base, determining, based on a Monte Carlo tree search, the action tasks corresponding to the action units under the current man-machine confrontation situation;
and sending the action task corresponding to each action unit under the current man-machine confrontation situation to that action unit so that each action unit executes its action task.
2. The method of claim 1, further comprising:
and if the action tasks corresponding to the action units under the current man-machine confrontation situation are found in the decision rule base, sending each found action task to the corresponding action unit so that each action unit executes its action task.
3. The method of claim 1, wherein the determining, based on the Monte Carlo tree search, the action tasks corresponding to the action units under the current man-machine confrontation situation comprises:
matching an action task to each action unit according to a matching strategy to generate a first tactic;
expanding the first tactic through an expansion strategy to generate at least one second tactic, wherein the action task of at least one action unit in each second tactic is different from that action unit's task in the first tactic;
constructing a Monte Carlo tree by taking the first tactic as the root node of the Monte Carlo tree and each second tactic as a first-level child node of the Monte Carlo tree;
continuing to expand the Monte Carlo tree according to the expansion strategy until the Monte Carlo tree reaches the design depth;
searching the Monte Carlo tree for the optimal tactic under the current man-machine confrontation situation;
and taking the action task corresponding to each action unit in the optimal tactic as the action task corresponding to that action unit under the current man-machine confrontation situation.
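By way of illustration of the flow recited in claim 3, the sketch below builds the first two levels of the tree in Python. The Node structure and the feasible_tasks and expand helpers are assumptions for illustration; the claims do not fix a concrete data structure:

import random

class Node:
    """Monte Carlo tree node holding one tactic (an {action_unit: action_task} map)."""
    def __init__(self, tactic, parent=None):
        self.tactic = tactic
        self.parent = parent
        self.children = []
        self.visits = 0          # number of visits recorded during backpropagation
        self.total_reward = 0.0  # accumulated simulation results

def build_tree(situation, action_units, expand):
    # First tactic: match one action task to every action unit (randomly here,
    # as claim 4 recites); it becomes the root node.
    first = {u: random.choice(u.feasible_tasks(situation)) for u in action_units}
    root = Node(first)
    # Second tactics: variants that change at least one unit's task become
    # the first-level child nodes.
    for variant in expand(first):
        root.children.append(Node(variant, parent=root))
    return root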
4. The method of claim 3, wherein the matching an action task to each action unit according to the matching strategy to generate a first tactic comprises:
randomly matching an action task to each action unit to generate the first tactic.
5. The method of claim 3, wherein the action task comprises at least one of the following task elements: a task object, a task target point, a task key point, a task end time node, and a task action; and the expanding the first tactic through an expansion strategy to generate at least one second tactic comprises:
adjusting a task element of the action task of at least one action unit in the first tactic to generate the at least one second tactic.
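One plausible reading of claim 5's expansion strategy, sketched in Python. An action task is modeled here as a dict of task elements; the element names and the mutate helper are hypothetical:

import copy
import random

# Task elements enumerated in claim 5; these field names are illustrative only.
TASK_ELEMENTS = ("task_object", "task_target_point", "task_key_point",
                 "task_end_time_node", "task_action")

def expand(tactic, mutate, n_variants=3):
    """Generate second tactics by adjusting one task element of one unit's task."""
    variants = []
    for _ in range(n_variants):
        # Copy the tasks but keep the original action-unit keys.
        variant = {u: copy.deepcopy(t) for u, t in tactic.items()}
        unit = random.choice(list(variant))      # the action unit to modify
        element = random.choice(TASK_ELEMENTS)   # the task element to adjust
        variant[unit][element] = mutate(variant[unit][element])
        variants.append(variant)
    return variants

Each call yields variants that differ from the parent tactic in at least one unit's task, matching the definition of a second tactic above.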
6. The method of claim 5, wherein continuing to expand the Monte Carlo tree according to the expansion policy until the Monte Carlo tree reaches a design depth comprises:
selecting an expansion child node from the first-level child nodes according to an upper confidence bound algorithm formula;
expanding the second tactic corresponding to the expansion child node according to the expansion strategy to generate at least one third tactic;
taking each third tactic as a second-level child node of the Monte Carlo tree, wherein each second-level child node is a child node of the expansion child node;
and continuing to select an expansion child node from the second-level child nodes according to the upper confidence bound algorithm formula and to expand the third tactic corresponding to that expansion child node according to the expansion strategy, until the Monte Carlo tree reaches the design depth.
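The claims name an "upper confidence bound algorithm formula" without fixing the variant; the sketch below assumes standard UCB1, score = w/n + c*sqrt(ln(N)/n), and reuses the Node class from the earlier sketch. Note that in the claimed flow the tree is grown to the design depth before any simulation, so visit counts may still be zero during selection; unvisited nodes are given an infinite score here, so early ties fall back to first-listed order:

import math

def ucb1(node, c=math.sqrt(2)):
    """UCB1 score of a child node (assumed variant of the upper confidence bound)."""
    if node.visits == 0:
        return float("inf")  # prefer nodes that have never been visited
    exploit = node.total_reward / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def grow_to_depth(root, expand, design_depth):
    """Repeat select-then-expand, one level at a time, until the design depth."""
    frontier, depth = root.children, 1
    while depth < design_depth:
        expansion_node = max(frontier, key=ucb1)       # selection by UCB score
        for variant in expand(expansion_node.tactic):  # apply the expansion strategy
            expansion_node.children.append(Node(variant, parent=expansion_node))
        frontier, depth = expansion_node.children, depth + 1

Here expand is any callable mapping a tactic to a list of variant tactics, such as the expansion sketch after claim 5 with its mutate argument bound.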
7. The method of claim 6, wherein the searching the Monte Carlo tree for the optimal tactic under the current man-machine confrontation situation comprises:
selecting one child node from the last-level child nodes of the Monte Carlo tree as a simulation child node;
simulating the tactic corresponding to the simulation child node under the current man-machine confrontation situation according to a simulation strategy to obtain a simulation result;
recording the simulation result of the simulation child node and incrementing the number of visits of the simulation child node by 1;
backpropagating the simulation result and the number of visits of the simulation child node to the simulation child node's parent nodes at all levels, so that each of those parent nodes records the simulation result and the number of visits of the simulation child node;
and searching the Monte Carlo tree for the leaf node with the largest number of visits, and taking the tactic corresponding to that leaf node as the optimal tactic under the current man-machine confrontation situation.
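Claim 7's simulate-record-backpropagate loop, continuing the same Python sketch. The simulate callable stands in for the simulation strategy, and the least-visited leaf-selection rule is an assumption; the claims only say one child node is selected from the last level:

def search_optimal(root, simulate, n_simulations=100):
    """Simulate leaf tactics, backpropagate, then return the most-visited leaf's tactic."""
    leaves = collect_leaves(root)
    for _ in range(n_simulations):
        leaf = min(leaves, key=lambda n: n.visits)  # choose a simulation child node
        reward = simulate(leaf.tactic)              # simulation result under the
                                                    # current confrontation situation
        node = leaf
        while node is not None:                     # record at the leaf, then
            node.visits += 1                        # backpropagate the result and
            node.total_reward += reward             # visit count to all parent levels
            node = node.parent
    best = max(leaves, key=lambda n: n.visits)      # leaf with the most visits
    return best.tactic

def collect_leaves(node):
    """Gather the childless nodes, treated here as the last-level child nodes."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in collect_leaves(child)]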
8. A man-machine confrontation knowledge data hybrid-driven decision-making device, characterized in that the device comprises:
an acquisition unit, configured to acquire the current man-machine confrontation situation at each decision time node;
a searching unit, configured to search a decision rule base for the action tasks corresponding to the action units under the current man-machine confrontation situation, wherein the decision rule base stores correspondences between action units and action tasks under various man-machine confrontation situations;
a determining unit, configured to determine, based on a Monte Carlo tree search, the action tasks corresponding to the action units under the current man-machine confrontation situation if those action tasks are not found in the decision rule base;
and a first sending unit, configured to send the action task corresponding to each action unit under the current man-machine confrontation situation to that action unit so that each action unit executes its action task.
9. The apparatus of claim 8, further comprising:
a second sending unit, configured to send each found action task to the corresponding action unit, if the action tasks corresponding to the action units under the current man-machine confrontation situation are found in the decision rule base, so that each action unit executes its action task.
10. The apparatus of claim 8, wherein the determining unit comprises:
a matching subunit, configured to match an action task to each action unit according to a matching strategy to generate a first tactic;
an expansion subunit, configured to expand the first tactic through an expansion strategy to generate at least one second tactic, wherein the action task of at least one action unit in each second tactic is different from that action unit's task in the first tactic;
a construction subunit, configured to construct a Monte Carlo tree by taking the first tactic as the root node of the Monte Carlo tree and each second tactic as a first-level child node of the Monte Carlo tree, and to continue expanding the Monte Carlo tree according to the expansion strategy until the Monte Carlo tree reaches the design depth;
a searching subunit, configured to search the Monte Carlo tree for the optimal tactic under the current man-machine confrontation situation;
and a determining subunit, configured to take the action task corresponding to each action unit in the optimal tactic as the action task corresponding to that action unit under the current man-machine confrontation situation.
11. The apparatus according to claim 10, wherein the matching subunit is specifically configured to:
randomly match an action task to each action unit to generate the first tactic.
12. The apparatus of claim 10, wherein the action task comprises at least one of the following task elements: a task object, a task target point, a task key point, a task end time node, and a task action; and the expansion subunit is specifically configured to:
adjust a task element of the action task of at least one action unit in the first tactic to generate the at least one second tactic.
13. The apparatus according to claim 12, wherein the construction subunit is specifically configured to:
select an expansion child node from the first-level child nodes according to an upper confidence bound algorithm formula;
expand the second tactic corresponding to the expansion child node according to the expansion strategy to generate at least one third tactic;
take each third tactic as a second-level child node of the Monte Carlo tree, wherein each second-level child node is a child node of the expansion child node;
and continue to select an expansion child node from the second-level child nodes according to the upper confidence bound algorithm formula and to expand the third tactic corresponding to that expansion child node according to the expansion strategy, until the Monte Carlo tree reaches the design depth.
14. The apparatus according to claim 13, wherein the searching subunit is specifically configured to:
select one child node from the last-level child nodes of the Monte Carlo tree as a simulation child node;
simulate the tactic corresponding to the simulation child node under the current man-machine confrontation situation according to a simulation strategy to obtain a simulation result;
record the simulation result of the simulation child node and increment the number of visits of the simulation child node by 1;
backpropagate the simulation result and the number of visits of the simulation child node to the simulation child node's parent nodes at all levels, so that each of those parent nodes records the simulation result and the number of visits of the simulation child node;
and search the Monte Carlo tree for the leaf node with the largest number of visits and take the tactic corresponding to that leaf node as the optimal tactic under the current man-machine confrontation situation.
15. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
and the processor is configured to execute the program stored in the memory to implement the man-machine confrontation knowledge data hybrid-driven decision method of any one of claims 1 to 7.
16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the man-machine confrontation knowledge data hybrid-driven decision method of any one of claims 1 to 7.
CN202110489056.8A 2021-05-06 2021-05-06 Man-machine confrontation knowledge data hybrid driving type decision-making method and device and electronic equipment Active CN112906881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110489056.8A CN112906881B (en) 2021-05-06 2021-05-06 Man-machine confrontation knowledge data hybrid driving type decision-making method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112906881A true CN112906881A (en) 2021-06-04
CN112906881B CN112906881B (en) 2021-08-03

Family

ID=76108967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110489056.8A Active CN112906881B (en) 2021-05-06 2021-05-06 Man-machine confrontation knowledge data hybrid driving type decision-making method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112906881B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101964019A (en) * 2010-09-10 2011-02-02 北京航空航天大学 Against behavior modeling simulation platform and method based on Agent technology
CA3067580A1 (en) * 2019-01-14 2020-07-14 Harbin Engineering University Threat situation assessment systems and methods for unmanned underwater vehicle
CN111679679A (en) * 2020-07-06 2020-09-18 哈尔滨工业大学 Robot state planning method based on Monte Carlo tree search algorithm
CN112329348A (en) * 2020-11-06 2021-02-05 东北大学 Intelligent decision-making method for military countermeasure game under incomplete information condition
CN112633519A (en) * 2021-03-11 2021-04-09 中国科学院自动化研究所 Man-machine antagonistic action prediction method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112906881B (en) 2021-08-03

Similar Documents

Publication Publication Date Title
Ilahi et al. Challenges and countermeasures for adversarial attacks on deep reinforcement learning
CN105991521B (en) Network risk assessment method and device
Buniyamin et al. Robot global path planning overview and a variation of ant colony system algorithm
Hu et al. A dynamic adjusting reward function method for deep reinforcement learning with adjustable parameters
Sloman What's information, for an organism or intelligent machine? How can a machine or organism mean?
Moradi et al. Automatic skill acquisition in reinforcement learning using graph centrality measures
CN111563192B (en) Entity alignment method, device, electronic equipment and storage medium
US11120354B2 (en) System and method for aiding decision
Hickling et al. Explainability in deep reinforcement learning: A review into current methods and applications
Liu et al. Evasion of a team of dubins vehicles from a hidden pursuer
CN114414231A (en) Mechanical arm autonomous obstacle avoidance planning method and system in dynamic environment
Du et al. An optimized path planning method for coastal ships based on improved DDPG and DP
Jin et al. GUARD: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models
CN112906881B (en) Man-machine confrontation knowledge data hybrid driving type decision-making method and device and electronic equipment
Corsi et al. Analyzing Adversarial Inputs in Deep Reinforcement Learning
Zhang et al. Automatic path planning for autonomous underwater vehicles based on an adaptive differential evolution
Pai et al. Achieving safe deep reinforcement learning via environment comprehension mechanism
CN112990452B (en) Man-machine confrontation knowledge driving type decision-making method and device and electronic equipment
CN116578080A (en) Local path planning method based on deep reinforcement learning
CN111723941B (en) Rule generation method and device, electronic equipment and storage medium
Xiong et al. Robustness to adversarial attacks in learning-enabled controllers
CN114964247A (en) Crowd sensing navigation method and system based on high-order graph convolution neural network
WO2018205245A1 (en) Strategy network model generation method and apparatus for automatic vehicle driving
CN109658742B (en) Dense flight autonomous conflict resolution method based on preorder flight information
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant