CN116432895A - Fight method and device - Google Patents


Info

Publication number
CN116432895A
Authority
CN
China
Prior art keywords
agent
task
unit
combat
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111667926.2A
Other languages
Chinese (zh)
Inventor
张秉桢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qishun Technology Co ltd
Original Assignee
Beijing Qishun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qishun Technology Co ltd filed Critical Beijing Qishun Technology Co ltd
Priority to CN202111667926.2A
Publication of CN116432895A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B9/00 Simulators for teaching or training purposes
    • G09B9/003 Simulators for teaching or training purposes for military purposes and tactics

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Computing Systems (AREA)
  • Educational Administration (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Development Economics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Educational Technology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure provides a combat method and device. The combat method comprises: receiving simulation information sent by a simulator in a current scenario; obtaining, based on the simulation information, information of at least one combat unit for the current scenario; for each of the at least one combat unit, performing the following operations: inputting the information for the combat unit into a first agent corresponding to the combat unit to obtain at least one first task, and inputting the at least one first task into the respectively corresponding second agents to obtain at least one second task corresponding to each first task, wherein the at least one second task is a plurality of tasks for accomplishing the corresponding first task; and returning all the second tasks to the simulator so that the simulator completes one engagement by executing all the second tasks.

Description

Fight method and device
Technical Field
The present application relates to the field of reinforcement learning, and in particular to a combat method and device.
Background
Wargaming is a tool for simulating the combat activities of all parties with full elements and over the full process; through the mutual influence of combat decisions and mutual gaming, it can be used to formulate and evaluate operational plans, demonstrate the effectiveness of weapons and equipment, test and verify combat plans, and train commanders' command and decision-making capability. Modern computer wargaming systems combine simulation technology with information technology, integrate the scientific content of operations research, and incorporate new technical elements such as cloud computing and artificial intelligence, so that large-scale joint combat operations can be simulated.
At present, research on game-deduction methods for wargaming systems commonly adopts three approaches: 1. expert-rule deduction based on behavior trees; 2. deduction based on traditional military operations research; 3. deduction using an end-to-end reinforcement learning algorithm. The third approach, end-to-end reinforcement learning deduction, models the decision problem in wargame deduction as a Markov decision process: an agent interacts with the environment and determines the decision action to output from the observed state of the current step and the reward of the previous action.
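A minimal sketch of this end-to-end deduction loop is given below; it is illustrative only, and the `env` and `policy` objects (a Gym-style environment and a policy with an `act` method) are assumptions rather than interfaces defined by the patent.

```python
def run_end_to_end_episode(env, policy):
    """Single-agent, end-to-end deduction: observation (+ previous reward) -> action."""
    obs = env.reset()
    prev_reward, done, total = 0.0, False, 0.0
    while not done:
        # the output decision action depends on the current observed state
        # and the reward obtained for the previous action
        action = policy.act(obs, prev_reward)
        obs, prev_reward, done, _ = env.step(action)
        total += prev_reward
    return total
```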
Disclosure of Invention
Exemplary embodiments of the present disclosure may address at least the above-described problems, although they are not required to do so.
According to a first aspect of the present disclosure, there is provided a combat method comprising: receiving simulation information sent by a simulator in a current scenario; obtaining, based on the simulation information, information of at least one combat unit for the current scenario; for each of the at least one combat unit, performing the following operations: inputting the information for the combat unit into a first agent corresponding to the combat unit to obtain at least one first task, and inputting the at least one first task into the respectively corresponding second agents to obtain at least one second task corresponding to each first task, wherein the at least one second task is a plurality of tasks for accomplishing the corresponding first task; and returning all the second tasks to the simulator so that the simulator completes one engagement by executing all the second tasks.
Optionally, the first agent is a reinforcement learning agent in the case where the combat unit meets a preset condition, and a knowledge-rule agent in the case where the combat unit does not meet the preset condition.
Optionally, in the case where the first agent is a reinforcement learning agent, the first agent is trained as follows: while the number of engagements completed by the simulator in the current scenario does not exceed a preset threshold, the following loop is executed: receiving simulation information sent by the simulator in the current scenario; obtaining, based on the simulation information, information of at least one combat unit for the current scenario; for each of the at least one combat unit, performing the following operations: inputting the information for the combat unit into the first agent corresponding to the combat unit to obtain at least one estimated first task, and inputting the at least one estimated first task into the respectively corresponding second agents to obtain at least one estimated second task corresponding to each estimated first task, wherein the at least one estimated second task is a plurality of tasks for accomplishing the corresponding estimated first task; returning all the estimated second tasks to the simulator so that the simulator completes one engagement by executing all the estimated second tasks; and adjusting the first agent based on a preset reward function and an intermediate reward value obtained from the simulation information. When the number of engagements completed by the simulator in the current scenario exceeds the preset threshold, training of the first agent ends.
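A hedged sketch of this training loop is given below; the object and function names (simulator, first_agents, second_agents, extract_unit_info, reward_fn) and their interfaces are assumptions used only for illustration, not APIs defined by the patent.

```python
def train_first_agents(simulator, first_agents, second_agents,
                       extract_unit_info, reward_fn, max_engagements):
    engagements = 0
    while engagements < max_engagements:              # preset threshold on completed engagements
        sim_info = simulator.receive()                # simulation info for the current scenario
        units = extract_unit_info(sim_info)           # information of each combat unit
        all_second_tasks = []
        for unit in units:
            first_agent = first_agents[unit["id"]]
            est_first_tasks = first_agent.predict(unit["info"])      # estimated first tasks
            for task in est_first_tasks:
                second_agent = second_agents[task["type"]]
                all_second_tasks += second_agent.decompose(task)     # estimated second tasks
        simulator.execute(all_second_tasks)           # simulator completes one engagement
        reward = reward_fn(simulator.receive())       # intermediate reward from simulation info
        for agent in first_agents.values():
            agent.update(reward)                      # adjust the reinforcement learning agent
        engagements += 1
```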
Optionally, obtaining the information of the at least one combat unit for the current scenario based on the simulation information includes: acquiring a basic situation information set in a standard format from the simulation information, the standard format being consistent with the input format of the first agent; performing evaluation analysis based on the basic situation information set to determine a combat plan corresponding to the at least one combat unit of the current scenario; and obtaining the information of the at least one combat unit for the current scenario based on the combat plan and the basic situation information set.
Optionally, in the case where the first agent is a knowledge-rule agent, inputting the information for the combat unit into the first agent corresponding to the combat unit to obtain the at least one first task includes: determining a combat strategy corresponding to the combat unit based on the combat plan in the information of the combat unit, and inputting the combat strategy corresponding to the combat unit and the basic situation information for the combat unit in the basic situation information set into the first agent corresponding to the combat unit to obtain the at least one first task, wherein the first agent contains task information for implementing the combat strategy.
Optionally, in the case where the second agent is a knowledge-rule agent, the second agent contains task information for implementing the corresponding first task.
Optionally, in the case where the first agent is a reinforcement learning agent, inputting the information for the combat unit into the first agent corresponding to the combat unit to obtain the at least one first task includes: inputting the combat plan corresponding to the combat unit and the basic situation information for the combat unit in the basic situation information set into the first agent corresponding to the combat unit to obtain the at least one first task.
According to a second aspect of the present disclosure, there is provided a combat device, the device comprising: a simulation information receiving unit configured to receive simulation information sent by a simulator in a current scenario; an information acquisition unit configured to obtain, based on the simulation information, information of at least one combat unit for the current scenario; a task acquisition unit configured to perform, for each of the at least one combat unit, the following operations: inputting the information for the combat unit into a first agent corresponding to the combat unit to obtain at least one first task, and inputting the at least one first task into the respectively corresponding second agents to obtain at least one second task corresponding to each first task, wherein the at least one second task is a plurality of tasks for accomplishing the corresponding first task; and a task transmitting unit configured to return all the second tasks to the simulator so that the simulator completes one engagement by executing all the second tasks.
Optionally, the first agent is a reinforcement learning agent in the case where the combat unit meets a preset condition, and a knowledge-rule agent in the case where the combat unit does not meet the preset condition.
Optionally, in the case where the first agent is a reinforcement learning agent, the first agent is trained as follows: while the number of engagements completed by the simulator in the current scenario does not exceed a preset threshold, the following loop is executed: receiving simulation information sent by the simulator in the current scenario; obtaining, based on the simulation information, information of at least one combat unit for the current scenario; for each of the at least one combat unit, performing the following operations: inputting the information for the combat unit into the first agent corresponding to the combat unit to obtain at least one estimated first task, and inputting the at least one estimated first task into the respectively corresponding second agents to obtain at least one estimated second task corresponding to each estimated first task, wherein the at least one estimated second task is a plurality of tasks for accomplishing the corresponding estimated first task; returning all the estimated second tasks to the simulator so that the simulator completes one engagement by executing all the estimated second tasks; and adjusting the first agent based on a preset reward function and an intermediate reward value obtained from the simulation information. When the number of engagements completed by the simulator in the current scenario exceeds the preset threshold, training of the first agent ends.
Optionally, the information acquisition unit is further configured to acquire a basic situation information set in a standard format from the simulation information, the standard format being consistent with the input format of the first agent; perform evaluation analysis based on the basic situation information set to determine a combat plan corresponding to the at least one combat unit of the current scenario; and obtain the information of the at least one combat unit for the current scenario based on the combat plan and the basic situation information set.
Optionally, in the case where the first agent is a knowledge-rule agent, the task acquisition unit is further configured to determine a combat strategy corresponding to the combat unit based on the combat plan in the information of the combat unit, and to input the combat strategy corresponding to the combat unit and the basic situation information for the combat unit in the basic situation information set into the first agent corresponding to the combat unit to obtain the at least one first task, wherein the first agent contains first task information for implementing the combat strategy.
Optionally, in the case where the second agent is a knowledge-rule agent, the second agent contains second task information for implementing the corresponding first task.
Optionally, in the case where the first agent is a reinforcement learning agent, the task acquisition unit is further configured to input the combat plan corresponding to the combat unit and the basic situation information for the combat unit in the basic situation information set into the first agent corresponding to the combat unit to obtain the at least one first task.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the combat method as described above.
According to a fourth aspect of the present disclosure, there is provided a system comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the combat method as described above.
According to the combat method and device of the present exemplary embodiment, instead of merely outputting a final decision, the original game model is layered; that is, the model is divided into a first agent and a second agent, where the first agent outputs a first task (e.g., a macro task) to be performed for the current scenario, and the second agent outputs at least one second task (e.g., an atomic task) for accomplishing that first task. In this way, combat plans can be tested and commanders' command and decision-making capability can be trained on the basis of the macro tasks output by the first agent, and the method is more interpretable than an end-to-end reinforcement-learning deduction method.
Drawings
These and/or other aspects and advantages of the present invention will become apparent from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 illustrates a flowchart of a combat method according to an exemplary embodiment of the present disclosure;
FIG. 2 illustrates a schematic diagram of the background of a combat scenario according to an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a flowchart of situation-information processing by the RL adapter according to an exemplary embodiment of the present disclosure;
FIG. 4 illustrates an architecture diagram of the mixed-decision mode of reinforcement learning agents and knowledge-rule agents according to an exemplary embodiment of the present disclosure;
FIG. 5 illustrates a system architecture diagram for training a reinforcement learning agent according to an exemplary embodiment of the present disclosure;
FIG. 6 illustrates a training framework diagram of a reinforcement learning agent according to an exemplary embodiment of the present disclosure;
FIG. 7 shows a block diagram of a combat device according to an exemplary embodiment of the present disclosure.
Detailed Description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of embodiments of the invention defined by the claims and their equivalents. Various specific details are included to aid understanding, but are merely to be considered exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
It should be noted that, in this disclosure, "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "at least one of step one and step two is executed" covers the following three parallel cases: (1) executing step one; (2) executing step two; (3) executing step one and step two.
Embodiments of the present disclosure will be described below in order to explain the present disclosure by referring to fig. 1 to 7.
Fig. 1 shows a flowchart of a combat method according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the combat method includes the following steps:
In step S101, simulation information sent by a simulator in a current scenario is received. The current scenario may be a military war simulation scenario or a wargame scenario; the present disclosure is not limited in this regard.
For example, taking the simulator of a wargame simulation system as an example, a combat scenario can be preset, and the background of the combat scenario may be as shown in fig. 2: the friendly objective (the defender, shown by the black labels in fig. 2) is to protect 2 command posts, the key defended targets, relying on integrated ground, sea-surface and air fire; the enemy objective (the attacker, shown by the white labels in the figure) is to comprehensively employ sea-air assault and support-and-guarantee forces to break through the friendly air-defense system and destroy the 2 key command-post targets. After the combat scenario is preset, the scenario environment can be started and initialized, the simulation rate set, and the battlefield real-time situation information (i.e., the simulation information) then generated.
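A toy stand-in for this initialization sequence is sketched below; the class and method names (WargameSimulator, initialize, set_simulation_rate, generate_situation) and the example rate value are illustrative assumptions, not the actual simulator interface.

```python
class WargameSimulator:
    """Minimal stand-in for the scenario environment described above."""
    def __init__(self, scenario: str):
        self.scenario = scenario
        self.rate = 1.0
        self.step = 0

    def initialize(self):
        self.step = 0

    def set_simulation_rate(self, rate: float):
        self.rate = rate

    def generate_situation(self) -> dict:
        # in a real system this would be the battlefield real-time situation
        self.step += 1
        return {"step": self.step, "scenario": self.scenario, "units": []}

sim = WargameSimulator("two-command-post defense")   # preset combat scenario
sim.initialize()
sim.set_simulation_rate(10.0)                        # rate value is an assumption
situation = sim.generate_situation()                 # simulation information
```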
In step S102, information of at least one combat unit for the current scenario is obtained based on the simulation information.
According to an exemplary embodiment of the present disclosure, obtaining the information of the at least one combat unit for the current scenario based on the simulation information includes: acquiring a basic situation information set in a standard format from the simulation information, the standard format being consistent with the input format of the first agent; performing evaluation analysis based on the basic situation information set to determine a combat plan corresponding to the at least one combat unit of the current scenario; and obtaining the information of the at least one combat unit for the current scenario based on the combat plan and the basic situation information set. According to this embodiment, accurate information of the at least one combat unit can be obtained quickly. In the case where the first agent is a reinforcement learning agent, the information of the combat unit may include the combat plan and basic situation information corresponding to the combat unit; in the case where the first agent is a knowledge-rule agent, the information of the combat unit may include the combat strategy and basic situation information corresponding to the combat unit. The acquisition of the basic situation information, the combat plan and the combat strategy is discussed below and is not repeated here. It should be noted that the present disclosure does not limit the information of the combat unit, which may be any information that can be input into an agent to obtain a corresponding task.
Specifically, after the situation information sent by the simulator is received, it can be passed to the RL adapter. A data adaptation module in the RL adapter first converts the situation information into a basic situation information set in the standard format, and the basic situation information is then processed to obtain a comprehensive assessment of the battlefield situation. The flow of situation-information processing by the RL adapter is shown in fig. 3. The RL adapter mainly performs situation analysis, which involves a very large number of factors: the battlefield environment is distinguished into different types such as sky, sea surface and islands; the armament includes fighters, bombers, early-warning aircraft, jammers, frigates, ground air-defense vehicles, ground radars and other types; tactically, actions are divided into patrol, escort, target assault, area assault and so on. At the same time, much of the opponent's information on the battlefield is unknown and the opponent's actions are difficult to predict, so the complexity of situation analysis grows exponentially as the battlefield situation changes. The situation analysis process of this embodiment can be divided into two parts, information fusion and evaluation analysis (a simplified code sketch follows these two items):
1) Information fusion: the features of situation elements are extracted from the situation information. A series of original situation features, such as environmental features, combat-unit types and formation states (patrol, assault, escort, etc.), can be extracted from all information about the battlefield environment and the combat units, reorganized into a specific data structure according to the dimensions required by the evaluation analysis, and cached. For features that need to be estimated (for example, whether the enemy still has a usable bomber), several features can be cross-checked against each other to obtain a prediction result, i.e., the estimated features. The original situation features and the estimated features are fused to generate a situation feature vector.
2) Evaluation analysis: during simulation deduction, the current task situation is analyzed first; then, according to the situation feature vector generated by information fusion and in combination with relevant military rules (that is, taking into account all influencing factors of the battlefield environment, the comparison of the two sides' forces, and battlefield emergencies), the current situation is analyzed, the enemy's deployment, possible military actions and their reasons are assessed, and the enemy's battlefield intentions (encirclement, outflanking, frontal assault, etc.) are considered. The result of the evaluation analysis is the friendly combat plan, i.e., the combat plan corresponding to the at least one combat unit in the above embodiment.
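The sketch below illustrates the two stages above in simplified form; the feature names, the bomber-availability estimation rule, and the stubbed plan output are assumptions for illustration only.

```python
def fuse_situation(raw_info: dict) -> dict:
    """Information fusion: original features + estimated features -> situation feature vector."""
    original = {
        "environment": raw_info.get("terrain_types", []),          # sky / sea surface / islands ...
        "unit_types": [u["type"] for u in raw_info.get("units", [])],
        "formation_states": [u.get("state") for u in raw_info.get("units", [])],
    }
    # estimated feature: cross-check several cues to predict whether the enemy
    # still has usable bombers (this particular rule is an illustrative assumption)
    estimated = {
        "enemy_bombers_available": raw_info.get("enemy_bombers_seen", 0) > 0
                                   or raw_info.get("enemy_airfield_active", False)
    }
    return {**original, **estimated}

def evaluate(situation_vector: dict) -> dict:
    """Evaluation analysis: combine fused features with military rules to produce a combat plan."""
    plan = {"intent": "defend_command_posts"}          # stubbed friendly combat plan
    if situation_vector.get("enemy_bombers_available"):
        plan["priority"] = "air_defense_of_command_posts"
    return plan
```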
In step S103, for each of the at least one combat unit, the following operations are performed: inputting the information for the combat unit into the first agent corresponding to the combat unit to obtain at least one first task; and inputting the at least one first task into the second agent corresponding to each first task to obtain at least one second task corresponding to each first task, wherein the at least one second task is a plurality of tasks for accomplishing the corresponding first task.
For example, fig. 4 illustrates an architecture diagram of the mixed-decision mode of reinforcement learning agents and knowledge-rule agents according to an exemplary embodiment of the present disclosure. Viewed vertically, both reinforcement learning agents (i.e., "reinforcement learning" in fig. 4) and knowledge-rule agents (i.e., the "knowledge rule base" in fig. 4) may exist in each decision layer (the task layer and the execution layer). Viewed horizontally, the system decisions are divided into team-level agents (the task layer, i.e., team-level decision makers, virtual commanders each of which can command a plurality of squad-level agents) and squad-level agents (the execution layer, i.e., squad-level decision makers, the smallest commanders), which are respectively responsible for selecting and implementing a tactic (i.e., the first task). For example, a team-level agent can determine the tactic to be adopted and send the determined tactic to a squad-level agent as a task, and the squad-level agent divides the tactic into specific, executable tasks and sends them to the simulator.
According to an exemplary embodiment of the present disclosure, the first agent is a reinforcement learning agent in the case where the combat unit meets a preset condition, and a knowledge-rule agent in the case where the combat unit does not meet the preset condition. For example, the preset condition may be that control of the combat unit is complex, and the criterion for judging control complexity may be set differently according to the scenario. According to this embodiment, whether an agent is a reinforcement learning agent or a knowledge-rule agent can be determined by whether control of the combat unit is complex or simple; if control of the combat unit is simple, a knowledge-rule agent can be used directly without training. It should be noted that the second agent may likewise be a reinforcement learning agent or a knowledge-rule agent, and which one is adopted may also depend on the control complexity of the corresponding combat unit; this is not discussed further here.
For example, for some combat units whose control is complex, a team-level agent corresponding to the combat unit is obtained through reinforcement learning training and outputs combat tasks (which may also be understood as tactics or macro tasks) to a squad-level agent built on the knowledge rule base for execution; the squad-level agent divides the tactic or macro task into specific executable tasks and sends them to the simulator. For some combat units that are simple to control or difficult to model with reinforcement learning, both the team-level agent and the squad-level agent can be constructed on the basis of the knowledge rule base, and their functions are similar to those of the agents responsible for complex combat units: the team-level agent outputs combat tasks, and the squad-level agent divides the tactic or macro task into specific executable tasks and sends them to the simulator. In this way, a relatively complete and intelligent AI decision system is obtained as a whole.
According to an exemplary embodiment of the present disclosure, in the case where the first agent is a reinforcement learning agent, the first agent is trained as follows: while the number of engagements completed by the simulator in the current scenario does not exceed a preset threshold, the following loop is executed: receiving simulation information sent by the simulator in the current scenario; obtaining, based on the simulation information, information of at least one combat unit for the current scenario; for each of the at least one combat unit, performing the following operations: inputting the information for the combat unit into the first agent corresponding to the combat unit to obtain at least one estimated first task, and inputting the at least one estimated first task into the respectively corresponding second agents to obtain at least one estimated second task corresponding to each estimated first task, wherein the at least one estimated second task is a plurality of tasks for accomplishing the corresponding estimated first task; returning all the estimated second tasks to the simulator so that the simulator completes one engagement by executing all the estimated second tasks; and adjusting the first agent based on a preset reward function and an intermediate reward value obtained from the simulation information. When the number of engagements completed by the simulator in the current scenario exceeds the preset threshold, training of the first agent ends. According to this embodiment, the first agent and/or the second agent can be trained conveniently and quickly.
Fig. 5 illustrates a system architecture diagram for training a reinforcement learning agent according to an exemplary embodiment of the present disclosure. As shown in fig. 5, state refers to the raw state information returned by the simulator (i.e., the simulation information or battlefield real-time situation information described above); obs refers to the observation information of the reinforcement learning agent (i.e., the information of the at least one combat unit for the current scenario described above); reward refers to the reward value obtained by the reinforcement learning agent (i.e., the intermediate reward value described above); action_ai refers to the team-level tactical task output by the reinforcement learning agent; the knowledge-rule agent (i.e., the knowledge rule in fig. 5) converts the team-level tactical task into underlying instructions, where action represents the underlying instructions corresponding to the tactical task output by the reinforcement learning agent; the rule engine refers to the module that outputs instructions based on knowledge rules and contains the first agent and the second agent when both are knowledge-rule agents (in that case the first agent and the second agent may be merged into one agent, i.e., one knowledge rule base is adopted); action_rule represents the instructions output by the rule engine; and the instruction set receives all instructions and sends them to the simulator. It should be noted that the training process may be performed by a server, and the present disclosure is not limited in this regard.
For example, the following describes the process of obtaining specific instructions (i.e., the at least one second task) taking the fighter combat units as an example. In the reinforcement-learning-based fighter strategy, the fighter combat units are divided into several formations, each formation acting as one reinforcement learning agent; the upper-layer tactical task is output in a vertically combined manner and is converted into specific instructions through the knowledge rule base. The observation space of the reinforcement learning agent corresponding to each formation is defined as [MeanPos_team, FDI_enemy, DADI_enemy, FDI_ally, DADI_ally], where MeanPos_team denotes the average coordinate position of the friendly formation, FDI_enemy denotes the enemy force distribution information, DADI_enemy denotes the enemy threat-zone information, FDI_ally denotes the friendly force distribution information, and DADI_ally denotes the friendly threat-zone distribution information. The action space of each reinforcement learning agent is defined as [D_north, D_south, D_middle, D_northship, D_southship, A_north, A_south, donothing], where D_north denotes a north-defense tactical task, D_south a south-defense tactical task, D_middle a middle-defense tactical task, D_northship an escort-north-destroyer tactical task, D_southship an escort-south-destroyer tactical task, A_north an ambush-north tactical task, A_south an ambush-south tactical task, and donothing a hold (no-action) tactical task. The reward function (i.e., the preset reward function described above) may be designed as:
Reward = R_blue(obs) + R_red(obs) + R_intrinsic
where R_blue = Air_num_blue × 1, with Air_num_blue denoting the number of remaining friendly fighters; R_red is computed from the numbers of remaining enemy fighters, remaining enemy bombers, remaining enemy early-warning aircraft, and remaining enemy jammers (Jam_red); and R_intrinsic denotes the intrinsic reward for winning or losing the engagement.
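A concrete illustration of this reward is sketched below. R_blue rewards surviving friendly fighters and R_red penalizes surviving enemy units; the weights and the win/loss bonus are placeholder assumptions, since the exact coefficients are not recoverable from the text above.

```python
def formation_reward(obs, win=None, w_enemy=(1.0, 1.0, 1.0, 1.0), win_bonus=10.0):
    """Sketch of Reward = R_blue(obs) + R_red(obs) + R_intrinsic (weights assumed)."""
    r_blue = obs["air_num_blue"] * 1.0                    # remaining friendly fighters
    r_red = -(w_enemy[0] * obs["air_num_red"]             # remaining enemy fighters
              + w_enemy[1] * obs["bomber_num_red"]        # remaining enemy bombers
              + w_enemy[2] * obs["awacs_num_red"]         # remaining enemy early-warning aircraft
              + w_enemy[3] * obs["jam_red"])              # remaining enemy jammers
    r_intrinsic = 0.0 if win is None else (win_bonus if win else -win_bonus)  # win/loss reward
    return r_blue + r_red + r_intrinsic
```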
For another example, the following describes the process of obtaining specific instructions (i.e., the at least one second task) taking the early-warning-aircraft combat unit as an example. The early-warning aircraft serves as an information unit and acts independently of the other combat units; because its action space is small, a reinforcement learning model can be used directly to output its actions, i.e., a reinforcement learning agent outputs the upper-layer tactical task, which is converted into specific instructions through the knowledge rule base. The observation space is defined as [Pos, Pos_relative, Dis_enemy, Dis_target], where Pos denotes the coordinate position of the early-warning aircraft, Pos_relative denotes its position relative to enemy threat units, Dis_enemy denotes the distance between the early-warning aircraft and enemy threat units, and Dis_target denotes the distance between the early-warning aircraft and the target point. The action space is defined as [D_north, D_south, D_middle, D_northship, D_southship, A_north, A_south, donothing], with the same task meanings as above. The reward function may be designed as a function of dis, the distance between the early-warning aircraft and the target point, dis_max, the maximum distance between the early-warning aircraft and the target point, and R_intrinsic, the intrinsic reward granted after winning the engagement.
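A minimal sketch of a distance-shaped reward of this kind is given below; the normalized-distance form and the bonus value are assumptions, since only the quantities dis, dis_max and R_intrinsic are specified above.

```python
def awacs_reward(dis: float, dis_max: float, win=None, win_bonus=10.0) -> float:
    # Assumed shaping: the reward grows as the early-warning aircraft approaches
    # the target point; the normalization and bonus value are illustrative choices.
    shaping = 1.0 - min(dis, dis_max) / dis_max
    r_intrinsic = win_bonus if win else 0.0          # intrinsic reward only after winning
    return shaping + r_intrinsic
```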
Other combat units, such as the destroyer unit, can be controlled directly using agents based on knowledge rules, i.e., by knowledge-rule agents, because their operation is simple and their flexibility is low. The destroyer strategy may use a priority-based strike policy: functional units such as enemy early-warning aircraft and jammers are struck first, ground-attack units such as enemy bombers are struck next, and air-combat units such as enemy fighters are struck last. The bombers use a strategy of bombing enemy airfields to paralyze the enemy's combat capability and gain time.
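The priority-based strike policy described above can be expressed as a very small knowledge-rule agent that orders detected enemy units by category; the category names and the function below are illustrative assumptions.

```python
PRIORITY = {
    "early_warning_aircraft": 0,   # functional/support units first
    "jammer": 0,
    "bomber": 1,                   # ground-attack units next
    "fighter": 2,                  # air-combat units last
}

def select_destroyer_targets(detected_enemies):
    """Order enemy units so that higher-priority targets are engaged first.

    `detected_enemies` is assumed to be a list of dicts with a 'type' key.
    """
    return sorted(detected_enemies, key=lambda u: PRIORITY.get(u["type"], 3))
```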
After the specific instructions corresponding to the combat units are obtained, they are stored together in an instruction set, and the combat instructions in the instruction set are sent to the combat simulation system, which runs the simulation and updates the real-time situation and the combat result. The process of acquiring specific instructions is repeated until the current round of combat ends, at which point training of the agent for this round is complete and the next round of training is started if necessary. After the combat scenario is preset, the simulation rate is set; at the same time, the number of engagements can be monitored, and if the number of engagements reaches the threshold, the deduction, i.e., the training, ends; otherwise a new battlefield real-time situation is generated.
Fig. 6 illustrates a training framework diagram of a reinforcement learning agent according to an exemplary embodiment of the present disclosure, which, as shown in fig. 6, is mainly composed of the following four parts:
1. CPU interaction node (Actor): the team-level agent is responsible for interaction with the wargame simulation environment and generates combat data. The team-level agent makes decisions to obtain a combat plan, the combat plan is executed by the squad-level agents, and the specific combat instructions for the combat plan are returned to the wargame simulation environment. The deduction environment (i.e., the simulation environment) simulates the effects of both sides' actions (the adversary agent is a rule agent configured in advance) and produces a new situation and reward, which the team-level agent receives before outputting the next macro task; this cycle repeats, and at the end of each deduction the environment informs the agent of the win-or-lose result of that deduction.
2. Data manager (Data Management): responsible for the collection, storage and transmission of data. It collects the large amount of interaction data generated by the CPU interaction nodes, processes it, and sends it to the GPU training node.
3. GPU training node (Learner): responsible for training the upper-level agent (updating the neural network parameters). After receiving the data sent by the data manager, it computes the neural network loss function, performs gradient back-propagation and updates the neural network; after the network has been updated, a new copy of the upper-level agent is sent to each CPU interaction node.
4. Controller: responsible for controlling the whole training process. (A sketch of how these four parts interact follows this list.)
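The sketch below shows how the four parts might interact; the class names mirror fig. 6, but the queue-based hand-off (e.g., queue.Queue objects passed in as data_q and batch_q) and the method signatures are assumptions for illustration only.

```python
class Actor:
    """CPU interaction node: interacts with the wargame environment and generates combat data."""
    def __init__(self, env, agent, data_q):
        self.env, self.agent, self.data_q = env, agent, data_q

    def run_episode(self):
        obs, done, trajectory = self.env.reset(), False, []
        while not done:
            macro_task = self.agent.act(obs)                  # team-level agent decides
            obs, reward, done, _ = self.env.step(macro_task)  # squad-level rules execute it
            trajectory.append((obs, macro_task, reward))
        self.data_q.put(trajectory)                           # hand combat data to the manager

class DataManager:
    """Collects interaction data from the actors and forwards processed batches."""
    def __init__(self, data_q, batch_q):
        self.data_q, self.batch_q = data_q, batch_q

    def pump(self):
        self.batch_q.put(self.data_q.get())                   # processing omitted in this sketch

class Learner:
    """GPU training node: updates the neural network and broadcasts new agent copies."""
    def __init__(self, agent, batch_q, actors):
        self.agent, self.batch_q, self.actors = agent, batch_q, actors

    def train_step(self):
        batch = self.batch_q.get()
        self.agent.update(batch)                              # loss + gradient back-propagation
        for actor in self.actors:
            actor.agent = self.agent.copy()                   # send new agent copy to each actor

class Controller:
    """Controls the whole training process."""
    def __init__(self, actors, manager, learner, steps):
        self.actors, self.manager, self.learner, self.steps = actors, manager, learner, steps

    def run(self):
        for _ in range(self.steps):
            for actor in self.actors:
                actor.run_episode()
            self.manager.pump()
            self.learner.train_step()
```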
According to an exemplary embodiment of the present disclosure, in the case where the first agent is a knowledge-rule agent, inputting the information for the combat unit into the first agent corresponding to the combat unit to obtain the at least one first task includes: determining the combat strategy corresponding to the combat unit based on the combat plan in the information of the combat unit, and inputting the combat strategy corresponding to the combat unit and the basic situation information for the combat unit in the basic situation information set into the first agent corresponding to the combat unit to obtain the at least one first task, wherein the first agent contains task information for implementing the combat strategy. According to this embodiment, the knowledge-rule agent does not need training; it only determines the execution logic, i.e., the corresponding first task, for the corresponding combat strategy.
For example, in the combat scenario shown in fig. 2, the friendly side wants to protect 2 command posts. The combat plan may be to send early-warning aircraft to a predetermined reconnaissance point at time T1 to perform a reconnaissance task for early warning, and to send bombers to a predetermined point at time T2 to perform an air-strike task. The combat strategies may be a reconnaissance strategy and a bomber strategy. For the bomber strategy, the macro task (i.e., the first task) may be to dispatch 10 bombers to a predetermined location to carry out the air-strike task, and the atomic tasks (i.e., the second tasks) may be bomber take-off, the bombers bombing the airfield at the predetermined location, bomber return, and so on.
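To make the macro-to-atomic decomposition concrete, the sketch below shows a knowledge-rule second agent expanding the bomber macro task from the example above into atomic tasks; the task dictionary layout and helper name are assumptions for illustration.

```python
def decompose_bomber_macro_task(macro_task):
    """Knowledge-rule second agent: expand a bomber macro task into atomic tasks.

    `macro_task` is assumed to carry the bomber count and target location,
    e.g. {"type": "air_strike", "bombers": 10, "target": "preset_point"}.
    """
    atomic_tasks = []
    for i in range(macro_task["bombers"]):
        atomic_tasks += [
            {"unit": f"bomber_{i}", "action": "take_off"},
            {"unit": f"bomber_{i}", "action": "bomb_airfield", "target": macro_task["target"]},
            {"unit": f"bomber_{i}", "action": "return_to_base"},
        ]
    return atomic_tasks
```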
According to an exemplary embodiment of the present disclosure, in the case where the second agent is a knowledge-rule agent, the second agent contains task information for implementing the first task corresponding to the at least one second task. According to this embodiment, the knowledge-rule agent does not need training; it only determines the execution logic, i.e., the corresponding second tasks, for the corresponding first task.
It should be noted that, when the first agent and the second agent are both knowledge rule agents, the first agent and the second agent may be combined into one agent, that is, one knowledge rule base may be adopted.
According to an exemplary embodiment of the present disclosure, in the case where the first agent is a reinforcement learning agent, inputting the information for the combat unit into the first agent corresponding to the combat unit to obtain the at least one first task includes: inputting the combat plan corresponding to the combat unit and the basic situation information for the combat unit in the basic situation information set into the first agent corresponding to the combat unit to obtain the at least one first task. According to this embodiment, the reinforcement learning agent can also take the combat plan obtained from the evaluation analysis as input, which helps it obtain a better first task.
The following description takes the case where the first agent and the second agent are both knowledge-rule agents as an example.
After the information of a combat unit is obtained, the combat strategy corresponding to the combat unit may be determined based on the information of the combat unit. In general, the friendly side shown in fig. 2 can be divided into a fighter strategy, an early-warning-aircraft strategy, a frigate strategy, a ground-based missile guidance strategy and a bomber strategy. The fighter strategy is usually modeled by reinforcement learning because its control is relatively complex, while the other strategies are basically constructed on the basis of knowledge rules.
When the combat strategy is the early-warning-aircraft strategy, the early-warning aircraft turns on its sea-and-land detection mode and flies to the reconnaissance point to perform the reconnaissance task; when an incoming missile is encountered, or when the distance between the early-warning aircraft and an enemy fighter or destroyer reaches the safety distance, the evasion action is executed.
When the combat strategy is the frigate strategy, two frigates are deployed within the first two minutes of the scenario, and the strike range of the frigates is set according to the enemy's attack intention. If the scenario time exceeds 30 min while the frigate is to the left of the first preset point and the number of enemy fighters is zero, enemy fighters are not struck; if the number of frigates to the left of the second preset point equals 2, the strike on enemy fighters is cancelled; if the frigates are to the left of the first preset point on the X-axis, their number equals 2, and no missile-carrying fighter is within 145 km of them, the strike on enemy fighters is cancelled; if the enemy has launched a general attack, there is no enemy bomber and no enemy jammer in the situation, and the amount of frigate ammunition in the direction from which the general attack starts is greater than 6, the frigates strike the enemy fighters.
When the combat strategy is the ground-based missile guidance strategy, the ground-based guidance radar is switched on when the scenario starts. If the amount of ground-based guidance ammunition is greater than the expected value, the damage level of the ground-based guidance unit is below 100%, the southern/northern frigate exists with a damage level below 66%, and the amount of frigate ammunition is greater than the default value, then the southern/northern ground-based guidance unit strikes only bombers.
When the combat strategy is the bomber strategy, the bombers stay at the airfield by default at the start of the scenario and take off at the beginning. If the number of enemy frigates west of the first preset battlefield line equals 2 and the bombers are outside the enemy frigates' attack range while still carrying missiles, the missile-baiting strategy is executed; if the number of enemy frigates west of the first preset battlefield line equals 2 and the airfield-bombing conditions are met, the bombers bomb the enemy airfield; if the enemy attacks the south command post first, the friendly bombers return to base; if the enemy strikes the north command post, or strikes both command posts simultaneously, the strategy of striking the enemy frigates is executed; if the enemy destroyers have all been struck and an enemy jammer still exists, the bombers fly to the point where the jammer is located and conduct an area patrol; if the enemy destroyers have all been struck and no enemy jammer exists, the missile-baiting strategy is executed.
In step S104, all the second tasks are returned to the simulator so that the simulator completes one engagement by executing all the second tasks.
In summary, in the hierarchical reinforcement learning of the present disclosure, the upper-layer team-level agent outputs macro tasks, the lower-layer squad-level agent decomposes the macro tasks into atomic tasks based on expert knowledge, and the tasks are sent to the simulator for execution, which is more interpretable than the end-to-end reinforcement learning method. Training is performed with deep reinforcement learning and multiple agents work cooperatively, so that competition and cooperation among the friendly agents under constraint conditions can be reflected; compared with the behavior-tree-based expert-rule deduction method, traditional machine learning methods, heuristic algorithms and the like, intelligence and accuracy are improved.
Fig. 7 shows a block diagram of a combat device according to an exemplary embodiment of the present disclosure. As shown in fig. 7, the combat device includes: a simulation information receiving unit 70, an information acquisition unit 72, a task acquisition unit 74, and a task transmitting unit 76.
The simulation information receiving unit 70 is configured to receive simulation information sent by a simulator in a current scenario; the information acquisition unit 72 is configured to obtain, based on the simulation information, information of at least one combat unit for the current scenario; the task acquisition unit 74 is configured to perform, for each of the at least one combat unit, the following operations: inputting the information for the combat unit into a first agent corresponding to the combat unit to obtain at least one first task, and inputting the at least one first task into the respectively corresponding second agents to obtain at least one second task corresponding to each first task, wherein the at least one second task is a plurality of tasks for accomplishing the corresponding first task; and the task transmitting unit 76 is configured to return all the second tasks to the simulator so that the simulator completes one engagement by executing all the second tasks.
According to an exemplary embodiment of the present disclosure, the first agent is a reinforcement learning agent in the case where the combat unit meets a preset condition, and a knowledge-rule agent in the case where the combat unit does not meet the preset condition.
According to an exemplary embodiment of the present disclosure, in the case where the first agent is a reinforcement learning agent, the first agent is trained as follows: while the number of engagements completed by the simulator in the current scenario does not exceed a preset threshold, the following loop is executed: receiving simulation information sent by the simulator in the current scenario; obtaining, based on the simulation information, information of at least one combat unit for the current scenario; for each of the at least one combat unit, performing the following operations: inputting the information for the combat unit into the first agent corresponding to the combat unit to obtain at least one estimated first task, and inputting the at least one estimated first task into the respectively corresponding second agents to obtain at least one estimated second task corresponding to each estimated first task, wherein the at least one estimated second task is a plurality of tasks for accomplishing the corresponding estimated first task; returning all the estimated second tasks to the simulator so that the simulator completes one engagement by executing all the estimated second tasks; and adjusting the first agent based on a preset reward function and an intermediate reward value obtained from the simulation information. When the number of engagements completed by the simulator in the current scenario exceeds the preset threshold, training of the first agent ends.
According to an exemplary embodiment of the present disclosure, the information acquisition unit 72 is further configured to acquire a basic situation information set in a standard format from the simulation information, the standard format being consistent with the input format of the first agent; perform evaluation analysis based on the basic situation information set to determine a combat plan corresponding to the at least one combat unit of the current scenario; and obtain the information of the at least one combat unit for the current scenario based on the combat plan and the basic situation information set.
According to an exemplary embodiment of the present disclosure, in the case where the first agent is a knowledge-rule agent, the task acquisition unit 74 is further configured to determine a combat strategy corresponding to the combat unit based on the combat plan in the information of the combat unit, and to input the combat strategy corresponding to the combat unit and the basic situation information for the combat unit in the basic situation information set into the first agent corresponding to the combat unit to obtain the at least one first task, wherein the first agent contains first task information for implementing the combat strategy.
According to an exemplary embodiment of the present disclosure, in the case where the second agent is a knowledge-rule agent, the second agent contains second task information for implementing the corresponding first task.
According to an exemplary embodiment of the present disclosure, in the case where the first agent is a reinforcement learning agent, the task acquisition unit 74 is further configured to input the combat plan corresponding to the combat unit and the basic situation information for the combat unit in the basic situation information set into the first agent corresponding to the combat unit to obtain the at least one first task.
The combat method and device according to exemplary embodiments of the present disclosure have been described above with reference to fig. 1 to 7.
The various units of the combat device shown in fig. 7 may be configured as software, hardware, firmware, or any combination thereof that performs certain functions. For example, each unit may correspond to an application-specific integrated circuit, to pure software code, or to a module combining software with hardware. Furthermore, one or more functions implemented by the respective units may also be performed uniformly by components in a physical entity device (e.g., a processor, a client, or a server).
Further, the combat method described with reference to fig. 1 may be implemented by a program (or instructions) recorded on a computer-readable storage medium. For example, according to an exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may be provided, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the combat method according to the present disclosure.
The computer program in the above computer-readable storage medium may run in an environment deployed on computer devices such as clients, hosts, proxy devices and servers. It should be noted that the computer program may also be used to perform additional steps beyond those described above, or to perform more specific processing when the above steps are performed; the content of these additional steps and further processing has been mentioned in the description of the related method with reference to fig. 1 and will not be repeated here.
It should be noted that each unit in the combat device according to the exemplary embodiments of the present disclosure may rely entirely on the execution of a computer program to implement its corresponding function, i.e., each unit corresponds to a step in the functional architecture of the computer program, so that the entire system is invoked through a special software package (e.g., a lib library) to implement the corresponding function.
On the other hand, the respective units shown in fig. 7 may also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium, such as a storage medium, so that the processor can perform the corresponding operations by reading and executing the corresponding program code or code segments.
For example, exemplary embodiments of the present disclosure may also be implemented as a computing device including a storage component and a processor, the storage component having stored therein a set of computer-executable instructions that, when executed by the processor, perform the combat method according to exemplary embodiments of the present disclosure.
In particular, the computing devices may be deployed in servers or clients, as well as on node devices in a distributed network environment. Further, the computing device may be a PC computer, tablet device, personal digital assistant, smart phone, web application, or other device capable of executing the above set of instructions.
Here, the computing device is not necessarily a single computing device, but may be any device or aggregate of circuits capable of executing the above-described instructions (or instruction set), alone or in combination. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In a computing device, the processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
Some of the operations described in the fight method according to the exemplary embodiment of the present disclosure may be implemented in software, some of the operations may be implemented in hardware, and furthermore, the operations may be implemented in a combination of software and hardware.
The processor may execute instructions or code stored in the storage component, which may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The storage component may be integrated with the processor, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage component may comprise a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled, or may communicate with each other, for example, through an I/O port, a network connection, or the like, so that the processor can read files stored in the storage component.
In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via buses and/or networks.
The fight method according to exemplary embodiments of the present disclosure may be described in terms of various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logic device or operated according to imprecise boundaries.
Thus, the fight method described with reference to fig. 1 may be implemented by a system comprising at least one computing device and at least one storage device storing instructions.
According to an exemplary embodiment of the present disclosure, the at least one computing device is a computing device for performing the fight method according to an exemplary embodiment of the present disclosure, and the storage device has stored therein a set of computer-executable instructions that, when executed by the at least one computing device, perform the fight method described with reference to fig. 1.
The foregoing description of exemplary embodiments of the present disclosure is to be understood as merely illustrative and not exhaustive, and the present disclosure is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. Accordingly, the scope of the present disclosure should be determined by the scope of the claims.

Claims (10)

1. A fight method, wherein the fight method comprises:
receiving simulation information sent by a simulator in a current scene;
based on the simulation information, obtaining information of at least one combat unit for the current scene;
for each of the at least one combat unit, performing the following operations: inputting the information for the combat unit to a first agent corresponding to the combat unit to obtain at least one first task; and inputting the at least one first task to the respectively corresponding second agents to obtain at least one second task corresponding to each first task, wherein the at least one second task is a plurality of tasks for realizing the corresponding first task;
and returning all the second tasks to the simulator so that the simulator can complete one combat by executing all the second tasks.
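To make the hierarchical task decomposition of claim 1 concrete, the following is a minimal Python sketch of the claimed flow. The `FirstAgent`, `SecondAgent`, and `fight_once` interfaces are assumptions for illustration only and do not appear in the disclosure; the actual agents may be reinforcement learning models or knowledge rule engines as described in the later claims.

```python
# Minimal sketch of the fight method of claim 1 (assumed, simplified interfaces).
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FirstAgent:
    """Hypothetical first agent: maps combat-unit information to first tasks."""
    name: str

    def infer(self, unit_info: Dict) -> List[str]:
        # Placeholder policy: one first task per unit, derived from its plan.
        return [f"{self.name}:{unit_info.get('plan', 'hold')}"]


@dataclass
class SecondAgent:
    """Hypothetical second agent: decomposes a first task into executable second tasks."""
    def decompose(self, first_task: str) -> List[str]:
        # Placeholder decomposition: split a first task into two concrete sub-tasks.
        return [f"{first_task}/move", f"{first_task}/engage"]


def fight_once(simulation_info: Dict,
               first_agents: Dict[str, FirstAgent],
               second_agent: SecondAgent) -> List[str]:
    """One pass of the claimed method: unit info -> first tasks -> second tasks."""
    unit_infos = simulation_info["units"]          # assumed structure of the scene info
    all_second_tasks: List[str] = []
    for unit_id, unit_info in unit_infos.items():
        # Input the unit's information to its corresponding first agent.
        first_tasks = first_agents[unit_id].infer(unit_info)
        # Input each first task to the corresponding second agent.
        for task in first_tasks:
            all_second_tasks.extend(second_agent.decompose(task))
    # All second tasks would be returned to the simulator for execution.
    return all_second_tasks


if __name__ == "__main__":
    sim_info = {"units": {"u1": {"plan": "strike"}, "u2": {"plan": "escort"}}}
    agents = {u: FirstAgent(name=u) for u in sim_info["units"]}
    print(fight_once(sim_info, agents, SecondAgent()))
```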
2. The fight method according to claim 1, wherein,
in the case where the combat unit meets a preset condition, the first agent is a reinforcement learning agent;
and in the case where the combat unit does not meet the preset condition, the first agent is a knowledge rule agent.
3. The fight method according to claim 2, wherein, in the case where the first agent is a reinforcement learning agent, the first agent is trained by:
executing the following loop while the number of times the simulator has completed a combat in the current scene does not exceed a preset threshold:
receiving simulation information sent by a simulator in the current scene;
based on the simulation information, obtaining information of at least one combat unit for the current scene;
for each of the at least one combat unit, performing the following operations: inputting the information for the combat unit to the first agent corresponding to the combat unit to obtain at least one estimated first task; and inputting the at least one estimated first task to the respectively corresponding second agents to obtain at least one estimated second task corresponding to each estimated first task, wherein the at least one estimated second task is a plurality of tasks for realizing the corresponding estimated first task;
returning all the estimated second tasks to the simulator so that the simulator can complete one combat by executing all the estimated second tasks;
adjusting the first agent based on a preset reward function and an intermediate reward value obtained from the simulation information;
and ending the training of the first agent when the number of times the simulator has completed a combat in the current scene exceeds the preset threshold.
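As a rough illustration of the training loop in claim 3, the sketch below assumes a hypothetical `RLFirstAgent` with an `update` method, a stand-in reward function, and a fake simulator signal; the actual reward function, learning algorithm, and simulator interface are not specified by the claims.

```python
# Rough sketch of the first-agent training loop of claim 3 (assumed interfaces).
import random
from typing import Dict, List


class RLFirstAgent:
    """Hypothetical reinforcement learning first agent with a trivial scalar update."""

    def __init__(self) -> None:
        self.value = 0.0  # stand-in for the agent's learnable parameters

    def infer(self, unit_info: Dict) -> List[str]:
        return ["advance" if self.value > 0 else "hold"]

    def update(self, reward: float) -> None:
        # Placeholder adjustment driven by the preset reward function's output.
        self.value += 0.1 * (reward - self.value)


def intermediate_reward(simulation_info: Dict) -> float:
    """Preset reward function (assumed): derives a scalar from the simulation info."""
    return simulation_info.get("score", 0.0)


def train(agent: RLFirstAgent, max_fights: int = 100) -> None:
    fights_completed = 0
    while fights_completed <= max_fights:            # preset threshold on fight count
        sim_info = {"score": random.uniform(-1, 1)}  # stand-in for simulator feedback
        estimated_first_tasks = agent.infer({"plan": "strike"})
        estimated_second_tasks = [f"{t}/step" for t in estimated_first_tasks]
        # The simulator would execute estimated_second_tasks and complete one combat here.
        agent.update(intermediate_reward(sim_info))
        fights_completed += 1
    # Training ends once the number of completed fights exceeds the threshold.


if __name__ == "__main__":
    a = RLFirstAgent()
    train(a, max_fights=10)
    print(round(a.value, 3))
```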
4. The fight method according to claim 1 or 2, wherein the obtaining of information of at least one combat unit for the current scene based on the simulation information comprises:
acquiring a basic situation information set in a standard format from the simulation information, wherein the standard format is consistent with the input format of the first agent;
performing evaluation analysis based on the basic situation information set, and determining a combat plan corresponding to at least one combat unit of the current scene;
and obtaining the information of at least one combat unit for the current scene based on the combat plan and the basic situation information set.
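The steps of claim 4 amount to normalizing raw simulator output into the first agent's input format and attaching a combat plan to each unit. The sketch below uses assumed field names ('id', 'pos', 'hp') that do not appear in the claims.

```python
# Sketch of claim 4: simulation info -> standard-format situation set -> per-unit info.
from typing import Dict, List


def to_basic_situation_set(simulation_info: Dict) -> List[Dict]:
    """Convert raw simulation info into a standard format matching the first agent's
    input. Field names are assumptions made only for illustration."""
    return [{"id": u["id"], "pos": u["pos"], "hp": u["hp"]}
            for u in simulation_info["raw_units"]]


def evaluate_and_plan(situation_set: List[Dict]) -> Dict[str, str]:
    """Toy evaluation/analysis: pick a combat plan per unit from its situation."""
    return {s["id"]: ("attack" if s["hp"] > 50 else "withdraw") for s in situation_set}


def unit_information(simulation_info: Dict) -> Dict[str, Dict]:
    """Combine the combat plan and the basic situation set into per-unit information."""
    situation_set = to_basic_situation_set(simulation_info)
    plans = evaluate_and_plan(situation_set)
    return {s["id"]: {"situation": s, "plan": plans[s["id"]]} for s in situation_set}


if __name__ == "__main__":
    raw = {"raw_units": [{"id": "u1", "pos": (0, 0), "hp": 80},
                         {"id": "u2", "pos": (3, 4), "hp": 20}]}
    print(unit_information(raw))
```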
5. The fight method according to claim 4, wherein, in the case where the first agent is a knowledge rule agent, the inputting of the information for the combat unit to the first agent corresponding to the combat unit to obtain at least one first task comprises:
determining a combat strategy corresponding to the combat unit based on the combat plan in the information for the combat unit;
and inputting the combat strategy corresponding to the combat unit and the basic situation information for the combat unit in the basic situation information set to the first agent corresponding to the combat unit to obtain the at least one first task, wherein the first agent includes first task information for realizing the combat strategy.
6. The fight method according to claim 5, wherein, in the case where the second agent is a knowledge rule agent, the second agent includes second task information for realizing the corresponding first task.
7. The fight method according to claim 4, wherein, in the case where the first agent is a reinforcement learning agent, the inputting of the information for the combat unit to the first agent corresponding to the combat unit to obtain at least one first task comprises:
inputting the combat plan corresponding to the combat unit and the basic situation information for the combat unit in the basic situation information set to the first agent corresponding to the combat unit to obtain the at least one first task.
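Claims 5 and 7 differ only in what is fed to the first agent: a combat strategy derived from the plan for a knowledge rule agent, versus the combat plan itself for a reinforcement learning agent. A hedged sketch of that branching, with hypothetical agent classes and a toy strategy mapping, might look like:

```python
# Sketch of the first-agent input construction in claims 5 and 7 (assumed classes).
from typing import Dict, List


class KnowledgeRuleFirstAgent:
    """Hypothetical rule agent holding first task information per combat strategy."""
    RULES = {"attack": ["seize_air_superiority", "strike_target"],
             "withdraw": ["cover_retreat"]}

    def infer(self, strategy: str, situation: Dict) -> List[str]:
        return self.RULES.get(strategy, ["hold_position"])


class RLFirstAgent:
    """Hypothetical reinforcement learning agent consuming the combat plan directly."""
    def infer(self, plan: str, situation: Dict) -> List[str]:
        return [f"learned_task_for_{plan}"]


def first_tasks_for_unit(unit_info: Dict, agent) -> List[str]:
    situation = unit_info["situation"]
    if isinstance(agent, KnowledgeRuleFirstAgent):
        # Claim 5: derive a combat strategy from the plan, then feed strategy + situation.
        strategy = unit_info["plan"]          # toy mapping: plan used as the strategy
        return agent.infer(strategy, situation)
    # Claim 7: feed the combat plan and the situation directly to the RL agent.
    return agent.infer(unit_info["plan"], situation)


if __name__ == "__main__":
    info = {"plan": "attack", "situation": {"hp": 80}}
    print(first_tasks_for_unit(info, KnowledgeRuleFirstAgent()))
    print(first_tasks_for_unit(info, RLFirstAgent()))
```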
8. A fight device, wherein the device comprises:
a simulation information receiving unit configured to receive simulation information sent by a simulator in a current scene;
an information acquisition unit configured to obtain information of at least one combat unit for the current scene based on the simulation information;
a task acquisition unit configured to perform, for each of the at least one combat unit, the following operations: inputting the information for the combat unit to a first agent corresponding to the combat unit to obtain at least one first task; and inputting the at least one first task to the respectively corresponding second agents to obtain at least one second task corresponding to each first task, wherein the at least one second task is a plurality of tasks for realizing the corresponding first task;
and a task sending unit configured to return all the second tasks to the simulator so that the simulator can complete one combat by executing all the second tasks.
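Mirroring the four units of claim 8, a minimal object-oriented sketch (with assumed method names and a fake simulator, none taken from the disclosure) could be:

```python
# Sketch of the fight device of claim 8 as four cooperating units (assumed interfaces).
from typing import Callable, Dict, List


class SimulationInfoReceivingUnit:
    def receive(self, simulator) -> Dict:
        return simulator.pull_current_scene()        # assumed simulator API


class InformationAcquisitionUnit:
    def acquire(self, simulation_info: Dict) -> Dict[str, Dict]:
        return simulation_info["units"]              # assumed structure


class TaskAcquisitionUnit:
    def __init__(self, first_agents: Dict[str, Callable], second_agent: Callable):
        self.first_agents = first_agents
        self.second_agent = second_agent

    def acquire(self, unit_infos: Dict[str, Dict]) -> List[str]:
        second_tasks: List[str] = []
        for unit_id, info in unit_infos.items():
            for first_task in self.first_agents[unit_id](info):
                second_tasks.extend(self.second_agent(first_task))
        return second_tasks


class TaskSendingUnit:
    def send(self, simulator, second_tasks: List[str]) -> None:
        simulator.execute(second_tasks)              # assumed simulator API


class FakeSimulator:
    """Stand-in simulator used only to make the sketch runnable."""
    def pull_current_scene(self) -> Dict:
        return {"units": {"u1": {"plan": "strike"}}}

    def execute(self, tasks: List[str]) -> None:
        print("executing", tasks)


if __name__ == "__main__":
    sim = FakeSimulator()
    receiver, acquirer = SimulationInfoReceivingUnit(), InformationAcquisitionUnit()
    tasker = TaskAcquisitionUnit({"u1": lambda i: [i["plan"]]},
                                 lambda t: [f"{t}/move", f"{t}/engage"])
    sender = TaskSendingUnit()
    sender.send(sim, tasker.acquire(acquirer.acquire(receiver.receive(sim))))
```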
9. A computer-readable storage medium storing instructions which, when executed by at least one computing device, cause the at least one computing device to perform the fight method according to any one of claims 1 to 7.
10. A system comprising at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform the fight method according to any one of claims 1 to 7.
CN202111667926.2A 2021-12-31 2021-12-31 Fight method and device Pending CN116432895A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111667926.2A CN116432895A (en) 2021-12-31 2021-12-31 Fight method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111667926.2A CN116432895A (en) 2021-12-31 2021-12-31 Fight method and device

Publications (1)

Publication Number Publication Date
CN116432895A true CN116432895A (en) 2023-07-14

Family

ID=87091242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111667926.2A Pending CN116432895A (en) 2021-12-31 2021-12-31 Fight method and device

Country Status (1)

Country Link
CN (1) CN116432895A (en)

Similar Documents

Publication Publication Date Title
CN112668175B (en) Military simulation method and system based on dynamic situation driving
CN110119773B (en) Global situation assessment method, system and device of strategic gaming system
CN108364138B (en) Weapon equipment development planning modeling and solving method based on countermeasure visual angle
CN108489329B (en) Weapon equipment system analysis method based on killer chain
CN113791634A (en) Multi-aircraft air combat decision method based on multi-agent reinforcement learning
CN106952014A (en) A kind of battle plan optimization method based on Military Simulation device
CN113705102A (en) Deduction simulation system, method, equipment and storage medium for sea-air cluster confrontation
CN116661503B (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
Mour et al. Agent‐Based modeling for systems of systems
CN114638339A (en) Intelligent agent task allocation method based on deep reinforcement learning
Ilachinski Artificial intelligence and autonomy: Opportunities and challenges
Rozman The synthetic training environment
CN114841068A (en) Three-dimensional high-simulation war game deduction platform and method
CN117313561A (en) Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method
CN115310257B (en) Situation estimation method and device based on artificial potential field
CN116432895A (en) Fight method and device
Appleget An Introduction to Wargaming and Modeling and Simulation
Kupchan Setting Conventional Force Requirements: Roughly Right or Precisely Wrong?
ARAR et al. A flexible rule-based framework for pilot performance analysis in air combat simulation systems
CN114330093A (en) Multi-platform collaborative intelligent confrontation decision-making method for aviation soldiers based on DQN
CN114548674A (en) Multi-agent confrontation scene-oriented threat situation assessment method, device and equipment
Chen et al. A MADDPG-based multi-agent antagonistic algorithm for sea battlefield confrontation
Dutta Simulation in military training: Recent developments
Niland The migration of a collaborative UAV testbed into the flames simulation environment
CN114247144B (en) Multi-agent confrontation simulation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination