CN116911202B - Agent training method and device based on multi-granularity simulation training environment

Info

Publication number
CN116911202B
Authority
CN
China
Prior art keywords: model; simulation; granularity; target; target granularity
Legal status: Active
Application number: CN202311165956.2A
Other languages: Chinese (zh)
Other versions: CN116911202A (en)
Inventors: 彭渊, 吴京辉, 曹扬, 李世添, 赵思聪, 李冬雪, 贾帅楠, 薛源
Current Assignee: Beijing Aerospace Chenxin Technology Co ltd
Original Assignee: Beijing Aerospace Chenxin Technology Co ltd
Application filed by Beijing Aerospace Chenxin Technology Co ltd filed Critical Beijing Aerospace Chenxin Technology Co ltd
Priority to CN202311165956.2A priority Critical patent/CN116911202B/en
Publication of CN116911202A publication Critical patent/CN116911202A/en
Application granted granted Critical
Publication of CN116911202B publication Critical patent/CN116911202B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The application provides an agent training method and device based on a multi-granularity simulation training environment, belonging to the technical field of artificial intelligence. According to the embodiments of the application, the training task and optimization target of the decision-making agent and the element information of the simulation training scene are acquired, and a target granularity simulation training environment consisting of a target granularity attack party model, a target granularity countermeasure party model and a target granularity simulation environment model can be constructed based on the task level of the training task. Meanwhile, the decision-making agent can obtain the required current simulation state information and current reward score in a timely manner, and transmits a target granularity model control instruction to the target granularity simulation training environment based on the training task and the optimization target, thereby driving the simulation deduction in the target granularity simulation training environment and realizing the interaction between the decision-making agent and the target granularity simulation training environment. The research, development and training efficiency of decision-making agents under complex, large-scale scenarios and different task levels is thereby effectively improved.

Description

Agent training method and device based on multi-granularity simulation training environment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an agent training method and device based on a multi-granularity simulation training environment.
Background
Deep reinforcement learning is a branch of machine learning in which a decision-making agent learns by continuously interacting with a simulation training environment: during this interaction, the agent keeps acquiring new knowledge from the rewards or penalties it obtains, so the simulation training environment is critical to the training of the decision-making agent.
For example, in an attack and defense countermeasure scenario, the training tasks and role levels of the decision-making agent differ across countermeasure stages. In the opening stage of the countermeasure, the decision-making agent only needs to decide on the macro-level strategies of the attacking and defending parties and does not need to be concerned with their specific actions; a simulation training environment whose granularity reaches macro task level simulation is then sufficient for training the macro-level decision-making agent. In the process after the countermeasure has started, the decision-making agent needs to decide on the micro-level strategies of both parties, for example controlling specific actions such as the moving direction and moving distance of each party, and the simulation training environment granularity must reach micro function level simulation to support the training of the micro-level decision-making agent.
However, existing simulation training environments are rigidly integrated and single-granularity, and cannot adaptively match the training requirements of decision-making agents at different task levels. When facing intelligent game countermeasure tasks at different levels, such as macro-level strategic countermeasure and micro-level specific action selection, a simulation training environment of the corresponding granularity has to be rebuilt, and operations such as interface debugging and docking between the simulation training environment and the decision-making agent algorithm have to be performed again, which consumes time and effort and makes the research, development and training of decision-making agents inefficient. Meanwhile, current simulation training environments usually carry out the countermeasure and simulation deduction of the attacking and defending parties frame by frame at a fixed time step, so the decision-making agent cannot obtain the required state, reward and other information in a timely manner, and it is difficult to support the rapid training of large-scale decision-making agents in complex scenarios.
Disclosure of Invention
The application provides an agent training method and device based on a multi-granularity simulation training environment, so as to solve the problem that the single granularity of existing simulation training environments leads to low training efficiency for decision-making agents oriented to different task levels.
To solve the above problems, the application adopts the following technical solutions:
In a first aspect, an embodiment of the present application provides an agent training method based on a multi-granularity simulation training environment, where the method includes:
acquiring a training task and an optimization target of the decision-making agent, and element information of a simulation training scene; the element information comprises attack party model element information, countermeasure party model element information and simulation environment model element information;
determining a target granularity simulation training environment in a preset multi-granularity simulation training environment library based on a task level of the training task; wherein, different task levels correspond to simulation training environments with different granularities;
determining a target granularity attack model, a target granularity countermeasure model and a target granularity simulation environment model corresponding to the target granularity simulation training environment based on the element information;
controlling the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model to carry out simulation deduction so as to obtain current simulation state information and current rewarding points;
inputting the current simulation state information into the decision-making agent, so that the decision-making agent outputs a target granularity model control instruction based on the training task and the optimization target;
updating situation information corresponding to each of the target granularity attack party model, the target granularity countermeasure party model and the target granularity simulation environment model based on the target granularity model control instruction, and executing the step of controlling the target granularity attack party model, the target granularity countermeasure party model and the target granularity simulation environment model to carry out simulation deduction, until the current reward score meets a training cut-off condition, so as to complete the training of the decision-making agent.
In an embodiment of the present application, the multi-granularity simulation training environment library includes a macro task level simulation training environment and a micro function level simulation training environment; the step of determining a target granularity simulation training environment in a preset multi-granularity simulation training environment library based on the task level of the training task includes:
under the condition that the task level is a macroscopic task level, determining the target granularity simulation training environment as the macroscopic task level simulation training environment;
and in the case that the task level comprises a micro-function level, determining the target granularity simulation training environment as the micro-function level simulation training environment.
In an embodiment of the present application, the step of determining, based on the element information, a target granularity attack model, a target granularity countermeasure model, and a target granularity simulation environment model corresponding to the target granularity simulation training environment includes:
determining a target granularity entity model library corresponding to the target granularity simulation training environment;
determining the target granularity attack model in the target granularity entity model library based on the attack model element information;
determining the target granularity countermeasure model in the target granularity entity model library based on the countermeasure model element information;
and determining the target granularity simulation environment model in the target granularity entity model library based on the simulation environment model element information.
In an embodiment of the present application, the step of controlling the target granularity attack model, the target granularity countermeasure model, and the target granularity simulation environment model to perform simulation deduction to obtain current simulation state information and current reward points includes:
determining a target countermeasure rule based on the target granularity attack party model and the target granularity countermeasure party model;
Based on the target countermeasure rules, controlling the target granularity attack party model and the target granularity countermeasure party model to carry out simulation deduction in the target granularity simulation environment model so as to obtain the current simulation state information;
and determining the current bonus points based on the current simulation state information and a preset bonus rule.
In an embodiment of the present application, the step of inputting the current simulation state information into the decision-making agent, so that the decision-making agent outputs a target granularity model control instruction based on the training task and the optimization target, includes:
acquiring the current simulation state information through a state observation interface, and inputting the current simulation state information into the decision-making agent, so that the decision-making agent outputs a target granularity model control instruction based on the training task and the optimization target; the target granularity model control instruction comprises a target granularity strategy control instruction and/or a target granularity action control instruction.
In an embodiment of the present application, the step of updating situation information corresponding to each of the target granularity attack party model, the target granularity countermeasure party model and the target granularity simulation environment model based on the target granularity model control instruction includes:
Under the condition that the task level is a macroscopic task level, acquiring the target granularity model control instruction through a task level model control interface, and sending the target granularity model control instruction to a step updating interface so that the step updating interface updates situation information corresponding to the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model respectively based on the target granularity model control instruction;
and under the condition that the task level comprises a micro function level, acquiring the target granularity model control instruction through a function level model control interface, and sending the target granularity model control instruction to a step updating interface so that the step updating interface updates situation information corresponding to the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model respectively based on the target granularity model control instruction.
In an embodiment of the present application, the method further includes:
acquiring an initialization instruction sent by a scene reset interface, and in response to the initialization instruction, initializing situation information of the target granularity attack party model, the target granularity countermeasure party model and the target granularity simulation environment model.
In an embodiment of the present application, the method further includes:
and sending the current simulation state information to a visual interface so that the visual interface displays the current simulation state information in real time.
In an embodiment of the present application, the method further includes:
storing the current simulation state information, the target granularity model control instruction and countermeasure result information between the target granularity attack party model and the target granularity countermeasure party model;
and generating a training sample based on the current simulation state information, the target granularity model control instruction and the countermeasure result information.
In a second aspect, based on the same inventive concept, an embodiment of the present application provides an agent training device based on a multi-granularity simulation training environment, the device including:
the information acquisition module is used for acquiring a training task and an optimization target of the decision-making agent, and element information of a simulation training scene; the element information comprises attack party model element information, countermeasure party model element information and simulation environment model element information;
the environment determining module is used for determining a target granularity simulation training environment in a preset multi-granularity simulation training environment library based on the task level of the training task; wherein, different task levels correspond to simulation training environments with different granularities;
The model determining module is used for determining a target granularity attack party model, a target granularity countermeasure party model and a target granularity simulation environment model corresponding to the target granularity simulation training environment based on the element information;
the simulation deduction module is used for controlling the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model to carry out simulation deduction so as to obtain current simulation state information and current rewarding points;
the instruction output module is used for inputting the current simulation state information into the decision-making agent, so that the decision-making agent outputs a target granularity model control instruction based on the training task and the optimization target;
and the step updating module is used for updating situation information corresponding to each of the target granularity attack party model, the target granularity countermeasure party model and the target granularity simulation environment model based on the target granularity model control instruction, and executing the step of controlling the target granularity attack party model, the target granularity countermeasure party model and the target granularity simulation environment model to carry out simulation deduction, until the current reward score meets a training cut-off condition, so as to complete the training of the decision-making agent.
In an embodiment of the present application, the multi-granularity simulation training environment library includes a macro task level simulation training environment and a micro function level simulation training environment; the environment determination module includes:
the macro task level environment determination submodule is used for determining that the target granularity simulation training environment is the macro task level simulation training environment under the condition that the task level is the macro task level;
and the micro function level environment determination submodule is used for determining the target granularity simulation training environment as the micro function level simulation training environment under the condition that the task level comprises the micro function level.
In one embodiment of the present application, the model determination module includes:
the model library determining submodule is used for determining a target granularity entity model library corresponding to the target granularity simulation training environment;
an attack party model determination submodule, configured to determine, in the target granularity entity model library, the target granularity attack party model based on the attack party model element information;
an opposite side model determining submodule, configured to determine, in the target granularity entity model library, the target granularity opposite side model based on the opposite side model element information;
And the environment model determining submodule is used for determining the target granularity simulation environment model in the target granularity entity model library based on the simulation environment model element information.
In an embodiment of the present application, the simulation deduction module includes:
the countermeasure rule determining submodule is used for determining a target countermeasure rule based on the target granularity attack party model and the target granularity countermeasure party model;
the state information acquisition module is used for controlling the target granularity attack party model and the target granularity countermeasure party model to carry out simulation deduction in the target granularity simulation environment model based on the target countermeasure rule, so as to obtain the current simulation state information;
and the bonus point determining module is used for determining the current bonus point based on the current simulation state information and a preset bonus rule.
In one embodiment of the present application, the instruction output module includes:
the state information transfer module is used for acquiring the current simulation state information through a state observation interface and inputting the current simulation state information into the decision-making agent, so that the decision-making agent outputs a target granularity model control instruction based on the training task and the optimization target; the target granularity model control instruction comprises a target granularity strategy control instruction and/or a target granularity action control instruction.
In an embodiment of the present application, the step update module includes:
the first control instruction transfer module is used for, in the case that the task level is a macro task level, acquiring the target granularity model control instruction through a task-level model control interface and sending the target granularity model control instruction to a step updating interface, so that the step updating interface updates situation information corresponding to each of the target granularity attack party model, the target granularity countermeasure party model and the target granularity simulation environment model based on the target granularity model control instruction;
and the second control instruction transfer module is used for, in the case that the task level includes a micro function level, acquiring the target granularity model control instruction through a function-level model control interface and sending the target granularity model control instruction to a step updating interface, so that the step updating interface updates situation information corresponding to each of the target granularity attack party model, the target granularity countermeasure party model and the target granularity simulation environment model based on the target granularity model control instruction.
In an embodiment of the present application, the agent training device based on the multi-granularity simulation training environment further includes:
The initialization module is used for acquiring an initialization instruction sent by the scene reset interface, and responding to the initialization instruction to perform initialization operation on situation information of the target granularity attack party model, the target granularity countermeasure party model and the target granularity simulation environment model.
In an embodiment of the present application, the agent training device based on the multi-granularity simulation training environment further includes:
and the state display module is used for sending the current simulation state information to a visual interface so that the visual interface displays the current simulation state information in real time.
In an embodiment of the present application, the agent training device based on the multi-granularity simulation training environment further includes:
the information storage module is used for storing the current simulation state information, the target granularity model control instruction and countermeasure result information between the target granularity attack party model and the target granularity countermeasure party model;
and the sample generation module is used for generating training samples based on the current simulation state information, the target granularity model control instruction and the countermeasure result information.
Compared with the prior art, the application has the following advantages:
According to the agent training method based on the multi-granularity simulation training environment provided by the embodiments of the application, by acquiring the training task and optimization target of the decision-making agent and the element information of the simulation training scene, the target granularity simulation training environment can be determined in the preset multi-granularity simulation training environment library based on the task level of the training task, and the target granularity attack party model, target granularity countermeasure party model and target granularity simulation environment model corresponding to the target granularity simulation training environment can be determined based on the element information. The target granularity attack party model, target granularity countermeasure party model and target granularity simulation environment model are then controlled to carry out simulation deduction, so as to obtain current simulation state information and a current reward score. The current simulation state information is then input into the decision-making agent, so that the decision-making agent outputs a target granularity model control instruction based on the training task and the optimization target, and the situation information corresponding to each of the target granularity attack party model, the target granularity countermeasure party model and the target granularity simulation environment model is updated based on the target granularity model control instruction, until the current reward score meets the training cut-off condition and the training of the decision-making agent is completed. When facing training tasks of decision-making agents at different task levels, the embodiments of the application can quickly match and construct a suitable target granularity simulation training environment; meanwhile, the decision-making agent can obtain the required current simulation state information and current reward score in a timely manner to drive the operation of the target granularity simulation training environment, so that the interaction between the decision-making agent and the target granularity simulation training environment is realized quickly, and the research, development and training efficiency of decision-making agents under complex, large-scale scenarios and different task levels is effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of steps of an agent training method based on a multi-granularity simulation training environment in an embodiment of the application.
FIG. 2 is a schematic diagram of functional modules of an intelligent agent training apparatus based on a multi-granularity simulation training environment according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
At present, when training is performed on decision-making agents facing different task levels, simulation training environments corresponding to the task levels need to be matched.
The following takes an unmanned aerial vehicle attack and defense countermeasure scenario in a no-fly zone as an example. The red party is set as unmanned aerial vehicle capturing devices, the blue party is an unmanned aerial vehicle, and the training task is to train a red-party decision-making agent so that it can counter the blue party and successfully capture the unmanned aerial vehicle. The level of the red-party agent's training task differs across countermeasure stages. In the opening stage of the countermeasure, training focuses on a red-party plan deployment decision-making agent: the training target is to learn macro-level force arrangement and deployment strategies, such as the number, types and distribution of the red-party capturing devices participating in the countermeasure under the constraint of the red party's available resources. At this stage, attention is mainly paid to the capability attributes and application scope of each capturing device, and there is no need to be concerned with details such as the capture timing and motion trajectory of a particular capturing device in each simulation step of the scenario; task-level simulation of the training environment granularity is therefore sufficient for training the macro-level red-party plan deployment decision-making agent. In the process after the countermeasure has started, training focuses on a decision-making agent for the specific actions of a particular red-party capturing device: the training target requires the decision-making agent to precisely control specific actions of that capturing device, such as acceleration, turning, descent and lift-off, and to attend to more detailed functions; function-level simulation of the training environment granularity is therefore needed to support training of the micro-level red-party action decision-making agent.
When training a macro task level agent, the requirements can be met without attending to details; if a micro function level simulation training environment were used to interact with the macro task level agent, unnecessary time would be wasted and training would be inefficient and slow. When training a micro function level agent, specific details must be attended to, so a micro function level simulation training environment is required, and a task-level simulation training environment cannot meet the training requirements.
However, because existing simulation training environments are rigidly integrated and single-granularity, when facing intelligent game countermeasure tasks at different levels, such as macro-level strategic countermeasure and micro-level specific action selection, a simulation training environment of the corresponding granularity has to be rebuilt, and operations such as interface debugging and docking between the simulation training environment and the decision-making agent algorithm have to be performed again, which consumes time and effort and makes the research, development and training of decision-making agents inefficient. Meanwhile, current simulation training environments usually carry out the countermeasure and simulation deduction of the attacking and defending parties frame by frame according to fixed countermeasure rules and a fixed time step, so the decision-making agent cannot obtain the required state, reward and other information in a timely manner, and it is difficult to support the rapid training of large-scale decision-making agents in complex scenarios.
To address the problem that the single granularity of existing simulation training environments leads to low training efficiency for decision-making agents at different task levels, the application provides an agent training method based on a multi-granularity simulation training environment. When facing the training tasks of decision-making agents at different task levels, the method can quickly match and construct a suitable target granularity simulation training environment; meanwhile, the decision-making agent can obtain the required current simulation state information and current reward score in a timely manner to drive the operation of the target granularity simulation training environment, so that the interaction between the decision-making agent and the target granularity simulation training environment is realized quickly, and the research, development and training efficiency of decision-making agents under complex, large-scale scenarios and different task levels is effectively improved.
Referring to fig. 1, there is shown an agent training method based on a multi-granularity simulation training environment of the present application, which may include the steps of:
s101: and acquiring the training task and the optimization target of the decision-making agent and element information of the simulation training scene.
In the present embodiment, the element information includes attack party model element information, countermeasure party model element information and simulation environment model element information. Specifically, the attack party model element information includes the number, types and positions of the attack party models; the countermeasure party model element information includes the number, types and positions of the countermeasure party models; and the simulation environment model element information includes the number, types and positions of the simulation environment models.
In one example, taking the unmanned aerial vehicle attack and defense countermeasure scenario as an example, the attack party is set as unmanned aerial vehicle capturing devices, the countermeasure party is an unmanned aerial vehicle, and the training objective is to train the attack party decision-making agent so that it can counter the opposing party and successfully capture the unmanned aerial vehicle. The training task of the decision-making agent may be an optimal resource deployment scheme for the unmanned aerial vehicle capturing devices, an optimal task allocation scheme for the unmanned aerial vehicle capturing devices, an optimal capture timing for the unmanned aerial vehicle capturing devices, and so on. The optimization target is the behavior preference of each unmanned aerial vehicle capturing device during the countermeasure in the training task; for example, the optimization target of some unmanned aerial vehicle capturing devices is interference, while that of others is capture. The attack party model element information may include the number, types and takeoff positions of the unmanned aerial vehicle capturing device models; the countermeasure party model element information may include the number, type and takeoff position of the unmanned aerial vehicle model; and the simulation environment model element information may include information such as the shape and position of the terrain model and the weather condition and duration of the weather model.
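As an illustrative sketch only, the element information for this drone scenario could be organized as a simple configuration structure; the patent does not prescribe any data format, and all field names below are assumptions:

```python
# Illustrative sketch of element information for the drone capture scenario.
# The structure and all field names are assumptions, not prescribed by this application.
scenario_elements = {
    "training_task": "optimal_capture_device_deployment",
    "optimization_target": {"device_1": "interference", "device_2": "capture"},
    "attack_party_models": {          # unmanned aerial vehicle capturing devices
        "count": 8,
        "types": {"A": 3, "B": 5},
        "takeoff_positions": [(0.0, 0.0), (1.2, 0.5)],  # example coordinates
    },
    "countermeasure_party_models": {  # unmanned aerial vehicles to be captured
        "count": 2,
        "types": {"C": 2},
        "takeoff_positions": [(5.0, 5.0), (6.0, 4.5)],
    },
    "simulation_environment_models": {
        "terrain": {"type": "mountain", "count": 5},
        "rivers": {"count": 2},
        "weather": {"condition": "clear", "duration_s": 3600},
    },
}
```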
It should be noted that, aiming at the training task and the optimization target of the decision-making agent and the element information of the simulation training scene, the user can perform the custom setting according to the actual application scene of the decision-making agent, and the embodiment does not make specific restrictions on the parameter information.
S102: and determining a target granularity simulation training environment in a preset multi-granularity simulation training environment library based on a task level of the training task.
In this embodiment, each training task has a corresponding task level, and different task levels correspond to simulation training environments of different granularities. The task hierarchy can be divided into two levels, the macro task level and the micro function level: the macro task level indicates that the training task focuses on the overall strategy and not on specific details, while the micro function level focuses on the specific details under the overall strategy.
In this embodiment, taking the unmanned aerial vehicle attack and defense countermeasure scenario as an example, the macro task level training task may be a training task with an optimal unmanned aerial vehicle capturing device resource deployment scheme and/or an optimal unmanned aerial vehicle capturing device task allocation scheme, and the micro function level training task may be a training task with an optimal unmanned aerial vehicle capturing device acceleration timing, an optimal turning timing, an optimal descending timing and/or an optimal lift-off timing.
In a specific implementation, to satisfy the macro task level and the micro function level training tasks, the multi-granularity simulation training environment library may specifically include a macro task level simulation training environment and a micro function level simulation training environment. Further, under the condition that the task level is a macroscopic task level, determining that the target granularity simulation training environment is a macroscopic task level simulation training environment; in the case that the task level includes a micro-functional level, the target granularity simulation training environment is determined to be a micro-functional level simulation training environment.
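A minimal sketch of this selection logic, assuming a two-entry environment library keyed by task level; the class and function names are hypothetical and only illustrate the rule that the micro function level environment is chosen whenever the task level includes the micro function level:

```python
from enum import Enum

class TaskLevel(Enum):
    MACRO_TASK = "macro_task"
    MICRO_FUNCTION = "micro_function"

class MacroTaskLevelEnv:
    """Hypothetical macro task level simulation training environment."""
    def __init__(self, elements):
        self.elements = elements

class MicroFunctionLevelEnv:
    """Hypothetical micro function level simulation training environment."""
    def __init__(self, elements):
        self.elements = elements

# Hypothetical preset multi-granularity simulation training environment library.
ENV_LIBRARY = {
    TaskLevel.MACRO_TASK: MacroTaskLevelEnv,
    TaskLevel.MICRO_FUNCTION: MicroFunctionLevelEnv,
}

def select_target_env(task_levels, elements):
    """Micro function level wins whenever it appears among the task levels (S102)."""
    if TaskLevel.MICRO_FUNCTION in task_levels:
        return ENV_LIBRARY[TaskLevel.MICRO_FUNCTION](elements)
    return ENV_LIBRARY[TaskLevel.MACRO_TASK](elements)
```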
S103: and determining a target granularity attack model, a target granularity countermeasure model and a target granularity simulation environment model corresponding to the target granularity simulation training environment based on the element information.
In this embodiment, after the target granularity simulation training environment is determined, the target entity models of the target granularity required to generate the target granularity simulation training environment are determined based on the element information. The target entity models specifically include the target granularity attack party model, the target granularity countermeasure party model and the target granularity simulation environment model.
In the embodiment, the target granularity simulation training environment can be quickly constructed by calling the target entity model with the corresponding granularity, so that the countermeasure simulation deduction of the countermeasure two parties on the corresponding task level is realized in the target granularity simulation training environment.
S104: and controlling the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model to carry out simulation deduction so as to obtain the current simulation state information and the current rewarding score.
In this embodiment, by controlling the target granularity attack model, the target granularity countermeasure model, and the target granularity simulation environment model to perform simulation deduction, current simulation state information can be output in real time, where the current simulation state information reflects situation information corresponding to each of the target granularity attack model, the target granularity countermeasure model, and the target granularity simulation environment model, and the situation information is model state information required by the decision agent to make a decision.
In an exemplary case of the unmanned aerial vehicle attack and defense countermeasure scenario, the current simulation state information may include position information of the unmanned aerial vehicle capturing device and position information of the unmanned aerial vehicle, and the corresponding reward score may be calculated based on the position information of the unmanned aerial vehicle capturing device and the position information of the unmanned aerial vehicle.
S105: and inputting the current simulation state information into the decision-making agent so that the decision-making agent can output and obtain a control instruction of the target granularity model based on the training task and the optimization target.
In this embodiment, by inputting the current simulation state information into the decision-making agent, the decision-making agent can quickly analyze and make a decision on the current simulation state information based on the training task and the optimization target, and generate a corresponding target granularity model control instruction for instructing the target granularity attack model, the target granularity countermeasure model, and the target granularity simulation environment model in the target granularity simulation training environment to make a next decision action.
S106: updating situation information corresponding to the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model based on the target granularity model control instruction, and executing the step of controlling the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model to carry out simulation deduction until the current reward score meets the training cut-off condition, so as to complete training of the decision intelligent agent.
In this embodiment, after the decision-making agent makes a decision, the target granularity model control instruction is transmitted to the target granularity simulation training environment to update the situation information of the target entity models in the target granularity simulation training environment, and steps S104 to S106 are executed repeatedly to realize continuous interaction between the decision-making agent and the target granularity simulation training environment, until the current reward score meets the training cut-off condition and the training of the decision-making agent is completed. The training cut-off condition may be, for example, that the current reward score is greater than a reward score threshold.
In this embodiment, by constructing the corresponding target granularity simulation training environment according to the task level of the training task and controlling the target entity models of the target granularity in that environment to carry out simulation deduction, the current simulation state information and current reward score can be output in real time; the decision-making agent can quickly make a decision based on them and transmit the corresponding target granularity model control instruction to the target granularity simulation training environment, driving the target entity models to continue the simulation deduction. Compared with the traditional single-granularity simulation training environment, which carries out the countermeasure and simulation deduction of the attacking and defending parties frame by frame according to fixed countermeasure rules and a fixed time step, this quickly realizes the interaction between the decision-making agent and the target granularity simulation training environment and effectively improves the research, development and training efficiency of decision-making agents under complex, large-scale scenarios and different task levels.
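A minimal sketch of the interaction loop described in steps S104 to S106, written in a gym-like style; the environment methods, the learning call and the reward threshold below are assumptions for illustration, not the prescribed implementation:

```python
def train_agent(agent, env, reward_threshold=100.0, max_episodes=1000):
    """Hypothetical sketch of the S104-S106 interaction loop.

    `env` stands for the target granularity simulation training environment and
    `agent` for the decision-making agent; all method names are assumptions."""
    for _ in range(max_episodes):
        env.reset()                                  # scene reset interface (Reset)
        while not env.deduction_finished():
            state = env.get_state()                  # state observation interface (GetState)
            reward = env.compute_reward(state)       # current reward score
            if reward >= reward_threshold:           # training cut-off condition
                return agent
            instruction = agent.act(state)           # target granularity model control instruction
            agent.learn(state, reward, instruction)  # reinforcement learning update
            env.step(instruction)                    # step updating interface (Step)
    return agent
```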
In a possible embodiment, S103 may specifically include the following substeps:
s103-1: and determining a target granularity entity model library corresponding to the target granularity simulation training environment.
In this embodiment, simulation training environments with different granularities are provided with pre-built solid model libraries with different granularities, and attack party models, countermeasure party models and simulation environment models with different granularities are stored in the solid model libraries with different granularities.
S103-2: and determining a target granularity attack model in a target granularity entity model library based on the attack model element information.
In this embodiment, based on the number and type information of the attack party models indicated by the attack party model element information, attack party models of the corresponding number and types are called from the target granularity entity model library, for example, three type A attack party models and five type B attack party models.
S103-3: and determining a target granularity countermeasure model in a target granularity entity model library based on the countermeasure model element information.
In this embodiment, based on the number and type information of the opposite party models corresponding to the opposite party model element information, the opposite party models of the corresponding number and type are called from the target granularity entity model library, for example, two C-type opposite party models are called.
S103-4: and determining a target granularity simulation environment model in a target granularity entity model library based on the simulation environment model element information.
In this embodiment, based on the number and type information of the simulation environment models corresponding to the element information of the simulation environment models, the simulation environment models with the corresponding number and types, for example, two D-type simulation environment models and five E-type simulation environment models, are called from the target granularity entity model library.
Taking an unmanned aerial vehicle attack and defense countermeasure scene as an example, three A-type unmanned aerial vehicle capturing devices and five B-type unmanned aerial vehicle capturing devices can be called in a target granularity entity model library based on the attack model element information; calling two C-type unmanned aerial vehicles based on the countermeasure model element information; based on the element information of the simulation environment model, two river models and five mountain models are called. And further, in a capturing scene formed by two river models and five mountain models, three A-type unmanned aerial vehicle capturing devices and five B-type unmanned aerial vehicle capturing devices are controlled to carry out capturing simulation on two C-type unmanned aerial vehicles.
It should be noted that, when a certain type of entity model is needed in the simulation training scene but does not yet exist in the target granularity entity model library, the user can construct that entity model and import it into the target granularity entity model library so that it can be called. The user can also update, delete and add entity models in the target granularity entity model library according to actual needs.
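A sketch of how the lookup in S103-2 to S103-4 could be realized; the registry class and its method names are hypothetical:

```python
class EntityModelLibrary:
    """Hypothetical target granularity entity model library (S103-1 to S103-4)."""

    def __init__(self):
        self._registry = {}  # (category, model_type) -> model class

    def register(self, category, model_type, model_cls):
        # Users may import, update or replace entity model types as needed.
        self._registry[(category, model_type)] = model_cls

    def instantiate(self, category, element_info):
        """Instantiate models of the requested types and counts, e.g.
        element_info = {"types": {"A": 3, "B": 5}}."""
        models = []
        for model_type, count in element_info["types"].items():
            model_cls = self._registry[(category, model_type)]
            models.extend(model_cls() for _ in range(count))
        return models
```

For the drone example, registering capture device classes under an "attack_party" category and calling instantiate("attack_party", {"types": {"A": 3, "B": 5}}) would return three type A and five type B capture device models.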
In a possible embodiment, S104 may specifically include the following substeps:
s104-1: a target countermeasure rule is determined based on the target granularity attack party model and the target granularity countermeasure party model.
It should be noted that the attack party models with different granularities have different attack strategies, and the countermeasure party models with different granularities have different countermeasure strategies.
In a specific implementation, determining a target attack strategy corresponding to the target granularity attack party model and a target countermeasure strategy corresponding to the target granularity countermeasure party model; and further obtaining the target countermeasure rule based on the target attack strategy and the target countermeasure strategy.
S104-2: and based on the target countermeasure rules, controlling the target granularity attack party model and the target granularity countermeasure party model to carry out simulation deduction in the target granularity simulation environment model so as to obtain the current simulation state information.
In a specific implementation, the target granularity attack model performs attack operation in the target granularity simulation environment model according to the target attack strategy, and the target granularity countermeasure model performs countermeasure in the target granularity simulation environment model according to the target countermeasure strategy, so that game countermeasure between the target granularity attack model and the target granularity countermeasure model is realized, and current simulation state information is output after each round of game countermeasure is performed.
S104-3: and determining the current bonus points based on the current simulation state information and a preset bonus rule.
In this embodiment, the current simulation state information includes model state information of the target granularity attack side model and the target granularity countermeasure side model.
In a specific implementation, according to a preset rewarding rule, the current rewarding score is calculated through model state information of a target granularity attack party model and a target granularity countermeasure party model.
It should be noted that, the reward rule is associated with the training task of the decision-making agent, if the decision made by the decision-making agent makes the training task develop to a better direction, a positive reward is given; conversely, a negative reward, i.e., a penalty, is given.
Illustratively, the current simulation state information may include the position information of the unmanned aerial vehicle capturing device and the position information of the unmanned aerial vehicle; based on these positions, the relative distance between the unmanned aerial vehicle capturing device and the unmanned aerial vehicle can be calculated, and the corresponding reward score can then be calculated from the relative distance and the corresponding reward rule. The reward rule may be: when the relative distance is less than a distance threshold, a positive reward score is given, and the smaller the relative distance, the larger the reward score; when the relative distance is greater than or equal to the distance threshold, a negative reward score is given, and the larger the relative distance, the smaller the reward score.
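The distance-based reward rule described above can be sketched as follows; the threshold and scaling constants are illustrative assumptions:

```python
import math

def capture_reward(device_pos, uav_pos, distance_threshold=50.0, scale=10.0):
    """Illustrative reward rule: positive and growing as the capture device closes in,
    negative and shrinking further as it falls farther outside the distance threshold."""
    dx, dy = device_pos[0] - uav_pos[0], device_pos[1] - uav_pos[1]
    relative_distance = math.hypot(dx, dy)
    if relative_distance < distance_threshold:
        # Smaller distance -> larger positive reward.
        return scale * (distance_threshold - relative_distance) / distance_threshold
    # Larger distance -> smaller (more negative) reward, i.e. a penalty.
    return -scale * (relative_distance - distance_threshold) / distance_threshold
```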
In one possible embodiment, S105 may specifically include the following substeps:
s105-1: the current simulation state information is obtained through the state observation interface, and is input into the decision-making agent, so that the decision-making agent outputs and obtains a control instruction of the target granularity model based on the training task and the optimization target.
In this embodiment, in order to enable the target granularity simulation training environment to timely feed back the current simulation state information to the decision-making agent, the state observation interface is configured to obtain the current simulation state information of the target granularity simulation training environment.
In a specific implementation, a state observation interface can be constructed based on a GetState function, a GetState command is sent to a target granularity simulation training environment through the state observation interface, so that the target granularity simulation training environment returns current simulation state information, the current simulation state information is transmitted to a decision-making agent, the decision-making agent can quickly make a decision, and a target granularity model control instruction is output and obtained.
It should be noted that, the target granularity model control instruction includes a target granularity policy control instruction and/or a target granularity action control instruction, that is, the decision-making agent may transmit a policy and/or an action instruction with a corresponding granularity to the target granularity simulation training environment, so that the target entity model in the target granularity simulation training environment may make operations such as policy change and/or action change.
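A sketch of the state observation interface and of the two kinds of control instruction; the class and field names are assumptions and only mirror the description above:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class GranularityControlInstruction:
    """Target granularity model control instruction: a strategy-level part,
    an action-level part, or both (field names are illustrative)."""
    strategy_commands: Optional[Dict[str, str]] = None  # e.g. {"device_1": "intercept_sector_north"}
    action_commands: Optional[Dict[str, str]] = None    # e.g. {"device_1": "accelerate"}

class StateObservationInterface:
    """Wraps the GetState request so the decision-making agent can poll the
    target granularity simulation training environment for its current state."""
    def __init__(self, env):
        self.env = env

    def get_state(self) -> Dict:
        # Send a GetState command and return the current simulation state information.
        return self.env.get_state()
```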
In one possible embodiment, S106 may specifically include the following substeps:
s106-1: under the condition that the task level is a macroscopic task level, a target granularity model control instruction is obtained through a task level model control interface, and the target granularity model control instruction is sent to a step updating interface, so that the step updating interface updates situation information corresponding to a target granularity attack model, a target granularity countermeasure model and a target granularity simulation environment model respectively based on the target granularity model control instruction.
In this embodiment, to realize control of the decision-making agent on the simulation training environments with different granularities, model control interfaces of corresponding levels are set based on task levels, for example, in the case that the task levels include a macroscopic task level and a microscopic function level, a task-level model control interface and a function-level model control interface are correspondingly set, and meanwhile, data interaction between the decision-making agent and the macroscopic task-level simulation training environment and data interaction between the decision-making agent and the microscopic function-level simulation training environment are realized.
In a specific implementation, under the condition that the task level is only a macroscopic task level, as a model control instruction of a microscopic function level is not required to be transmitted, a target granularity model control instruction output by a decision agent is acquired through a task level model control interface, after a corresponding task level Action command is generated, the task level Action command is sent to a Step update interface, and the Step update interface generates a corresponding task level Step command based on the task level Action command and transmits the corresponding task level Step command to a target granularity simulation training environment, so that a target entity model in the target granularity simulation training environment performs operations such as strategy change and/or Action change, for example, new task generation, new scheme generation, task change and other situation update operations.
S106-2: under the condition that the task level comprises a micro function level, a target granularity model control instruction is obtained through a function level model control interface, and the target granularity model control instruction is sent to a step updating interface, so that the step updating interface updates situation information corresponding to a target granularity attack model, a target granularity countermeasure model and a target granularity simulation environment model respectively based on the target granularity model control instruction.
It should be noted that the task level including the micro function level covers both the case where the task level includes only the micro function level and the case where it includes both the macro task level and the micro function level; in either case, the function level model control interface is required to realize the interaction between the decision-making agent and the target granularity simulation training environment.
In a specific implementation, under the condition that the task level comprises a micro function level, because a model control instruction of the micro function level needs to be transmitted, a target granularity model control instruction output by a decision-making agent is acquired through a function level model control interface, after a corresponding function level Action command is generated, the function level Action command is sent to a Step update interface, and the Step update interface generates a corresponding function level Step command based on the function level Action command and transmits the corresponding function level Step command to a target granularity simulation training environment, so that a target entity model in the target granularity simulation training environment performs operations such as strategy change and/or Action change, for example, situation update operations such as acceleration, turning, descending, lifting, decoy, attack, defense and the like are performed.
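The routing of control instructions through the task-level or function-level model control interface onto the step updating interface can be sketched as follows; it reuses the TaskLevel enum from the earlier sketch, and the interface attributes and method names are assumptions:

```python
def dispatch_control_instruction(env, task_levels, instruction):
    """Route the target granularity model control instruction (S106-1 / S106-2).

    Hypothetical interface attribute and method names on `env`."""
    if TaskLevel.MICRO_FUNCTION in task_levels:
        # Function level model control interface -> function-level Action command.
        action_cmd = env.function_level_control.make_action(instruction)
    else:
        # Macro task level only -> task-level Action command.
        action_cmd = env.task_level_control.make_action(instruction)
    # The step updating interface turns the Action command into a Step command and
    # updates the situation information of the attack party, countermeasure party
    # and simulation environment models.
    env.step_interface.step(action_cmd)
```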
In one possible embodiment, the agent training method based on the multi-granularity simulation training environment may further include the following steps:
S201: acquiring an initialization instruction sent by the scene reset interface, and, in response to the initialization instruction, initializing situation information of the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model.
In this embodiment, considering that in an actual training process a single complete simulation deduction in the target granularity simulation environment model may not be enough to finish training the decision-making agent, a scene reset interface is provided to initialize the target granularity simulation training environment and start the next simulation deduction, thereby improving the training efficiency of the decision-making agent.
In a specific implementation, the initialization instruction may be triggered manually by the user, or triggered automatically when it is detected that a simulation deduction has finished while the current reward score does not yet meet the training cut-off condition. When invoked, the scene reset interface sends a Reset command to the target granularity simulation training environment, which initializes the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model so that their situation information returns to the initial state. After the initialization of the target granularity simulation training environment is completed, a new round of simulation deduction can be triggered automatically; once the current reward score meets the training cut-off condition in some round of simulation deduction, training of the decision-making agent stops automatically.
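A minimal reset-and-deduce loop corresponding to the above behaviour might look as follows, assuming a Gym-style environment whose reset() returns the initial simulation state and whose step() returns (state, reward, done), and an agent object exposing act() and learn(); these names and the 0.95 cut-off value are assumptions for illustration only.

    from typing import Any

    def run_training(env: Any, agent: Any, max_episodes: int = 100, reward_cutoff: float = 0.95) -> None:
        """Reset-and-deduce loop: keep running simulation deductions until the current
        reward score meets the training cut-off condition (here an arbitrary 0.95)."""
        for _episode in range(max_episodes):
            state = env.reset()                       # Reset command: situation back to the initial state
            done, reward = False, 0.0
            while not done:
                instruction = agent.act(state)        # target granularity model control instruction
                state, reward, done = env.step(instruction)
            agent.learn()                             # update the decision-making agent after this round
            if reward >= reward_cutoff:               # training cut-off condition satisfied
                break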
In one possible embodiment, the agent training method based on the multi-granularity simulation training environment may further include the following steps:
S301: sending the current simulation state information to a visual interface so that the visual interface displays the current simulation state information in real time.
In this embodiment, in order to visualize the simulation deduction, a visualization interface is provided; a Render command is sent to the target granularity simulation training environment through the visualization interface, so that the target granularity simulation training environment returns the current simulation state information to the visualization interface, which performs a visualization operation on the information and displays it through a man-machine interaction interface. The current simulation state information comprises the current situation information of the target granularity attack party model, the target granularity countermeasure party model and the target granularity simulation environment model.
In the embodiment, the current simulation state information is subjected to visualization processing through the visualization interface, so that the current situation information of the target granularity simulation training environment can be intuitively displayed to a user, and the user can monitor and control the simulation deduction process in real time.
In this embodiment, through standardized intelligent APIs (Application Programming Interfaces) such as the state observation interface, the task-level model control interface, the function-level model control interface, the step update interface, the scene reset interface and the visualization interface, the decision-making agent can quickly acquire the current simulation state information of the target granularity simulation training environment, make corresponding decision actions, and transmit the target granularity model control instructions to the target granularity simulation training environment to drive the operation of each target entity model therein, so that the interaction between the decision-making agent and the target granularity simulation training environment, and hence the training and application of the decision-making agent, can be completed quickly.
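To make the cooperation of these standardized interfaces concrete, the following toy environment wrapper bundles state observation, step update, scene reset and visualization into one class. The class name MultiGranularitySimEnv, the attacker/opponent position fields and the distance-based reward are illustrative assumptions only and do not reproduce the actual entity models or reward rules of the application.

    from typing import Any, Dict, Tuple

    class MultiGranularitySimEnv:
        """Toy environment bundling the standardized interfaces described above:
        state observation, step update, scene reset and visualization."""

        def __init__(self, task_level: str = "function") -> None:
            self.task_level = task_level         # "task" (macroscopic) or "function" (microscopic)
            self.situation: Dict[str, Dict[str, Any]] = {}
            self.reset()

        def reset(self) -> Dict[str, Any]:
            # Scene reset interface: return every entity model to its initial situation.
            self.situation = {
                "attacker": {"position": [0.0, 0.0, 0.0]},        # e.g. UAV-capturing device
                "opponent": {"position": [100.0, 50.0, 30.0]},    # e.g. UAV
            }
            return self.observe()

        def observe(self) -> Dict[str, Any]:
            # State observation interface: current simulation state information.
            return {name: dict(entity) for name, entity in self.situation.items()}

        def step(self, instruction: Dict[str, Any]) -> Tuple[Dict[str, Any], float, bool]:
            # Step update interface: apply the control instruction, then score the new situation.
            move = instruction.get("attacker_move", [0.0, 0.0, 0.0])
            self.situation["attacker"]["position"] = [
                p + d for p, d in zip(self.situation["attacker"]["position"], move)
            ]
            reward = self._reward()
            return self.observe(), reward, reward >= 1.0

        def render(self) -> None:
            # Visualization interface: here only a textual display of the current situation.
            print(f"attacker={self.situation['attacker']} opponent={self.situation['opponent']}")

        def _reward(self) -> float:
            # Toy reward rule: the closer the attacker is to the opponent, the higher the score.
            a = self.situation["attacker"]["position"]
            o = self.situation["opponent"]["position"]
            distance = sum((x - y) ** 2 for x, y in zip(a, o)) ** 0.5
            return 1.0 if distance < 1.0 else 1.0 / (1.0 + distance)

    # Example interaction following the observe / step / render pattern.
    env = MultiGranularitySimEnv()
    state, reward, done = env.step({"attacker_move": [10.0, 5.0, 3.0]})
    env.render()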
In one possible embodiment, the agent training method based on the multi-granularity simulation training environment may further include the following steps:
s401: and storing the current simulation state information, the target granularity model control instruction and the countermeasure result information between the target granularity attack party model and the target granularity countermeasure party model.
S402: and generating a training sample based on the current simulation state information, the target granularity model control instruction and the countermeasure result information.
In this embodiment, in the simulation deduction process, relevant information of each game countermeasure of the target granularity attack side model and the target granularity countermeasure side model is collected, which specifically includes current simulation state information input to the decision-making agent, a target granularity model control instruction output based on the current simulation state information, and countermeasure result information generated by the target granularity attack side model and the target granularity countermeasure side model performing game countermeasure based on the target granularity model control instruction.
In a specific implementation, training samples may be numbered according to the number of game countermeasures between the target granularity attack side model and the target granularity countermeasure side model; for example, if 100 game countermeasures are performed, 100 training samples may be generated, where each training sample includes the current simulation state information, the target granularity model control instruction and the countermeasure result information of the corresponding game countermeasure.
In this embodiment, the intermediate process data required for training the decision-making agent, such as the model state information, the agent decision information and the countermeasure results of each game countermeasure, are automatically stored and processed, so that a high-quality training sample set required for training the decision-making agent can be formed; the sample set can also interact directly with external agent algorithms, effectively meeting the training requirements of more decision-making agents.
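A possible, purely illustrative way to persist one training sample per game countermeasure is sketched below; the TrainingSample fields, the JSON-lines file format and the file name samples.jsonl are assumptions, not part of the described storage module.

    import json
    from dataclasses import asdict, dataclass
    from typing import Any, Dict, List

    @dataclass
    class TrainingSample:
        round_id: int                          # sequence number of the game countermeasure
        simulation_state: Dict[str, Any]       # current simulation state information fed to the agent
        control_instruction: Dict[str, Any]    # target granularity model control instruction it produced
        countermeasure_result: Dict[str, Any]  # result of the confrontation in that round

    def save_samples(samples: List[TrainingSample], path: str) -> None:
        # One JSON record per game countermeasure, so an external agent algorithm can read
        # the sample set directly as training data.
        with open(path, "w", encoding="utf-8") as f:
            for sample in samples:
                f.write(json.dumps(asdict(sample)) + "\n")

    # Example: 100 game countermeasures yield 100 numbered training samples.
    samples = [
        TrainingSample(i, {"distance_m": 50.0 - i * 0.4}, {"action": "turn"}, {"captured": i > 90})
        for i in range(100)
    ]
    save_samples(samples, "samples.jsonl")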
In a second aspect, based on the same inventive concept, referring to fig. 2, an embodiment of the present application provides an agent training device 200 based on a multi-granularity simulation training environment, the agent training device 200 based on the multi-granularity simulation training environment comprising:
the information acquisition module 201 is configured to acquire element information of training tasks and optimization targets of decision-making agents and simulation training scenes; the element information includes attack side model element information, countermeasure side model element information, and simulation environment model element information.
The environment determining module 202 is configured to determine a target granularity simulation training environment in a preset multi-granularity simulation training environment library based on a task level of a training task; wherein different task levels correspond to simulation training environments of different granularities.
The model determining module 203 is configured to determine, based on the element information, a target granularity attack model, a target granularity countermeasure model, and a target granularity simulation environment model corresponding to the target granularity simulation training environment.
The simulation deduction module 204 is used for controlling the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model to carry out simulation deduction so as to obtain the current simulation state information and the current reward score.
The instruction output module 205 is configured to input the current simulation state information into the decision-making agent, so that the decision-making agent outputs and obtains a control instruction of the target granularity model based on the training task and the optimization target.
The step updating module 206 is configured to update situation information corresponding to the target granularity attack model, the target granularity countermeasure model, and the target granularity simulation environment model based on the target granularity model control instruction, and perform a step of controlling simulation deduction of the target granularity attack model, the target granularity countermeasure model, and the target granularity simulation environment model until the current reward score meets a training cut-off condition, thereby completing training of the decision agent.
In one embodiment of the application, the multi-granularity simulation training environment library comprises a macroscopic task level simulation training environment and a microscopic function level simulation training environment. The environment determination module 202 includes:
and the macro task level environment determination submodule is used for determining that the target granularity simulation training environment is a macro task level simulation training environment under the condition that the task level is a macro task level.
And the micro function level environment determination submodule is used for determining that the target granularity simulation training environment is a micro function level simulation training environment under the condition that the task level comprises the micro function level.
In one embodiment of the present application, the model determining module 203 includes:
the model library determination submodule is used for determining a target granularity entity model library corresponding to the target granularity simulation training environment.
The attack party model determining submodule is used for determining a target granularity attack party model in the target granularity entity model library based on the attack party model element information.
The countermeasure model determination submodule is used for determining a target granularity countermeasure model in the target granularity entity model library based on the countermeasure model element information.
The environment model determining submodule is used for determining a target granularity simulation environment model in the target granularity entity model library based on simulation environment model element information.
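As an illustration of how the model determining module might resolve element information against a target granularity entity model library, consider the following sketch; the library contents, the model_name key and the matching rule are hypothetical stand-ins for whatever lookup logic the module actually applies.

    from typing import Any, Dict

    # Hypothetical target granularity entity model library, keyed by model name.
    ENTITY_MODEL_LIBRARY: Dict[str, Dict[str, str]] = {
        "uav_catcher_task_level":      {"granularity": "task", "role": "attacker"},
        "uav_catcher_function_level":  {"granularity": "function", "role": "attacker"},
        "uav_function_level":          {"granularity": "function", "role": "countermeasure"},
        "airspace_env_function_level": {"granularity": "function", "role": "environment"},
    }

    def select_model(element_info: Dict[str, Any], granularity: str, role: str) -> Dict[str, str]:
        """Resolve element information to an entity model of the requested granularity and role.

        element_info is assumed to carry at least a 'model_name' entry."""
        name = element_info["model_name"]
        model = ENTITY_MODEL_LIBRARY[name]
        if model["granularity"] != granularity or model["role"] != role:
            raise ValueError(f"{name} does not match granularity={granularity!r}, role={role!r}")
        return model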
In one embodiment of the present application, the simulation deduction module 204 includes:
and the countermeasure rule determination submodule is used for determining the target countermeasure rule based on the target granularity attack party model and the target granularity countermeasure party model.
The state information acquisition module is used for controlling the target granularity attack party model and the target granularity countermeasure party model to carry out simulation deduction in the target granularity simulation environment model based on the target countermeasure rule so as to obtain the current simulation state information.
And the reward score determining module is used for determining the current reward score based on the current simulation state information and a preset reward rule.
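For illustration, a preset reward rule of the kind used by this submodule could be evaluated as in the sketch below; the position keys, the distance-based term and the capture bonus are assumptions chosen to match the unmanned-aerial-vehicle capture scenario of the claims, not the actual rule.

    from typing import Any, Dict

    def compute_reward(state: Dict[str, Any], reward_rule: Dict[str, float]) -> float:
        """Score the current simulation state against a preset reward rule.

        Assumes the state carries 'attacker_position' / 'opponent_position' triples and an
        optional 'captured' flag; the rule weights closeness and a capture bonus."""
        ax, ay, az = state["attacker_position"]
        ox, oy, oz = state["opponent_position"]
        distance = ((ax - ox) ** 2 + (ay - oy) ** 2 + (az - oz) ** 2) ** 0.5
        reward = reward_rule["distance_weight"] / (1.0 + distance)
        if state.get("captured", False):
            reward += reward_rule["capture_bonus"]
        return reward

    # Example: reward for a state 10 m away from the target under a simple rule.
    example = compute_reward(
        {"attacker_position": [0.0, 0.0, 0.0], "opponent_position": [6.0, 8.0, 0.0]},
        {"distance_weight": 1.0, "capture_bonus": 10.0})
    # distance = 10, so example == 1.0 / 11.0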
In one embodiment of the present application, the instruction output module 205 includes:
the state information transfer module is used for acquiring current simulation state information through the state observation interface, inputting the current simulation state information into the decision-making agent, so that the decision-making agent outputs and obtains a target granularity model control instruction based on a training task and an optimization target; the target granularity model control instructions include target granularity policy control instructions and/or target granularity action control instructions.
In one embodiment of the present application, the step update module 206 includes:
the first control instruction transfer module is used for acquiring a target granularity model control instruction through the task-level model control interface under the condition that the task level is a macroscopic task level, and sending the target granularity model control instruction to the step-by-step updating interface so that the step-by-step updating interface updates situation information corresponding to the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model respectively based on the target granularity model control instruction.
And the second control instruction transfer module is used for acquiring a target granularity model control instruction through the function level model control interface under the condition that the task level comprises a micro function level, and sending the target granularity model control instruction to the step updating interface so that the step updating interface updates situation information corresponding to the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model respectively based on the target granularity model control instruction.
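The routing performed by these two transfer modules can be summarized by a small dispatch helper such as the following; the task_level string values and the submit() method are assumptions carried over from the earlier sketches.

    from typing import Any

    def route_control_instruction(task_level: str, instruction: Any,
                                  task_interface: Any, function_interface: Any) -> Any:
        """Dispatch the agent's control instruction to the matching model control interface:
        a purely macroscopic task level goes to the task-level interface, while any task level
        that includes the microscopic function level goes to the function-level interface."""
        if task_level == "task":
            return task_interface.submit(instruction)
        return function_interface.submit(instruction)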
In one embodiment of the present application, the agent training device 200 based on the multi-granularity simulation training environment further includes:
the initialization module is used for acquiring an initialization instruction sent by the scene reset interface and initializing situation information of the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model in response to the initialization instruction.
In one embodiment of the present application, the agent training device 200 based on the multi-granularity simulation training environment further includes:
and the state display module is used for sending the current simulation state information to the visual interface so that the visual interface displays the current simulation state information in real time.
In one embodiment of the present application, the agent training device 200 based on the multi-granularity simulation training environment further includes:
and the information storage module is used for storing the current simulation state information, the target granularity model control instruction and the countermeasure result information between the target granularity attack party model and the target granularity countermeasure party model.
And the sample generation module is used for generating training samples based on the current simulation state information, the target granularity model control instruction and the countermeasure result information.
It should be noted that, for the specific implementation of the agent training device 200 based on the multi-granularity simulation training environment, reference may be made to the specific implementation of the agent training method based on the multi-granularity simulation training environment set forth in the first aspect of the embodiments of the present application, which is not repeated here.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or terminal device comprising the element.
The agent training method and device based on a multi-granularity simulation training environment provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (9)

1. An agent training method based on a multi-granularity simulation training environment, the method comprising:
acquiring a training task and an optimization target of a decision-making agent and element information of a simulation training scene; the element information comprises attack party model element information, countermeasure party model element information and simulation environment model element information; the attack party model element information is element information aiming at an unmanned aerial vehicle capturing device model; the countermeasure party model element information is element information for an unmanned aerial vehicle model;
determining a target granularity simulation training environment in a preset multi-granularity simulation training environment library based on a task level of the training task; wherein, different task levels correspond to simulation training environments with different granularities;
determining a target granularity attack model, a target granularity countermeasure model and a target granularity simulation environment model corresponding to the target granularity simulation training environment based on the element information;
controlling the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model to carry out simulation deduction so as to obtain current simulation state information and a current reward score; the current simulation state information comprises position information of an unmanned aerial vehicle capturing device and position information of an unmanned aerial vehicle;
Inputting the current simulation state information into the decision-making intelligent body so that the decision-making intelligent body can output and obtain a target granularity model control instruction based on the training task and the optimization target;
updating situation information corresponding to the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model respectively based on the target granularity model control instruction, and executing the step of controlling the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model to carry out simulation deduction until the current reward score meets a training cut-off condition, so as to complete training of the decision-making agent; the decision-making intelligent agent is used for controlling the unmanned aerial vehicle capturing device to capture the unmanned aerial vehicle;
the method further comprises the steps of:
and sending the current simulation state information to a visual interface so that the visual interface displays the current simulation state information in real time.
2. The agent training method based on the multi-granularity simulation training environment according to claim 1, wherein the multi-granularity simulation training environment library comprises a macroscopic task level simulation training environment and a microscopic function level simulation training environment; based on the task level of the training task, determining a target granularity simulation training environment in a preset multi-granularity simulation training environment library, wherein the step comprises the following steps:
Under the condition that the task level is a macroscopic task level, determining the target granularity simulation training environment as the macroscopic task level simulation training environment;
and in the case that the task level comprises a micro-function level, determining the target granularity simulation training environment as the micro-function level simulation training environment.
3. The agent training method based on the multi-granularity simulation training environment according to claim 1, wherein the step of determining a target granularity attack side model, a target granularity countermeasure side model and a target granularity simulation environment model corresponding to the target granularity simulation training environment based on the element information comprises:
determining a target granularity entity model library corresponding to the target granularity simulation training environment;
determining the target granularity attack model in the target granularity entity model library based on the attack model element information;
determining the target granularity countermeasure model in the target granularity entity model library based on the countermeasure model element information;
and determining the target granularity simulation environment model in the target granularity entity model library based on the simulation environment model element information.
4. The method for training an agent based on a multi-granularity simulation training environment according to claim 1, wherein the step of controlling the target granularity attack model, the target granularity countermeasure model, and the target granularity simulation environment model to perform simulation deduction to obtain current simulation state information and a current reward score comprises:
determining a target countermeasure rule based on the target granularity attack party model and the target granularity countermeasure party model;
based on the target countermeasure rules, controlling the target granularity attack party model and the target granularity countermeasure party model to carry out simulation deduction in the target granularity simulation environment model so as to obtain the current simulation state information;
and determining the current reward score based on the current simulation state information and a preset reward rule.
5. The method for training an agent based on a multi-granularity simulation training environment according to claim 1, wherein the step of inputting the current simulation state information into the decision agent to cause the decision agent to output a control instruction of a target granularity model based on the training task and the optimization target comprises:
Acquiring the current simulation state information through a state observation interface, and inputting the current simulation state information into the decision-making intelligent body so that the decision-making intelligent body outputs and obtains a target granularity model control instruction based on the training task and the optimization target; the target granularity model control instruction comprises a target granularity strategy control instruction and/or a target granularity action control instruction.
6. The method for training an agent based on a multi-granularity simulation training environment according to claim 5, wherein the step of updating situation information corresponding to each of the target granularity attack model, the target granularity countermeasure model, and the target granularity simulation environment model based on the target granularity model control instruction comprises:
under the condition that the task level is a macroscopic task level, acquiring the target granularity model control instruction through a task level model control interface, and sending the target granularity model control instruction to a step updating interface so that the step updating interface updates situation information corresponding to the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model respectively based on the target granularity model control instruction;
And under the condition that the task level comprises a micro function level, acquiring the target granularity model control instruction through a function level model control interface, and sending the target granularity model control instruction to a step updating interface so that the step updating interface updates situation information corresponding to the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model respectively based on the target granularity model control instruction.
7. The agent training method based on a multi-granularity simulation training environment of claim 1, further comprising:
and acquiring an initialization instruction sent by a scene reset interface, and, in response to the initialization instruction, initializing situation information of the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model.
8. The agent training method based on a multi-granularity simulation training environment of claim 1, further comprising:
storing the current simulation state information, the target granularity model control instruction and countermeasure result information between the target granularity attack party model and the target granularity countermeasure party model;
And generating a training sample based on the current simulation state information, the target granularity model control instruction and the countermeasure result information.
9. An agent training device based on a multi-granularity simulation training environment, the device comprising:
the information acquisition module is used for acquiring the training task and the optimization target of the decision-making agent and element information of the simulation training scene; the element information comprises attack party model element information, countermeasure party model element information and simulation environment model element information; the attack party model element information is element information aiming at an unmanned aerial vehicle capturing device model; the countermeasure model element information is element information for an unmanned aerial vehicle model;
the environment determining module is used for determining a target granularity simulation training environment in a preset multi-granularity simulation training environment library based on the task level of the training task; wherein, different task levels correspond to simulation training environments with different granularities;
the model determining module is used for determining a target granularity attack party model, a target granularity countermeasure party model and a target granularity simulation environment model corresponding to the target granularity simulation training environment based on the element information;
The simulation deduction module is used for controlling the target granularity attack model, the target granularity countermeasure model and the target granularity simulation environment model to carry out simulation deduction so as to obtain current simulation state information and a current reward score; the current simulation state information comprises position information of the unmanned aerial vehicle capturing device and position information of the unmanned aerial vehicle;
the instruction output module is used for inputting the current simulation state information into the decision-making intelligent body so that the decision-making intelligent body can output and obtain a target granularity model control instruction based on the training task and the optimization target;
the stepping updating module is used for updating situation information corresponding to the target granularity attack side model, the target granularity countermeasure side model and the target granularity simulation environment model respectively based on the target granularity model control instruction, and executing the step of controlling the target granularity attack side model, the target granularity countermeasure side model and the target granularity simulation environment model to carry out simulation deduction until the current reward score meets a training cut-off condition, so as to complete training of the decision agent; the decision-making intelligent agent is used for controlling the unmanned aerial vehicle capturing device to capture the unmanned aerial vehicle;
And the state display module is used for sending the current simulation state information to a visual interface so that the visual interface displays the current simulation state information in real time.
CN202311165956.2A 2023-09-11 2023-09-11 Agent training method and device based on multi-granularity simulation training environment Active CN116911202B (en)

Publications (2)

Publication Number Publication Date
CN116911202A (en) 2023-10-20
CN116911202B (en) 2023-11-17
