CN116842761B - Self-play-based blue-army agent model construction method and device - Google Patents

Self-play-based blue-army agent model construction method and device

Info

Publication number
CN116842761B
CN116842761B · CN202311106421.8A
Authority
CN
China
Prior art keywords
countermeasure
environment
strategy
state
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311106421.8A
Other languages
Chinese (zh)
Other versions
CN116842761A (en)
Inventor
任雪峰 (Ren Xuefeng)
陶添文 (Tao Tianwen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhuoyi Intelligent Technology Co Ltd
Original Assignee
Beijing Zhuoyi Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhuoyi Intelligent Technology Co Ltd
Priority to CN202311106421.8A
Publication of CN116842761A
Application granted
Publication of CN116842761B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • G06N 5/042 Backward inferencing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a self-play-based method and device for constructing a blue-army agent model. The method comprises: designing a countermeasure environment that comprises a plurality of blue-army agents, their opponents, and the interaction information between them, and determining the state space, action space, and strategy expression mode of the environment to form a blue-army agent model framework; simulating the confrontation between the blue-army agent and the opponent, acquiring and analyzing the generated state or action information, and using that information to optimize the state space and action space of the environment; analyzing the different strategies and actions of opponents during the confrontation, generating coping strategies according to current and historical confrontation experience and a decision program, and forming countermeasure data; and, based on the countermeasure data, learning and optimizing in real time with a self-play algorithm to update the strategy of the blue-army agent model. The scheme learns and optimizes strategies through confrontation, offers flexibility, adaptability, and real-time performance, and can better cope with uncertainty in the countermeasure environment and with changes in the opponent's strategy.

Description

Self-play-based blue-army agent model construction method and device
Technical Field
The application relates to the technical field of unmanned aerial vehicles, and in particular to a method and a device for constructing a blue-army agent model based on self-play.
Background
Existing blue-army agent models are generally based on fixed rules and manually designed strategies, and cannot adapt to the dynamic changes of a complex countermeasure environment; they therefore lack flexibility and adaptability when facing different opponent strategies. The strategies of opponent agents are highly uncertain, which makes it difficult for a blue-army agent model to accurately predict the opponent's behavior and respond to its strategy, affecting the model's performance and stability. In addition, existing blue-army agent models often lack the ability to learn and optimize in real time and cannot adjust their own strategies according to the actual confrontation, so they cannot continuously adapt and improve their defensive capability in a countermeasure environment. An opponent's strategy evolves over time and through learning during the confrontation; existing blue-army agent models struggle to capture such changes in time and adjust their defense strategies accordingly, which reduces defensive capability. Lacking an effective adaptive mechanism and dynamic strategy adjustment, the blue-army agent model is easily attacked or bypassed by the opponent.
Disclosure of Invention
The present application has been made in view of the above problems, and its object is to provide a self-play-based blue-army agent model construction method and device that overcome, or at least partially solve, the above problems.
According to one aspect of the application, there is provided a method for constructing a blue-army agent model based on self-play, the method comprising:
designing a countermeasure environment that comprises a plurality of blue-army agents, their opponents, and the interaction information between them, and determining the state space, action space, and strategy expression mode of the environment to form a blue-army agent model framework;
simulating the confrontation between the blue-army agent and the opponent, acquiring and analyzing the generated state or action information, and using that information to optimize the state space and action space of the environment;
analyzing the different strategies and actions of opponents during the confrontation, generating coping strategies according to current and historical confrontation experience and a decision program, and forming countermeasure data;
based on the countermeasure data, learning and optimizing in real time with a self-play algorithm to update the strategy of the blue-army agent model.
In some embodiments, the blue-army agent includes a sensor, a decision program, and an actuator, and simulating the confrontation between the blue-army agent and the opponent, acquiring and analyzing the generated state or action information, and using that information to optimize the state space and action space of the environment comprises:
sensing the environment with the sensor, selecting actions through the decision program according to the obtained environment state and the task goal, and acting on the environment with the actuator;
obtaining feedback information from the environment through perception, and updating and optimizing the state space and action space based on that feedback.
In some embodiments, generating coping strategies according to current and historical confrontation experience and a decision program comprises:
constructing a correspondence among states, actions, and rewards according to the costs and losses of the two confronting sides;
selecting different actions according to the current state, and determining the reward value of each action after it acts on the environment;
evaluating the reward values (gains or losses) of the different actions, and determining the action strategy in the current state in combination with historical confrontation experience.
In some embodiments, based on the countermeasure data, learning and optimizing in real time with the self-play algorithm to update the strategy of the blue-army agent model comprises:
detecting and perceiving the data obtained by the sensor, and determining the state information in the confrontation;
judging the state information, and forming a decision based on a deep reinforcement learning model;
generating a countermeasure action based on the decision, and evaluating the effect of the countermeasure action;
according to the countermeasure action and its influence, using the environment to form updated countermeasure-situation or game-action feedback to be perceived by the sensor.
In some embodiments, the method further comprises:
after the parameters and strategies of the blue-army agent model are optimized and adjusted, analyzing the corresponding strategy changes and evolution trends;
according to the opponent's strategy changes, further adjusting and optimizing the defense strategy of the blue-army agent model through the self-play algorithm.
In some embodiments, the method further comprises:
establishing an opponent model within the blue-army agent model by collecting information about the adversary during the confrontation;
predicting the adversary's attack mode according to the opponent model;
analyzing and learning the attack mode and its security threat level in real time, and adaptively adjusting and optimizing the defense strategy.
In some embodiments, the method further comprises:
evaluating the performance and execution effect of the blue-army agent model through repeated confrontation, or through analysis and comparison with other blue-army agent models.
According to another aspect of the present application, there is provided a self-play-based blue-army agent model construction device, the device comprising:
a model design module adapted to design a countermeasure environment, where the environment comprises a plurality of blue-army agents, their opponents, and the interaction information between them, and to determine the state space, action space, and strategy expression mode of the environment to form a blue-army agent model framework;
a simulated confrontation module adapted to simulate the confrontation between the blue-army agent and the opponent, acquire and analyze the generated state or action information, and use that information to optimize the state space and action space of the environment;
a strategy generation module adapted to analyze the different strategies and actions of opponents during the confrontation, generate coping strategies according to current and historical confrontation experience and a decision program, and form countermeasure data;
a strategy updating module adapted to learn and optimize in real time with a self-play algorithm based on the countermeasure data, and to update the strategy of the blue-army agent model.
According to still another aspect of the present application, there is provided an electronic device comprising: a processor and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the self-play-based blue-army agent model construction method according to any of the above embodiments.
According to yet another aspect of the present application, there is provided a computer-readable storage medium storing one or more programs which, when executed by a processor, implement the self-play-based blue-army agent model construction method according to any of the above embodiments.
According to the technical scheme disclosed by the application, a self-play algorithm is used to play against oneself, realizing the learning and optimization of confrontation strategies with flexibility, adaptability, and real-time performance. The scheme can better cope with uncertainty in the countermeasure environment and with changes in the opponent's strategy, so the blue-army agent model can continuously learn from simulated confrontation and optimize its own defensive capability, improving security and reducing the risk of attack by the opponent.
The foregoing is only an overview of the technical solutions of the present application. To make the technical means of the present application clearer and implementable in accordance with this specification, and to make its above and other objects, features, and advantages more readily apparent, specific embodiments of the present application are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 illustrates a flow diagram of a self-play-based blue-army agent model construction method according to one embodiment of the present application;
FIG. 2 shows a schematic structural diagram of a blue-army agent interacting with the environment according to one embodiment of the present application;
FIG. 3 illustrates a flow diagram of blue-army agent strategy generation according to one embodiment of the application;
FIG. 4 illustrates a flow diagram of model optimization with the self-play algorithm according to one embodiment of the application;
FIG. 5 shows a schematic structural diagram of a self-play-based blue-army agent model construction device according to one embodiment of the present application;
fig. 6 shows a schematic structural diagram of an electronic device according to an embodiment of the application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the application to those skilled in the art.
Terminology:
Self-play refers to a method of training and optimizing algorithms in the field of artificial intelligence, particularly in reinforcement learning. In self-play, a computer program or agent gradually learns and improves its performance by playing against itself, rather than relying solely on data or prior knowledge provided by human experts.
An agent is an intelligent system built on the cloud with AI at its core, combining three-dimensional perception, global collaboration, accurate judgment, continuous evolution, and openness. An agent resides in an environment, can act continuously and autonomously, and exhibits characteristics such as persistence, reactivity, sociability, and initiative. Unmanned aerial vehicles, combat robots, and the like are typically agents.
A blue force refers to a force that exclusively plays the hypothetical enemy in simulated confrontation exercises. It can simulate the combat characteristics of any army in the world, so that targeted training can be conducted against a blue force representing an opposing or imagined enemy army.
Deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning; it can exercise control directly from input images and is an artificial intelligence method closer to the human way of thinking. Deep learning has strong perception but lacks decision-making capability, while reinforcement learning can make decisions but is ill-suited to perception problems. Combining the two makes their advantages complementary and offers a solution to the perception-decision problem of complex systems. For example, this may be achieved by Monte Carlo tree search (MCTS) combined with a neural network model.
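As a concrete illustration of this combination, the following is a minimal sketch of a policy-value network of the kind typically paired with MCTS. It is an assumption-laden example, not an implementation from the patent: the state dimension, action count, and layer sizes are all illustrative.

```python
# Minimal policy-value network sketch (illustrative only; dimensions are
# assumptions, not values from the patent). Paired with MCTS, the policy
# head biases the tree search toward promising actions, and the value head
# evaluates leaf states in place of random rollouts.
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, state_dim: int = 64, num_actions: int = 16):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.policy_head = nn.Linear(128, num_actions)  # action log-probabilities
        self.value_head = nn.Linear(128, 1)             # expected outcome in [-1, 1]

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)
        return torch.log_softmax(self.policy_head(h), dim=-1), torch.tanh(self.value_head(h))
```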
FIG. 1 shows a flow diagram of a self-play-based blue-army agent model construction method according to one embodiment of the application. The method may be implemented by an electronic device, including the blue-army agent itself (such as an unmanned aerial vehicle or a combat robot), a computer, a laptop, and the like. The method comprises the following steps:
Step S110: design a countermeasure environment, where the environment comprises a plurality of blue-army agents, their opponents, and the interaction information between them, and determine the state space, action space, and strategy expression mode of the environment to form a blue-army agent model framework.
Step S120: simulate the confrontation between the blue-army agent and the opponent, acquire and analyze the generated state or action information, and use that information to optimize the state space and action space of the environment.
The opponent here is the blue-army agent itself, so the agent can play against itself.
Step S130: analyze the different strategies and actions of opponents during the confrontation, generate coping strategies according to current and historical confrontation experience and a decision program, and form countermeasure data.
Step S140: based on the countermeasure data, learn and optimize in real time with a self-play algorithm and update the strategy of the blue-army agent model.
Self-play algorithms have shown strong capability on game problems, and this step learns and optimizes strategies through self-play. Optionally, the deep reinforcement learning method employed may be Monte Carlo tree search (MCTS) combined with a neural network model.
Specifically, the blue-army agent model can be designed to play multiple rounds of self-play against itself, collecting data by simulating the different strategies and behaviors of opponents. With the self-play algorithm, the model continuously learns and optimizes its own strategy during the confrontation, and its strategy generation algorithm can be designed to produce decisions based on the current state and historical confrontation experience. In the data processing stage, data analysis and pattern recognition are needed to explore the opponent's strategies and behavior patterns; the opponent's behavior is countered with real-time learning and optimization, real-time feedback from the countermeasure environment is obtained, and the strategy of the blue-army agent model is then updated with the self-play algorithm according to the opponent's behavior and the feedback information. A minimal sketch of this loop is given after this paragraph.
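Under the assumption of a gym-style environment and an agent object with act/update methods (interfaces the patent does not fix), the multi-round self-play loop described above might be sketched as follows:

```python
# Hypothetical self-play loop (a sketch of the process described above, not
# the patent's implementation). `env` stands for the countermeasure
# environment and `agent` for the blue-army agent model; both sides of each
# round are driven by the same model, so the agent plays against itself.
def self_play_training(env, agent, num_rounds: int = 1000):
    for _ in range(num_rounds):
        state = env.reset()
        episode = []                                        # (state, action, reward) records
        done = False
        while not done:
            blue_action = agent.act(state)                  # blue side: current strategy
            red_action = agent.act(state, as_opponent=True) # simulated opponent: same model
            state, reward, done = env.step(blue_action, red_action)
            episode.append((state, blue_action, reward))
        agent.update(episode)                               # learn and optimize from this round
```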
Through self-play realized by the self-play algorithm, this embodiment learns and optimizes strategies with flexibility, adaptability, and real-time performance, and can better cope with uncertainty in the countermeasure environment and with changes in the opponent's strategy. The blue-army agent model can therefore continuously learn from simulated confrontation, optimize its own defensive capability, improve security, and reduce the risk of attack by the opponent.
In some embodiments, referring to FIG. 2, the blue-army agent includes a sensor, a decision program, and an actuator. Simulating the confrontation between the blue-army agent and the opponent in step S120, acquiring and analyzing the generated state or action information, and using that information to optimize the state space and action space of the environment then comprises the following steps (sketched in code after the list):
sensing the environment with the sensor, selecting actions through the decision program according to the obtained environment state and the task goal, and acting on the environment with the actuator;
obtaining feedback information from the environment through perception, and updating and optimizing the state space and action space based on that feedback.
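A minimal sketch of this sense-decide-act-feedback cycle is given below; the Sensor, DecisionProgram, and Actuator interfaces are hypothetical stand-ins for the three components named above.

```python
# Illustrative perception-decision-actuation cycle for one blue-army agent.
# All interfaces here are assumptions mirroring the components in FIG. 2.
def interaction_step(sensor, decision_program, actuator, env, task_goal):
    observation = sensor.perceive(env)                        # sense the environment state
    action = decision_program.select(observation, task_goal)  # choose an action
    feedback = actuator.apply(action, env)                    # act on the environment
    decision_program.refine_spaces(observation, action, feedback)  # update state/action spaces
    return feedback
```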
In some embodiments, as shown in FIG. 3, generating coping strategies according to current and historical confrontation experience and a decision program in step S130 comprises the following steps (see the sketch after the list):
constructing a correspondence among states, actions, and rewards according to the costs and losses of the two confronting sides;
selecting different actions according to the current state, and determining the reward value of each action after it acts on the environment;
evaluating the reward values (gains or losses) of the different actions, and determining the action strategy in the current state in combination with historical confrontation experience.
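For illustration, the state-action-reward correspondence and the history-aware choice can be realized with a table and a weighted score, as in the sketch below; the table form and the weighting are assumptions, not the patent's prescribed algorithm.

```python
# Illustrative reward-based action selection. Q holds the state-action-reward
# correspondence built from the costs and losses of both sides; `history`
# supplies a bonus from historical confrontation experience. All structures
# and the weight are hypothetical.
from collections import defaultdict

Q = defaultdict(float)   # (state, action) -> reward value after acting on the environment
history = {}             # (state, action) -> historical experience score

def choose_action(state, actions, history_weight: float = 0.1):
    def score(action):
        return Q[(state, action)] + history_weight * history.get((state, action), 0.0)
    return max(actions, key=score)   # pick the action with the best combined value
```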
In some embodiments, referring to FIG. 4, based on the countermeasure data, learning and optimizing in real time with the self-play algorithm and updating the strategy of the blue-army agent model in step S140 comprises:
detecting and perceiving the data obtained by the sensor, and determining the state information in the confrontation;
judging the state information, and forming a decision based on a deep reinforcement learning model;
generating a countermeasure action based on the decision, and evaluating the effect of the countermeasure action;
according to the countermeasure action and its influence, using the environment corresponding to the self-play task to form updated countermeasure-situation or game-action feedback to be perceived by the sensor.
In some embodiments, the method further comprises:
after the parameters and strategies of the blue-army agent model are optimized and adjusted, analyzing the corresponding strategy changes and evolution trends;
according to the opponent's strategy changes, further adjusting and optimizing the defense strategy of the blue-army agent model through the self-play algorithm.
In some embodiments, the method further comprises:
establishing an opponent model within the blue-army agent model by collecting information about the adversary during the confrontation;
predicting the adversary's attack mode according to the opponent model;
analyzing and learning the attack mode and its security threat level in real time, and adaptively adjusting and optimizing the defense strategy. A minimal sketch of such an opponent model follows.
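An opponent model of the kind described can be sketched, for illustration, as a simple frequency model that predicts the most likely attack mode per state; a deep model could take its place, and nothing here is prescribed by the patent.

```python
# Illustrative opponent model: records observed adversary actions per state
# and predicts the most frequent one as the expected attack mode. The
# structure is a minimal assumption, not the patent's design.
from collections import Counter, defaultdict

class OpponentModel:
    def __init__(self):
        self.observations = defaultdict(Counter)   # state -> Counter of adversary actions

    def record(self, state, adversary_action):
        self.observations[state][adversary_action] += 1

    def predict_attack(self, state):
        seen = self.observations[state]
        return seen.most_common(1)[0][0] if seen else None
```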
In some specific embodiments, the parameters and strategies of the blue-army agent model can be optimally adjusted with an adaptive learning algorithm based on the opponent's strategies and behavior patterns, and the changes and evolution trends of the opponent's strategies are then analyzed. According to the changes in the opponent's strategy, the defense strategy of the blue-army agent model is adjusted through the self-play algorithm to adapt to the continuously changing countermeasure environment.
In some embodiments, the method further comprises:
evaluating the performance and execution effect of the blue-army agent model through multiple confrontations, or through analysis and comparison with other blue-army agent models.
Specifically, the parameters and strategies of the blue-army agent model can be optimized and adjusted with an adaptive learning algorithm based on the opponent's strategies and behavior patterns; the changes and evolution trends of the opponent's strategies are then analyzed and simulated, and the defense strategy of the blue-army agent model is adjusted with the self-play algorithm according to those changes to adapt to the continuously changing countermeasure environment.
In summary, the above embodiments of the present application can achieve the following technical effects:
1. Generating data by self-play: a series of play data is generated by letting the blue-army agent model play against itself. In each round of play, the model acts as the blue side according to its current strategy while simultaneously simulating the opponent's actions, producing a series of play states and corresponding action sequences.
2. Strategy evaluation and updating: based on the data generated by self-play, strategies are evaluated and updated with a reinforcement learning algorithm (a minimal sketch follows this list). Through continuous iteration, strategy optimization, and a value network, the blue-army agent model can gradually learn better defense strategies.
3. Opponent modeling and prediction: by observing the opponent's actions during self-play, the blue-army agent model can build an opponent model and make predictions. A deep learning model may be used to model the opponent's behavior and predict its strategies and likely actions. The model can thus adjust its own defense strategy according to the opponent's expected behavior, improving its responsiveness.
4. Real-time learning and adaptation: the blue-army agent model can learn and adapt in real time in an actual countermeasure environment. New play data is continuously obtained by competing with opponents, and the strategy and the opponent model are updated. This capability allows the model to cope with the changes and evolution of opponent strategies and to continuously improve its defensive capability.
5. Security threat identification and handling: through the self-play algorithm, the blue-army agent model can better identify and respond to security threats. By playing against opponents, the model can observe their attack behavior and identify security threats; when a threat is identified, it can adjust its defense strategy in time, relying on its real-time learning and adaptation capability, thereby enhancing its own security.
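As a concrete illustration of effect 2, a minimal REINFORCE-style update over one self-play episode might look as follows; the episode format matches the earlier loop sketch, states are assumed to be tensors, and the policy network is of the shape sketched earlier. This is an assumption-based sketch, not the patent's algorithm.

```python
# Illustrative REINFORCE-style policy update over one self-play episode
# (hypothetical; the patent does not prescribe this particular algorithm).
# `policy_net` maps a state tensor to (action log-probabilities, value).
import torch

def policy_update(policy_net, optimizer, episode, gamma: float = 0.99):
    states, actions, rewards = zip(*episode)
    returns, g = [], 0.0
    for r in reversed(rewards):              # discounted return, computed backward
        g = r + gamma * g
        returns.insert(0, g)
    loss = torch.tensor(0.0)
    for s, a, g in zip(states, actions, returns):
        log_probs, _ = policy_net(s)
        loss = loss - log_probs[a] * g       # gradient ascent on expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```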
According to another aspect of the present application, and referring to FIG. 5, there is provided a self-play-based blue-army agent model construction device. The device 500 comprises:
a model design module 510 adapted to design a countermeasure environment, where the environment comprises a plurality of blue-army agents, their opponents, and the interaction information between them, and to determine the state space, action space, and strategy expression mode of the environment to form a blue-army agent model framework;
a simulated confrontation module 520 adapted to simulate the confrontation between the blue-army agent and the opponent, acquire and analyze the generated state or action information, and use that information to optimize the state space and action space of the environment;
a strategy generation module 530 adapted to analyze the different strategies and actions of opponents during the confrontation, generate coping strategies according to current and historical confrontation experience and a decision program, and form countermeasure data;
a strategy updating module 540 adapted to learn and optimize in real time with a self-play algorithm based on the countermeasure data, and to update the strategy of the blue-army agent model.
In some embodiments, the blue-army agent comprises a sensor, a decision program, and an actuator; the simulated confrontation module 520 is then further adapted to:
sense the environment with the sensor, select actions through the decision program according to the obtained environment state and the task goal, and act on the environment with the actuator;
obtain feedback information from the environment through perception, and update and optimize the state space and action space based on that feedback.
In some embodiments, the strategy generation module 530 is further adapted to:
construct a correspondence among states, actions, and rewards according to the costs and losses of the two confronting sides;
select different actions according to the current state, and determine the reward value of each action after it acts on the environment;
evaluate the reward values (gains or losses) of the different actions, and determine the action strategy in the current state in combination with historical confrontation experience.
In some embodiments, the strategy updating module 540 is further adapted to:
detect and perceive the data obtained by the sensor, and determine the state information in the confrontation;
judge the state information, and form a decision based on a deep reinforcement learning model;
generate a countermeasure action based on the decision, and evaluate the effect of the countermeasure action;
according to the countermeasure action and its influence, use the environment to form updated countermeasure-situation or game-action feedback to be perceived by the sensor.
In some embodiments, the apparatus 500 is further adapted to:
after the parameters and strategies of the blue-army agent model are optimized and adjusted, analyze the corresponding strategy changes and evolution trends;
according to the opponent's strategy changes, further adjust and optimize the defense strategy of the blue-army agent model through the self-play algorithm.
In some embodiments, the apparatus 500 is further adapted to:
establish an opponent model within the blue-army agent model by collecting information about the adversary during the confrontation;
predict the adversary's attack mode according to the opponent model;
analyze and learn the attack mode and its security threat level in real time, and adaptively adjust and optimize the defense strategy.
In some embodiments, the apparatus 500 is further adapted to:
evaluate the performance and execution effect of the blue-army agent model through repeated confrontation, or through analysis and comparison with other blue-army agent models.
It should be noted that the specific implementation of each device embodiment may refer to the specific implementation of the corresponding method embodiment, and is not repeated here.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may also be used with the teachings herein. The required structure for the construction of such devices is apparent from the description above. In addition, the present application is not directed to any particular programming language. It will be appreciated that the teachings of the present application described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments.
Various component embodiments of the application may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of a self-play-based blue-army agent model construction device according to an embodiment of the present application. The present application can also be implemented as an apparatus or device program (e.g., a computer program and a computer program product) for performing part or all of the methods described herein. Such a program embodying the present application may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
An embodiment of the application provides a non-volatile computer storage medium storing at least one executable instruction that can perform the self-play-based blue-army agent model construction method in any of the above method embodiments.
FIG. 6 shows a schematic structural diagram of an electronic device according to an embodiment of the present application; the specific implementation of the electronic device is not limited by the embodiments of the present application.
As shown in FIG. 6, the electronic device may include: a processor 602, a communication interface 604, a memory 606, and a communication bus 608.
The processor 602, the communication interface 604, and the memory 606 communicate with each other via the communication bus 608. The communication interface 604 is used to communicate with network elements of other devices, such as clients or other servers. The processor 602 is configured to execute a program 610, and may specifically perform the relevant steps of the above embodiments of the self-play-based blue-army agent model construction method.
In particular, the program 610 may include program code comprising computer operation instructions.
The processor 602 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application. The one or more processors included in the electronic device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 606 is used to store the program 610. The memory 606 may comprise high-speed RAM memory and may further comprise non-volatile memory, such as at least one disk memory.
The program 610 may be specifically configured to cause the processor 602 to perform operations corresponding to the above embodiments of the self-play-based blue-army agent model construction method.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc., does not denote any order; these words may be interpreted as names.

Claims (7)

1. A method for constructing a blue-army agent model based on self-play, the method comprising:
designing a countermeasure environment that comprises a plurality of blue-army agents, their opponents, and the interaction information between them, and determining the state space, action space, and strategy expression mode of the environment to form a blue-army agent model framework;
simulating the confrontation between the blue-army agent and the opponent, acquiring and analyzing the generated state or action information, and using that information to optimize the state space and action space of the environment;
analyzing the different strategies and actions of opponents during the confrontation, generating coping strategies according to current and historical confrontation experience and a decision program, and forming countermeasure data;
based on the countermeasure data, learning and optimizing in real time with a self-play algorithm, and updating the strategy of the blue-army agent model;
wherein the blue-army agent comprises a sensor, a decision program, and an actuator, and simulating the confrontation between the blue-army agent and the opponent, acquiring and analyzing the generated state or action information, and using that information to optimize the state space and action space of the environment comprises:
sensing the environment with the sensor, selecting actions through the decision program according to the obtained environment state and the task goal, and acting on the environment with the actuator;
obtaining feedback information from the environment through perception, and updating and optimizing the state space and action space based on that feedback;
wherein generating coping strategies according to current and historical confrontation experience and a decision program comprises:
constructing a correspondence among states, actions, and rewards according to the costs and losses of the two confronting sides;
selecting different actions according to the current state, and determining the reward value of each action after it acts on the environment;
evaluating the reward values (gains or losses) of the different actions, and determining the action strategy in the current state in combination with historical confrontation experience;
and wherein the method further comprises:
establishing an opponent model within the blue-army agent model by collecting information about the adversary during the confrontation;
predicting the adversary's attack mode according to the opponent model;
analyzing and learning the attack mode and its security threat level in real time, and adaptively adjusting and optimizing the defense strategy.
2. The method of claim 1, wherein, based on the countermeasure data, learning and optimizing in real time with the self-play algorithm and updating the strategy of the blue-army agent model comprises:
detecting and perceiving the data obtained by the sensor, and determining the state information in the confrontation;
judging the state information, and forming a decision based on a deep reinforcement learning model;
generating a countermeasure action based on the decision, and evaluating the effect of the countermeasure action;
according to the countermeasure action and its influence, using the environment to form updated countermeasure-situation or game-action feedback to be perceived by the sensor.
3. The method according to claim 1, wherein the method further comprises:
after the parameters and strategies of the blue-army agent model are optimized and adjusted, analyzing the corresponding strategy changes and evolution trends;
according to the opponent's strategy changes, further adjusting and optimizing the defense strategy of the blue-army agent model through the self-play algorithm.
4. A method according to any one of claims 1-3, characterized in that the method further comprises:
evaluating the performance and execution effect of the blue-army agent model through repeated confrontation, or through analysis and comparison with other blue-army agent models.
5. A self-play-based blue-army agent model construction device, the device comprising:
a model design module adapted to design a countermeasure environment, wherein the environment comprises a plurality of blue-army agents, their opponents, and the interaction information between them, and to determine the state space, action space, and strategy expression mode of the environment to form a blue-army agent model framework;
a simulated confrontation module adapted to simulate the confrontation between the blue-army agent and the opponent, acquire and analyze the generated state or action information, and use that information to optimize the state space and action space of the environment;
a strategy generation module adapted to analyze the different strategies and actions of opponents during the confrontation, generate coping strategies according to current and historical confrontation experience and a decision program, and form countermeasure data;
a strategy updating module adapted to learn and optimize in real time with a self-play algorithm based on the countermeasure data, and to update the strategy of the blue-army agent model;
wherein the blue-army agent comprises a sensor, a decision program, and an actuator, and the simulated confrontation module is further adapted to:
sense the environment with the sensor, select actions through the decision program according to the obtained environment state and the task goal, and act on the environment with the actuator;
obtain feedback information from the environment through perception, and update and optimize the state space and action space based on that feedback;
wherein the strategy generation module is further adapted to:
construct a correspondence among states, actions, and rewards according to the costs and losses of the two confronting sides;
select different actions according to the current state, and determine the reward value of each action after it acts on the environment;
evaluate the reward values (gains or losses) of the different actions, and determine the action strategy in the current state in combination with historical confrontation experience;
and wherein the device is further adapted to:
establish an opponent model within the blue-army agent model by collecting information about the adversary during the confrontation;
predict the adversary's attack mode according to the opponent model;
analyze and learn the attack mode and its security threat level in real time, and adaptively adjust and optimize the defense strategy.
6. An electronic device comprising a processor and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the self-play-based blue-army agent model construction method of any one of claims 1-4.
7. A computer-readable storage medium storing one or more programs which, when executed by a processor, implement the self-play-based blue-army agent model construction method of any one of claims 1-4.
CN202311106421.8A 2023-08-30 2023-08-30 Self-play-based blue-army agent model construction method and device Active CN116842761B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311106421.8A CN116842761B (en) 2023-08-30 2023-08-30 Self-play-based blue-army agent model construction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311106421.8A CN116842761B (en) 2023-08-30 2023-08-30 Self-play-based blue-army agent model construction method and device

Publications (2)

Publication Number Publication Date
CN116842761A CN116842761A (en) 2023-10-03
CN116842761B (en) 2023-11-24

Family

ID=88172821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311106421.8A Active CN116842761B (en) 2023-08-30 2023-08-30 Self-play-based blue-army agent model construction method and device

Country Status (1)

Country Link
CN (1) CN116842761B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3370219A1 (en) * 2017-03-03 2018-09-05 MBDA France Method and device for predicting optimal attack and defence solutions in the scenario of a military conflict
CN112329348A (en) * 2020-11-06 2021-02-05 东北大学 Intelligent decision-making method for military countermeasure game under incomplete information condition
CN114330651A (en) * 2021-12-14 2022-04-12 中国运载火箭技术研究院 Layered multi-agent reinforcement learning method oriented to multi-element joint instruction control
CN116341924A (en) * 2023-02-22 2023-06-27 北京机械设备研究所 Multitasking intelligent decision computing method and system based on meta learning and game theory

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3370219A1 (en) * 2017-03-03 2018-09-05 MBDA France Method and device for predicting optimal attack and defence solutions in the scenario of a military conflict
CN112329348A (en) * 2020-11-06 2021-02-05 东北大学 Intelligent decision-making method for military countermeasure game under incomplete information condition
CN114330651A (en) * 2021-12-14 2022-04-12 中国运载火箭技术研究院 Layered multi-agent reinforcement learning method oriented to multi-element joint instruction control
CN116341924A (en) * 2023-02-22 2023-06-27 北京机械设备研究所 Multitasking intelligent decision computing method and system based on meta learning and game theory

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A MADDPG-based multi-agent antagonistic algorithm for sea battlefield confrontation; Wei Chen et al.; Multimedia Systems, vol. 29; full text *
Development status of civil-military integration and innovation in world network information technology; Zhang Xinzheng; Small Arms (轻兵器), No. 05; full text *
Military intelligent decision support system based on deep learning; Zhang Xiaohai; Cao Xinwen; Command Control & Simulation, No. 02; full text *

Also Published As

Publication number Publication date
CN116842761A (en) 2023-10-03

Similar Documents

Publication Publication Date Title
Greydanus et al. Visualizing and understanding atari agents
Robertson et al. A review of real-time strategy game AI
KR20210028728A (en) Method, apparatus, and device for scheduling virtual objects in a virtual environment
CN105637540A (en) Methods and apparatus for reinforcement learning
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
Cadena et al. Fuzzy case-based reasoning for managing strategic and tactical reasoning in starcraft
CN114330651A (en) Layered multi-agent reinforcement learning method oriented to multi-element joint instruction control
CN113627596A (en) Multi-agent confrontation method and system based on dynamic graph neural network
CN112434791A (en) Multi-agent strong countermeasure simulation method and device and electronic equipment
CN113962390A (en) Method for constructing diversified search strategy model based on deep reinforcement learning network
US9336481B1 (en) Organically instinct-driven simulation system and method
CN116842761B (en) Self-game-based blue army intelligent body model construction method and device
Hou et al. Advances in memetic automaton: Toward human-like autonomous agents in complex multi-agent learning problems
Montana et al. Towards a unified framework for learning from observation
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
Masek et al. A genetic programming framework for novel behaviour discovery in air combat scenarios
Kuo et al. Applying hybrid learning approach to RoboCup's strategy
Freed et al. Towards more human-like computer opponents
WO2023038605A1 (en) Autonomous virtual entities continuously learning from experience
CN114404976A (en) Method and device for training decision model, computer equipment and storage medium
Khan et al. Playing doom with anticipator-A3C based agents using deep reinforcement learning and the ViZDoom game-AI research platform
CN114254722B (en) Multi-intelligent-model fusion method for game confrontation
Pentecost et al. Using a physics engine in ACT-R to aid decision making
Pezzulo et al. Toward a perceptual symbol system
CN110515297B (en) Staged motion control method based on redundant musculoskeletal system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant