CN112560332A - Aviation soldier system intelligent behavior modeling method based on global situation information


Publication number
CN112560332A
Authority
CN
China
Prior art keywords
situation, color, combat, state, behavior
Prior art date
Legal status
Granted
Application number
CN202011375776.3A
Other languages
Chinese (zh)
Other versions
CN112560332B (en)
Inventor
李妮
董力维
王泽
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202011375776.3A
Publication of CN112560332A
Application granted
Publication of CN112560332B
Status: Active
Anticipated expiration


Classifications

    • G06F30/27 — Computer-aided design [CAD]: design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06F18/23213 — Pattern recognition: non-hierarchical clustering using statistics or function optimisation, e.g. modelling of probability density functions, with a fixed number of clusters, e.g. K-means clustering
    • G06N3/045 — Neural networks: combinations of networks
    • G06N3/08 — Neural networks: learning methods
    • G06V10/56 — Image or video recognition: extraction of image or video features relating to colour
    • G06F2111/08 — Details relating to CAD techniques: probabilistic or stochastic CAD


Abstract

The invention discloses an intelligent behavior modeling method for an aviation soldier system based on global situation information. By comprehensively studying and judging the complex global aerial battlefield situation, the global situation is expressed mathematically as a state vector. A situation feature extraction and perception algorithm based on a two-dimensional GIS situation map obtains the element information that cannot be read directly from the state vector, yielding the global situation state space perceived by the aviation intelligent behavior model. A reward value generation algorithm based on network connected-domain maximization drives the aviation intelligent behavior model to evolve iteratively toward high return under the excitation of an incomplete global situation. The technical scheme of the invention provides an effective theoretical basis and technical support for gaining greater combat information advantage under incomplete battlefield situation perception, generating efficient air combat command and control decisions, analyzing, deducing and replaying air combat schemes, and improving the combat level of the aviation soldier system.

Description

Aviation soldier system intelligent behavior modeling method based on global situation information
Technical Field
The invention belongs to the technical field of operational situation analysis and aviation soldier modeling and simulation, and particularly relates to an aviation soldier system intelligent behavior modeling method based on global situation information.
Background
In recent years, China has been undergoing a new military transformation, and the security situation it faces is increasingly complex. Aviation forces are characterized by rapid application of combat power, a multidimensional battlefield space, rapidly changing combat situations and diverse combat modes, and are an important force for safeguarding national security.
The aviation soldier force system is a typical complex system: the uncertainty of the combat situation, the complexity of weapons and equipment, and the sheer volume of combat tasks pose enormous challenges for combat simulation research, and confrontation simulation research on aviation force systems is still far from mature. As the pace and complexity of modern system-level warfare keep increasing, human decision-making alone can hardly keep up with the rapidly changing combat situation of an aviation force system. Future aviation system combat requires fast, automated and autonomous decisions, and intelligent technology is urgently needed to extend the human brain and improve the capability of command information systems, so as to adapt to a high-speed, complex and changeable battlefield environment.
Current force-system confrontation is mainly network-centric warfare, which ultimately turns information advantage into command-and-control decision advantage and, through it, into combat-power advantage. Network-centric warfare acquires, fuses, transmits and processes battlefield environment information of all kinds through detection and information networks composed of combat entities distributed across the battlefield, forming a battlefield perception situation that is rapidly shared and presented to the command-and-control center as the basis for generating command-and-control decisions.
Force confrontation technology has therefore become a key point of system-level confrontation simulation research, responding to the command decision environment of joint operations at all levels in current informatized combat. Aviation formation decision models based on rule sets suffer from high knowledge-acquisition cost, poor adaptability to changing decision environments, poor reconfigurability, heavy modeling workload and complex maintenance.
With breakthroughs in artificial intelligence in perception and cognition, machine learning techniques represented by deep learning and reinforcement learning have advanced behavior modeling and make it possible to break through the bottleneck of confrontation behavior decision-making for aviation force systems. The core idea of reinforcement-learning behavior modeling is to endow an agent with a reinforcement learning algorithm; the situation environment is the carrier of confrontation and learning and the object the reinforcement learning agent interacts with, and it becomes the core element of the whole closed loop of reinforcement-learning-based behavior modeling. It is therefore important to address two issues centered on the situation: first, selecting a suitable state vector space to reasonably express the complex aerial battlefield situation so that the aviation force behavior model can perceive the situation environment effectively; second, generating effective continuous rewards for the confrontation decision behaviors of the aviation force under the excitation of incomplete global situation environment information. At present, the theoretical research and technical means needed to solve these two problems for concrete system-confrontation simulation scenarios — effective expression and perception of an incomplete confrontation situation, and reward generation for the aviation intelligent agent under incomplete-situation excitation — still require deep exploration and careful design.
In the field of reinforcement-learning intelligent behavior modeling, work abroad started earlier, has undergone decades of theoretical development and application exploration, and has formed systematic research methods and a broad application base. The wave of artificial intelligence driven by reinforcement learning and deep learning has also developed in China, but reinforcement-learning behavior modeling research aimed at global situation information is still scarce and basically at an early stage. Studying an intelligent behavior modeling method for the aviation soldier system based on global situation information, and forming a reinforcement-learning behavior modeling framework that is general and practical for aerial force combat, is therefore an important subject for breaking through intelligent decision-making and improving combat capability in modern air combat from the artificial intelligence level.
At present, China lacks research and methods for multi-formation aviation force system combat confrontation based on reinforcement learning. Most research is carried out by scientific research institutions and lacks original algorithm theory and deep application expansion, and most of it targets modeling problems with small, well-defined state spaces, such as tactical-level two-aircraft close-range air combat.
Disclosure of Invention
Aiming at the problems in current aviation soldier force-system combat simulation — complex models, complex operation rules, complex formal expression of the red and blue situations, and difficulty in generating effective confrontation decisions caused by the strong uncertainty of force configurations and task flows — the invention studies intelligent behavior modeling of the force system under a global situation and proposes an aviation soldier system intelligent behavior modeling method based on global situation information. The specific technical scheme of the invention is as follows:
An aviation soldier system intelligent behavior modeling method based on global situation information comprises the following steps:
S1: according to the air combat characteristics of the aviation force and the importance of the factors that influence the air combat outcome, extract key elements to construct an environment state space vector that effectively represents the aviation battlefield situation; select the own-side fire network connected-domain ratio A_b, the own-side information network connected-domain ratio I_b, the enemy fire network connected-domain ratio A_r, the enemy information network connected-domain ratio I_r, the remaining weapon ammunition percentage ξ and the aviation formation battle-loss ratio ε to form the environment state space vector S = ⟨A_b, I_b, A_r, I_r, ξ, ε⟩ describing the aviation battlefield situation (an illustrative sketch of this vector follows the step overview below);
S2: use the situation feature extraction and perception algorithm based on the two-dimensional GIS situation map to acquire the environment situation information A_b, I_b, A_r and I_r that cannot be obtained directly for the environment state space vector. On the two-dimensional GIS situation map with graphic features — in which the own-side information detection range, the own-side fire strike range and the detected enemy force entity positions are each drawn in clearly distinguishable colors — perform image feature extraction to obtain monochromatic feature layers of the own-side and enemy information network connected domains and fire network connected domains, so that the reinforcement learning agent can perceive the situation environment information;
in the image feature extraction part, a color-based feature extraction method extracts the image color features contained in the two-dimensional GIS situation map and zeroes the color values of non-feature pixels, obtaining the information network connected domain and the fire network connected domain formed by the own-side aviation combat entities as two monochromatic feature layers reflecting the own-side (intelligent blue) combat situation; enemy combat entities are located by the same feature extraction method, the corresponding entity information is retrieved from the own weapon equipment rule base, and the information network connected domain and fire network connected domain formed by the enemy (intelligent red) combat entities are simulated, generating two monochromatic feature layers reflecting the enemy combat situation;
S3: design the combat behavior space of the aviation soldier system; divide the executable task sets of the aircraft formations according to the combat characteristics of the force formations in the aviation soldier system, and integrate the executable tasks of all aircraft formations to form the combat behavior space of the aviation soldier system;
S4: generate an effective real-time reward mechanism with a reward value generation algorithm based on network connected-domain maximization;
perform color histogram statistics on the feature layers obtained in step S2, calculate the proportions of colored pixels of the information network connected domain and the fire network connected domain in each monochromatic feature layer to obtain numerical parameters characterizing the own-side and enemy situation features, obtain quantitative combat-advantage parameters of both sides through weighted combination, and design a reward function based on the comparison of combat advantages; this gives clearly positive or negative real-time reward feedback for every behavior decision of the agent, and the reward mechanism drives the agent toward continuously optimized behavior decisions;
S5: construct the state transition model and design the action selection strategy;
the transition of the aviation air-combat situation conforms to a first-order Markov decision process, i.e. the state transition probability depends only on the current state; the behavior selection strategy is designed as a greedy-random algorithm that selects the behavior with the greatest utility in a given state while adding randomness to behavior selection, allowing the aviation formation to "explore" the state space;
S6: based on the temporal-difference algorithm, fuse the aviation state space vector, the state transition model and the action selection strategy formed in steps S1-S5 with the reward value generation algorithm based on network connected-domain maximization into an improved reinforcement learning framework for aviation combat confrontation, and carry out iterative learning and training of the force Agent on this framework.
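For illustration only, the six-element environment state space vector of step S1 might be held in a small structure such as the following sketch; the field names and the NumPy representation are assumptions, not part of the patent text:

    # Illustrative sketch (not from the patent): the six-element state vector of step S1.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class SituationState:
        A_b: float   # own-side fire network connected-domain ratio
        I_b: float   # own-side information network connected-domain ratio
        A_r: float   # enemy fire network connected-domain ratio
        I_r: float   # enemy information network connected-domain ratio
        xi: float    # remaining weapon ammunition percentage
        eps: float   # formation battle-loss ratio

        def as_vector(self) -> np.ndarray:
            """Return S = <A_b, I_b, A_r, I_r, xi, eps> as a 6-dimensional vector."""
            return np.array([self.A_b, self.I_b, self.A_r, self.I_r, self.xi, self.eps])

    # Example: a state in which the own side holds a slight network advantage.
    s = SituationState(A_b=0.42, I_b=0.55, A_r=0.31, I_r=0.47, xi=0.8, eps=0.1).as_vector()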
Further, the situation feature extraction and perception algorithm process based on the two-dimensional GIS situation map in step S2 includes:
S2-1: image feature extraction; a frame of m×n two-dimensional GIS situation map based on the RGB color space is abstracted into a matrix

C = (c_ij), 0 ≤ i < m, 0 ≤ j < n

each element of which is a three-dimensional vector representing the RGB color value of the pixel at the corresponding position, as shown in the following formula:

c_ij = [r, g, b]_ij

where c_ij ∈ [RGB] is the three-dimensional RGB color value of the pixel at position (i, j), r is the red component, g is the green component and b is the blue component of the color, each ranging from 0 to 255;
S2-2: let the color value range of the information perception domain of the own-side combat entities be c_I^b, the color value of the fire strike domain of the own-side combat entities be c_A^b, and the color value of the enemy combat entities be c_r; steps S2-3 to S2-8 are executed for every frame of the two-dimensional GIS situation map;
S2-3: copy the two-dimensional GIS situation map and, pixel by pixel on copy layer I, judge whether the current pixel belongs to the own-side information perception area; if so, keep its color value, otherwise set the color value to 0:

c_ij^I = c_ij if c_ij ∈ c_I^b, and c_ij^I = 0 otherwise;

S2-4: copy the two-dimensional GIS situation map and, pixel by pixel on copy layer II, judge whether the current pixel belongs to the own-side fire strike area; if so, keep its color value, otherwise set the color value to 0:

c_ij^II = c_ij if c_ij = c_A^b, and c_ij^II = 0 otherwise;

S2-5: copy the two-dimensional GIS situation map and, pixel by pixel on copy layer III, judge whether the current pixel belongs to an enemy combat entity; if so, keep its color value, otherwise set the color value to 0:

c_ij^III = c_ij if c_ij = c_r, and c_ij^III = 0 otherwise;

S2-6: let the enemy combat entities be e_1, e_2, ..., e_p, and retrieve their corresponding information perception ranges and fire strike ranges from the weapon equipment rule base; execute step S2-7 and step S2-8 on the layer obtained after the processing of step S2-5;
S2-7: copy the layer obtained after the processing of step S2-5 and, pixel by pixel on copy layer IV, judge whether the current pixel belongs to an enemy combat entity; if so, assign the color value c_I^r to all pixels in the circle centered on the current pixel whose radius is the information perception range of the corresponding combat entity; otherwise keep the color value of the current pixel:

c_ij^IV = c_I^r for every pixel inside that circle if c_ij = c_r, and c_ij^IV = c_ij otherwise

where c_r is the color value of the enemy combat entities and c_I^r is the color value corresponding to the red-side information perception range;
S2-8: copy the layer obtained after the processing of step S2-5 and, pixel by pixel on copy layer V, judge whether the current pixel belongs to an enemy combat entity; if so, assign the color value c_A^r to all pixels in the circle centered on the current pixel whose radius is the fire strike range of the corresponding combat entity; otherwise keep the color value of the current pixel:

c_ij^V = c_A^r for every pixel inside that circle if c_ij = c_r, and c_ij^V = c_ij otherwise

where c_r is the color value of the enemy combat entities and c_A^r is the color value corresponding to the red-side fire strike range;
therefore, the intelligent aviation soldier system obtains four feature layers respectively reflecting own party and enemy information network connected domains and fire network connected domains from the two-dimensional GIS situation map, namely, the layer I, the layer II, the layer IV and the layer V, and completes situation feature extraction and perception.
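By way of illustration only, steps S2-3 to S2-8 can be sketched as per-pixel color masking plus circle filling. The concrete color values, the NumPy-based helpers and the use of a single shared radius per layer (collapsing the per-entity range lookup of steps S2-6 to S2-8 into one radius argument) are assumptions, not the patent's reference implementation:

    # Illustrative sketch of steps S2-3 to S2-8 (color values and helper names are assumed).
    import numpy as np

    C_I_B = (255, 200, 100)   # assumed color of the own-side information perception domain
    C_A_B = (255, 120, 120)   # assumed color of the own-side fire strike domain
    C_R   = (0, 0, 255)       # assumed color of detected enemy combat entities
    C_I_R = (0, 80, 160)      # assumed color of the simulated enemy information perception domain
    C_A_R = (0, 160, 80)      # assumed color of the simulated enemy fire strike domain

    def keep_color(img: np.ndarray, color) -> np.ndarray:
        """Layers I/II/III: keep pixels of one color, zero everything else (S2-3..S2-5)."""
        mask = np.all(img == np.array(color), axis=-1)
        out = np.zeros_like(img)
        out[mask] = color
        return out

    def paint_circles(entity_layer: np.ndarray, radius_px: int, color) -> np.ndarray:
        """Layers IV/V: around every enemy-entity pixel, fill a circle whose radius is the
        perception/strike range looked up in the weapon equipment rule base (S2-7/S2-8)."""
        out = entity_layer.copy()
        ys, xs = np.where(np.all(entity_layer == np.array(C_R), axis=-1))
        h, w = entity_layer.shape[:2]
        yy, xx = np.mgrid[0:h, 0:w]
        for y, x in zip(ys, xs):
            out[(yy - y) ** 2 + (xx - x) ** 2 <= radius_px ** 2] = color
        return out

    def extract_feature_layers(gis_map: np.ndarray, info_radius_px: int, fire_radius_px: int):
        layer_I   = keep_color(gis_map, C_I_B)                      # own information network connected domain
        layer_II  = keep_color(gis_map, C_A_B)                      # own fire network connected domain
        layer_III = keep_color(gis_map, C_R)                        # detected enemy entities
        layer_IV  = paint_circles(layer_III, info_radius_px, C_I_R) # simulated enemy information network
        layer_V   = paint_circles(layer_III, fire_radius_px, C_A_R) # simulated enemy fire network
        return layer_I, layer_II, layer_IV, layer_V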
Further, the aviation force comprises fighter formations, bomber formations, early-warning aircraft formations, unmanned reconnaissance aircraft formations and electronic jammer formations, and the combat behavior space of the intelligent aviation soldier force system is as follows:
S3-1: the combat aircraft comprise fighters and bombers; according to the combat characteristics of fighters, the executable tasks of a fighter formation comprise: area patrol J_1, take-off area patrol J_2, route patrol J_3, take-off route patrol J_4, escort J_5, take-off escort J_6, air interception J_7 and return J_8;
S3-2: according to the combat characteristics of bombers, the executable tasks comprise: area patrol H_1, take-off area patrol H_2, route patrol H_3, take-off route patrol H_4, area assault H_5, take-off area assault H_6, target assault H_7, take-off target assault H_8 and return H_9;
S3-3: according to the combat characteristics of the early-warning aircraft, its executable tasks comprise: area patrol detection Y_1, route patrol detection Y_2, early-warning aircraft detection mode Y_3, early-warning aircraft radar on/off Y_4 and detection task cancellation Y_5;
S3-4: according to the combat characteristics of the electronic jammer, its executable tasks comprise: area jamming R_1, route jamming R_2, jamming pattern setting R_3, jamming switch-off R_4 and jamming termination R_5;
S3-5: according to the combat characteristics of the unmanned reconnaissance aircraft, its executable tasks comprise: area patrol reconnaissance W_1, route patrol reconnaissance W_2 and reconnaissance task cancellation W_3;
S3-6: collecting the executable tasks of the different aviation formations described in steps S3-1 to S3-5 gives the combat behavior space of the decision behavior model of the whole intelligent aviation soldier force system, A = {J_1, ..., J_8, H_1, ..., H_9, Y_1, ..., Y_5, R_1, ..., R_5, W_1, ..., W_3}.
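For reference, the 30-element behavior space of step S3-6 can be written out as a flat list of task identifiers; a minimal sketch (the identifiers simply mirror the symbols above and are not normative names):

    # Illustrative sketch of the step S3-6 behavior space A (30 actions in total).
    FIGHTER   = [f"J{i}" for i in range(1, 9)]    # J1..J8: patrols, escort, interception, return
    BOMBER    = [f"H{i}" for i in range(1, 10)]   # H1..H9: patrols, assaults, return
    AWACS     = [f"Y{i}" for i in range(1, 6)]    # Y1..Y5: detection patrols, mode, radar on/off, cancel
    JAMMER    = [f"R{i}" for i in range(1, 6)]    # R1..R5: area/route jamming, pattern, off, end
    RECON_UAV = [f"W{i}" for i in range(1, 4)]    # W1..W3: patrol reconnaissance, cancel

    ACTION_SPACE = FIGHTER + BOMBER + AWACS + JAMMER + RECON_UAV
    assert len(ACTION_SPACE) == 30  # matches the 30 selectable behaviors per state in step S6-1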
Further, the reward value generation algorithm based on the network connected domain maximization in the step S4 includes:
S4-1: count the color-histogram pixel proportions in the feature layers obtained in step S2; execute steps S4-2 to S4-4 separately for the four m×n feature layers representing the own-side information network connected domain, the own-side fire network connected domain, the enemy information network connected domain and the enemy fire network connected domain;
S4-2: color quantization; let the color interval of the layer be range, which contains the own-side combat entity information perception domain color value c_I^b, the own-side combat entity fire strike domain color value c_A^b and the enemy combat entity color value c_r, i.e. the following conditions are satisfied:

c_I^b ∈ range, c_A^b ∈ range, c_r ∈ range

divide range into N color intervals bin_i = [c_i1, c_i2], each called a bin of the color histogram, as follows:

range = bin_1 ∪ bin_2 ∪ … ∪ bin_N

where c_i1 is the lower bound of the color interval bin_i and c_i2 is its upper bound;
S4-3: perform color detection pixel by pixel and count the number of pixels whose color falls in each interval, obtaining the color histogram, expressed as:

h_i = (1 / (m·n)) · Σ_{p=0..m-1} Σ_{q=0..n-1} δ(c_pq − c_i), i = 1, 2, …, N

where h_i denotes the proportion of pixels whose color falls in the interval bin_i = [c_i1, c_i2]; c_pq is the color value of the pixel at position (p, q); c_i is the center color value of the color sub-interval bin_i: c_i = 0.5 × (c_i1 + c_i2); and δ(c_pq − c_i) is the color judgment function, of the specific form:

δ(c_pq − c_i) = 1 if c_pq ∈ bin_i, and δ(c_pq − c_i) = 0 otherwise;

S4-4: count the total pixel proportion of non-zero color values:

h_T = Σ_{i: c_i ≠ 0} h_i

where h_T denotes the total pixel proportion of non-zero color values;
S4-5: executing steps S4-2 to S4-4 for the four layers gives the total non-zero-color pixel proportions h_T(1), h_T(2), h_T(3) and h_T(4) of the four monochromatic feature layers, corresponding respectively to the own-side information network situation characteristic parameter I_b = h_T(1), the own-side fire network situation characteristic parameter A_b = h_T(2), the enemy information network situation characteristic parameter I_r = h_T(3) and the enemy fire network situation characteristic parameter A_r = h_T(4);
S4-6: based on the result of step S4-5, obtain the quantified combat-advantage parameters of both sides through weighted combination, with P_b denoting the combat advantage of the own side in the system confrontation and P_r the combat advantage of the enemy:

P_b = ω_1·I_b + ω_2·A_b
P_r = ω_1·I_r + ω_2·A_r

where ω_1 is the weight of the information network advantage in the overall combat advantage and ω_2 is the weight of the fire network advantage; the weights are adjusted within (0, 1) and satisfy ω_1 + ω_2 = 1;
S4-7: designing a formalized reward function based on the contrast of the operational advantages of the two parties; the core idea of the reward function for proposing the reward mechanism is that: comparing a one-time behavior decision made under the current situation with the comprehensive combat superiority of two parties formed after the interaction of the battlefield environment to obtain a reward value based on the current situation and the decision; specifically, if the decision makes the intelligent agent have the comprehensive combat advantage relative to the enemy, the reward is positive, and the greater the advantage is, the greater the absolute value of the reward value is; if the decision makes the intelligent agent have the disadvantage of comprehensive combat relative to the enemy, the reward is negative, and the greater the disadvantage, the greater the absolute value of the reward value; meanwhile, the reward parameters need to be normalized;
the reward function is expressed as: the proportion of the operational advantage of one intelligent agent to the total operational advantage of the two intelligent agents is used as a main reward value, a minimum value delta is matched to introduce a positive and negative numerical characteristic, and the following formula is shown as follows:
Figure BDA0002807170270000071
wherein R is an award value based on the current situation and decision; delta is a minimum value in the range of (10)-4,10-3) The significance is to avoid divide by zero while introducing normalized prize values into the positive and negative numerical features.
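A compact sketch of the reward generation of steps S4-1 to S4-7 follows. Because the feature layers are monochromatic, the histogram binning of steps S4-2/S4-3 is collapsed here into a direct count of non-zero pixels, and the closed-form reward expression mirrors the reconstruction given above; the weight and δ values are assumed:

    # Illustrative sketch of the reward value generation of step S4 (parameter values are assumed).
    import numpy as np

    def nonzero_pixel_ratio(layer: np.ndarray) -> float:
        """S4-2..S4-4: fraction h_T of pixels with a non-zero color value in a monochromatic layer."""
        colored = np.any(layer != 0, axis=-1)
        return float(colored.mean())

    def reward(layer_I, layer_II, layer_IV, layer_V, w1=0.5, w2=0.5, delta=5e-4) -> float:
        I_b = nonzero_pixel_ratio(layer_I)    # own information network situation parameter
        A_b = nonzero_pixel_ratio(layer_II)   # own fire network situation parameter
        I_r = nonzero_pixel_ratio(layer_IV)   # enemy information network situation parameter
        A_r = nonzero_pixel_ratio(layer_V)    # enemy fire network situation parameter
        P_b = w1 * I_b + w2 * A_b             # own combined combat advantage (S4-6)
        P_r = w1 * I_r + w2 * A_r             # enemy combined combat advantage (S4-6)
        # S4-7: normalized, sign-carrying reward; delta avoids division by zero.
        return (P_b - P_r) / (P_b + P_r + delta)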
Further, the specific process of step S5 is as follows:
S5-1: the transition of the combat situation is described probabilistically; the transition probability between states,

P(s' | s, a),

denotes the probability of reaching state s' after executing action a in state s; all the transition probabilities form a matrix called the environment transition matrix, denoted T;
S5-2: after the own side selects a behavior a, the change of the combat situation is fully expressed by the state transition matrix, and the air combat process conforms to a first-order Markov decision process, i.e. the transition probability depends only on the current state;
S5-3: combined with the probabilities in the state transition model, in each state s a behavior a is selected with a certain probability by following the policy π in that state, forming a "state-behavior" pair (s, a); the value of the "state-behavior" pair is given by the Q function and denoted Q_π(s, a);
S5-4: in behavior selection, a random selection component is added on top of the greedy strategy to form the behavior selection strategy μ, which selects one behavior from the behavior space in each state and transitions to the next state with a certain probability; the strategy μ is constructed by first setting an exploration constant τ ∈ (0, 1) and, at each behavior selection, generating a random number ρ in the interval [0, 1]:

μ: select a random behavior from the behavior space A if ρ < τ, and select argmax_a Q(s, a) otherwise;

taking τ = 0.2, there is a 20% probability of a freely chosen exploratory action.
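The greedy-random behavior selection strategy μ of step S5-4 corresponds to what is usually called ε-greedy selection; a minimal sketch with τ = 0.2 (the dictionary-based interface is an assumption):

    # Illustrative sketch of the greedy-random behavior selection strategy mu (step S5-4).
    import random

    def select_action(q_values: dict, tau: float = 0.2) -> str:
        """q_values maps each behavior in the behavior space to Q(s, a) for the current state s."""
        rho = random.random()                      # random number in [0, 1]
        if rho < tau:                              # with probability tau: free exploration
            return random.choice(list(q_values))
        return max(q_values, key=q_values.get)     # otherwise: greedy, maximum-utility behavior

    # Example: with tau = 0.2 roughly one decision in five explores a random behavior.
    q = {"J7": 0.8, "H5": 0.3, "Y1": 0.1}
    a = select_action(q)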
Further, the specific process of step S6 is as follows:
S6-1: use the situation perception information obtained in step S2, the remaining weapon-ammunition percentage obtained from the simulation platform and the aviation formation battle-loss ratio to form the state space vector, with s denoting the specific state space vector at a given moment; build a GRBF neural network consisting of an input layer, a discretization layer, a hidden layer and an output layer to discretize the Q-function values of "state-behavior" pairs, thereby partitioning the continuous state space and obtaining the "state-behavior" pair values corresponding to discrete states. The network input is the state space vector, and the output is the set of values of all "state-behavior" pairs obtained by selecting the different behaviors in the state corresponding to that vector; the input layer and the discretization layer have the same dimension as the state space vector, the hidden layer has m nodes in total, and the output layer has the same dimension as the behavior space; for the aviation Agent, 30 behaviors are selectable from the behavior space in every state, and the calculation formula is:

Q(s, a_j) = Σ_{i=1..m} w_ij · b̄_i(s), j = 1, 2, …, 30

where Q(s, a_j) is the Q-function value of executing the j-th behavior in state s, w_ij is the connection weight between the i-th hidden-layer node and the j-th output-layer node, and b̄_i(s) is the normalized output of the i-th hidden-layer node:

b̄_i(s) = b_i(s) / Σ_{k=1..m} b_k(s)

in which the radial basis function b_i(s) is calculated as:

b_i(s) = exp(−‖s − d_i‖² / (2σ_i²))

where d_i is the center of the i-th basis function, with the same dimension as s, σ_i is the width of the i-th basis function, and ‖s − d_i‖ is the Euclidean distance between the input state and the basis function center; after the number of hidden-layer nodes is set manually, the d_i and σ_i are all determined by the k-means clustering algorithm;
S6-2: carry out iterative learning and training of the force Agent based on the framework of step S6-1; the learning process is counted in cycles, with one complete combat round regarded as one learning cycle, and the decision process of the intelligent aviation combat system is described by steps S6-3 to S6-7;
S6-3: initialize the GRBF neural network of the aviation Agent, set the GRBF centers and widths by k-means clustering, set the maximum number of learning cycles K, and let k = 1;
S6-4: start the learning of the k-th iteration cycle and start the confrontation simulation, with t the current time, t = t_0 and s_t = s_0, where s_0 is the initial state;
S6-5: in the k-th iteration cycle, execute behavior a_t in state s_t following the policy μ; then, on the basis of the immediate reward R_t obtained by step S4, transition to the new state s_{t+1} and continue to execute behavior a_{t+1} following the policy μ; compute the GRBF network output corresponding to s_t and update the hidden-to-output layer weights with the temporal-difference algorithm according to:

w_{i,id(a_t)}^k = w_{i,id(a_t)}^{k−1} + α · [ R_t + γ · Σ_{i'} w_{i',id(a_{t+1})}^{k−1} · b̄_{i'}(s_{t+1}) − Σ_{i'} w_{i',id(a_t)}^{k−1} · b̄_{i'}(s_t) ] · b̄_i(s_t)

where w_{i,id(a_t)}^k is the connection weight between the i-th hidden-layer node and the id(a_t)-th output-layer node of the GRBF neural network obtained by iteration in the k-th learning cycle; w_{i,id(a_t)}^{k−1} is the connection weight between the i-th hidden-layer node and the id(a_t)-th output-layer node in the (k−1)-th learning cycle; w_{i,id(a_{t+1})}^{k−1} is the connection weight between the i-th hidden-layer node and the id(a_{t+1})-th output-layer node in the (k−1)-th learning cycle; b̄_i(s_t) is the normalized radial basis output for the state s_t described in S6-1; b̄_i(s_{t+1}) is that for the state s_{t+1}; id(a_t) is the index of behavior a_t; id(a_{t+1}) is the index of behavior a_{t+1}; α is the learning rate, taking values in (0, 1); and γ is the discount factor;
S6-6: let t = t + 1 and repeat step S6-5 until the confrontation simulation is decided and reaches the terminal state of the current iteration cycle;
S6-7: let k = k + 1 and repeat steps S6-4 to S6-6 until k > K.
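The following is a rough, non-normative sketch of the GRBF value network of step S6-1 and the iterative learning of steps S6-2 to S6-7. The Gaussian form of the basis functions, the width heuristic, the discount factor γ, the scikit-learn k-means call and the environment interface (`reset()` / `step()` returning the next state, the step-S4 reward and a terminal flag) are all assumptions, not part of the patent text:

    # Illustrative sketch only: GRBF Q-network (step S6-1) and TD training loop (steps S6-2 to S6-7).
    import numpy as np
    from sklearn.cluster import KMeans

    class GRBFQNetwork:
        def __init__(self, sample_states: np.ndarray, n_hidden: int, n_actions: int = 30):
            km = KMeans(n_clusters=n_hidden, n_init=10).fit(sample_states)
            self.d = km.cluster_centers_                      # centers d_i of the basis functions
            dists = np.linalg.norm(self.d[:, None] - self.d[None, :], axis=-1)
            self.sigma = dists.mean(axis=1) + 1e-6            # widths sigma_i (heuristic choice)
            self.w = np.zeros((n_hidden, n_actions))          # hidden-to-output weights w_ij

        def features(self, s: np.ndarray) -> np.ndarray:
            """Normalized radial basis outputs for state s."""
            b = np.exp(-np.linalg.norm(s - self.d, axis=1) ** 2 / (2 * self.sigma ** 2))
            return b / b.sum()

        def q_values(self, s: np.ndarray) -> np.ndarray:
            """Q(s, a_j) = sum_i w_ij * normalized b_i(s), one value per behavior."""
            return self.features(s) @ self.w

    def train(net: GRBFQNetwork, env, policy, K: int, alpha: float = 0.1, gamma: float = 0.9):
        """policy maps a Q-value vector to a behavior index, e.g. the greedy-random strategy mu."""
        for k in range(1, K + 1):                             # S6-3/S6-7: learning cycles 1..K
            s = env.reset()                                   # S6-4: initial state s_0
            a = policy(net.q_values(s))
            done = False
            while not done:                                   # S6-5/S6-6: step until the terminal state
                s_next, r, done = env.step(a)                 # immediate reward from step S4
                a_next = policy(net.q_values(s_next))
                td = r + gamma * net.q_values(s_next)[a_next] - net.q_values(s)[a]
                net.w[:, a] += alpha * td * net.features(s)   # temporal-difference weight update
                s, a = s_next, a_next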
The invention has the beneficial effects that:
1. The invention carries out new exploratory research on intelligent decision modeling of aviation formations based on a deep reinforcement learning algorithm. Compared with rule-set-based aviation formation decision models, the deep-reinforcement-learning-based intelligent behavior model requires less time for knowledge acquisition, adapts well to changes in the decision environment, is convenient to reuse and needs no manual rule maintenance, supporting continuously changing tasks in a highly dynamic battlefield environment.
2. The invention proposes a situation feature extraction and perception algorithm based on a two-dimensional GIS situation map. Using a situation-map information extraction mechanism similar to human decision-making, it builds the ability to perceive incomplete situation environment information in system confrontation, serves the generation of behavior values for the force agent during confrontation, and supports a complete closed loop of the reinforcement learning process.
3. The invention proposes a reward value generation algorithm based on network connected-domain maximization, designs a reward mechanism for continuous decision optimization with maximization of the own-side combat network connected domains at its core, and supports driving the force agent to complete decision generation and optimization under the excitation of incomplete confrontation situation information.
Drawings
In order to illustrate embodiments of the present invention or technical solutions in the prior art more clearly, the drawings which are needed in the embodiments will be briefly described below, so that the features and advantages of the present invention can be understood more clearly by referring to the drawings, which are schematic and should not be construed as limiting the present invention in any way, and for a person skilled in the art, other drawings can be obtained on the basis of these drawings without any inventive effort. Wherein:
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a situation feature extraction and perception algorithm flow based on a two-dimensional GIS situation map of the invention;
FIG. 3 is a two-dimensional confrontational map of the present invention, wherein (a) and (b) are two different examples of confrontational maps (the two-dimensional GIS situational map is the input to the situational feature extraction algorithm);
fig. 4 is a monochrome feature map layer obtained by color feature extraction according to the present invention, where (a) is a two-dimensional GIS situation map before color feature extraction, and (b) is the monochrome feature map layer obtained by color feature extraction;
FIG. 5 is a flow chart of a reward value generation algorithm based on network connected domain maximization according to the present invention;
FIG. 6 is a diagram illustrating situation characteristic parameters of two enemies and the my party obtained through color histogram statistics, wherein (a) is four monochrome characteristic image layers representing battle network situations, and (b) is a situation characteristic parameter obtained by performing color quantization on the four monochrome characteristic image layers;
FIG. 7 is a transition model of the present invention constructed from a plurality of probabilities;
FIG. 8 is the most effective behavior sequence selected by the invention;
FIG. 9 is a TD-Q based reinforcement learning framework of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Aiming at the technical bottlenecks in intelligent behavior modeling of the aviation soldier system, the invention explores, on the basis of a deep reinforcement learning algorithm, the intelligent behavior modeling problem of a multi-formation aviation soldier system whose combat situation is complex and changeable and whose state space is difficult to describe.
The invention mainly solves two technical problems: (1) effective state-space expression of the complex aerial battlefield situation and effective perception of the global situation by the aviation behavior model; (2) generation of effective continuous reward values for the confrontation decision behaviors of the aviation force under the excitation of incomplete global situation environment information. On this basis, an intelligent confrontation decision behavior model of the aviation soldier system is constructed, and an effective, high-return aerial formation combat behavior sequence space is formed through iterative training. This supports building an intelligent aviation force model based on reinforcement learning, and further provides an effective theoretical basis and technical support for gaining greater combat information advantage under incomplete battlefield situation perception, generating efficient air combat command decisions, analyzing, deducing and replaying air combat schemes, and improving the combat level of the aviation soldier system.
As shown in figure 1, the invention establishes an intelligent behavior model of the aviation soldier system based on global situation information. By comprehensively studying and judging the complex global aerial battlefield situation, it selects a series of key situation information items that influence the outcome of the system confrontation to form the state space elements of aviation combat and expresses the global situation mathematically as a state vector; it proposes a situation feature extraction and perception algorithm based on a two-dimensional GIS (Geographic Information System) situation map to obtain the element information that cannot be read directly from the state vector, thereby obtaining the global situation state space perceived by the aviation intelligent behavior model; and it proposes a reward value generation algorithm based on network connected-domain maximization to drive the aviation intelligent behavior model to evolve iteratively toward high return under the excitation of an incomplete global situation. Together, this scheme forms a complete technical chain for constructing an intelligent behavior model of the aviation force system.
Specifically, the aviation soldier system intelligent behavior modeling method based on global situation information comprises the following steps:
S1: the aviation force generally consists of fighter formations, bomber formations, early-warning aircraft formations, unmanned reconnaissance aircraft formations and electronic jammer formations. According to the air combat characteristics of the aviation force and the importance of the factors that influence the air combat outcome, extract key elements to construct an environment state space vector that effectively represents the aviation battlefield situation; select the own-side fire network connected-domain ratio A_b, the own-side information network connected-domain ratio I_b, the enemy fire network connected-domain ratio A_r, the enemy information network connected-domain ratio I_r, the remaining weapon ammunition percentage ξ and the aviation formation battle-loss ratio ε to form the environment state space vector S = ⟨A_b, I_b, A_r, I_r, ξ, ε⟩ describing the aviation battlefield situation;
S2: use the situation feature extraction and perception algorithm based on the two-dimensional GIS situation map to acquire the environment situation information A_b, I_b, A_r and I_r that cannot be obtained directly for the environment state space vector. On the two-dimensional GIS situation map with graphic features — in which the own-side information detection range, the own-side fire strike range and the detected enemy force entity positions are each drawn in clearly distinguishable colors — perform image feature extraction to obtain monochromatic feature layers of the own-side and enemy information network connected domains and fire network connected domains, so that the reinforcement learning agent can perceive the situation environment information;
in the image feature extraction part, a color-based feature extraction method extracts the image color features contained in the two-dimensional GIS situation map and zeroes the color values of non-feature pixels, obtaining the information network connected domain and the fire network connected domain formed by the own-side aviation combat entities as two monochromatic feature layers reflecting the own-side (intelligent blue) combat situation; enemy combat entities are located by the same feature extraction method, the corresponding entity information is retrieved from the own weapon equipment rule base, and the information network connected domain and fire network connected domain formed by the enemy (intelligent red) combat entities are simulated, generating two monochromatic feature layers reflecting the enemy combat situation;
S3: design the combat behavior space of the aviation soldier system; divide the executable task sets of the aircraft formations according to the combat characteristics of the force formations in the aviation soldier system, and integrate the executable tasks of all aircraft formations to form the combat behavior space of the aviation soldier system;
S4: generate an effective real-time reward mechanism with a reward value generation algorithm based on network connected-domain maximization;
perform color histogram statistics on the feature layers obtained in step S2, calculate the proportions of colored pixels of the information network connected domain and the fire network connected domain in each monochromatic feature layer to obtain numerical parameters characterizing the own-side and enemy situation features, obtain quantitative combat-advantage parameters of both sides through weighted combination, and design a reward function based on the comparison of combat advantages; this gives clearly positive or negative real-time reward feedback for every behavior decision of the agent, and the reward mechanism drives the agent toward continuously optimized behavior decisions;
S5: construct the state transition model and design the action selection strategy;
in the confrontation of the aviation force system, because the two sides make their decisions independently, the transition of the combat situation is non-deterministic and has to be described by probabilities, as shown in fig. 7. Because the air combat situation changes drastically, the transition of the aviation air-combat situation is considered to conform to a first-order Markov Decision Process (MDP), i.e. the state transition probability depends only on the current state. The behavior selection strategy is designed as a greedy-random algorithm that selects the behavior with the greatest utility in a given state while adding randomness to behavior selection, allowing the aviation formation to "explore" the state space;
S6: based on the temporal-difference algorithm, fuse the aviation state space vector, the state transition model and the action selection strategy formed in steps S1-S5 with the reward value generation algorithm based on network connected-domain maximization into an improved reinforcement learning framework for aviation combat confrontation, and carry out iterative learning and training of the force Agent on this framework.
As shown in figs. 2-3, the situation feature extraction and perception algorithm based on the two-dimensional GIS situation map in step S2 is as follows:
S2-1: image feature extraction; a frame of m×n two-dimensional GIS situation map based on the RGB color space is abstracted into a matrix

C = (c_ij), 0 ≤ i < m, 0 ≤ j < n

each element of which is a three-dimensional vector representing the RGB color value of the pixel at the corresponding position, as shown in the following formula:

c_ij = [r, g, b]_ij

where c_ij ∈ [RGB] is the three-dimensional RGB color value of the pixel at position (i, j), r is the red component, g is the green component and b is the blue component of the color, each ranging from 0 to 255;
S2-2: let the color value range of the information perception domain of the own-side combat entities be c_I^b, the color value of the fire strike domain of the own-side combat entities be c_A^b, and the color value of the enemy combat entities be c_r; steps S2-3 to S2-8 are executed for every frame of the two-dimensional GIS situation map;
S2-3: copy the two-dimensional GIS situation map and, pixel by pixel on copy layer I, judge whether the current pixel belongs to the own-side information perception area; if so, keep its color value, otherwise set the color value to 0:

c_ij^I = c_ij if c_ij ∈ c_I^b, and c_ij^I = 0 otherwise;

S2-4: copy the two-dimensional GIS situation map and, pixel by pixel on copy layer II, judge whether the current pixel belongs to the own-side fire strike area; if so, keep its color value, otherwise set the color value to 0:

c_ij^II = c_ij if c_ij = c_A^b, and c_ij^II = 0 otherwise;

S2-5: copy the two-dimensional GIS situation map and, pixel by pixel on copy layer III, judge whether the current pixel belongs to an enemy combat entity; if so, keep its color value, otherwise set the color value to 0:

c_ij^III = c_ij if c_ij = c_r, and c_ij^III = 0 otherwise;

S2-6: let the enemy combat entities be e_1, e_2, ..., e_p, and retrieve their corresponding information perception ranges and fire strike ranges from the weapon equipment rule base; execute step S2-7 and step S2-8 on the layer obtained after the processing of step S2-5;
S2-7: copy the layer obtained after the processing of step S2-5 and, pixel by pixel on copy layer IV, judge whether the current pixel belongs to an enemy combat entity; if so, assign the color value c_I^r to all pixels in the circle centered on the current pixel whose radius is the information perception range of the corresponding combat entity; otherwise keep the color value of the current pixel:

c_ij^IV = c_I^r for every pixel inside that circle if c_ij = c_r, and c_ij^IV = c_ij otherwise

where c_r is the color value of the enemy combat entities and c_I^r is the color value corresponding to the red-side information perception range;
S2-8: copy the layer obtained after the processing of step S2-5 and, pixel by pixel on copy layer V, judge whether the current pixel belongs to an enemy combat entity; if so, assign the color value c_A^r to all pixels in the circle centered on the current pixel whose radius is the fire strike range of the corresponding combat entity; otherwise keep the color value of the current pixel:

c_ij^V = c_A^r for every pixel inside that circle if c_ij = c_r, and c_ij^V = c_ij otherwise

where c_r is the color value of the enemy combat entities and c_A^r is the color value corresponding to the red-side fire strike range;
the intelligent aviation soldier system thus obtains from the two-dimensional GIS situation map four feature layers reflecting, respectively, the own-side and enemy information network connected domains and fire network connected domains, namely layer I, layer II, layer IV and layer V, completing situation feature extraction and perception, as shown in FIG. 4, in which (a) is the two-dimensional situation map before color feature extraction and (b) is a monochromatic feature layer obtained by color feature extraction;
In step S3, the aviation force comprises fighter formations, bomber formations, early-warning aircraft formations, unmanned reconnaissance aircraft formations and electronic jammer formations, and the combat behavior space of the intelligent aviation soldier force system is as follows:
S3-1: the combat aircraft comprise fighters and bombers; according to the combat characteristics of fighters, the executable tasks of a fighter formation comprise: area patrol J_1, take-off area patrol J_2, route patrol J_3, take-off route patrol J_4, escort J_5, take-off escort J_6, air interception J_7 and return J_8;
S3-2: according to the combat characteristics of bombers, the executable tasks comprise: area patrol H_1, take-off area patrol H_2, route patrol H_3, take-off route patrol H_4, area assault H_5, take-off area assault H_6, target assault H_7, take-off target assault H_8 and return H_9;
S3-3: according to the combat characteristics of the early-warning aircraft, its executable tasks comprise: area patrol detection Y_1, route patrol detection Y_2, early-warning aircraft detection mode (air, sea or alternating) Y_3, early-warning aircraft radar on/off Y_4 and detection task cancellation Y_5;
S3-4: according to the combat characteristics of the electronic jammer, its executable tasks comprise: area jamming R_1, route jamming R_2, jamming pattern setting (barrage jamming, aimed jamming) R_3, jamming switch-off R_4 and jamming termination R_5;
S3-5: according to the combat characteristics of the unmanned reconnaissance aircraft, its executable tasks comprise: area patrol reconnaissance W_1, route patrol reconnaissance W_2 and reconnaissance task cancellation W_3;
S3-6: collecting the executable tasks of the different aviation formations described in steps S3-1 to S3-5 gives the combat behavior space of the decision behavior model of the whole intelligent aviation soldier force system, A = {J_1, ..., J_8, H_1, ..., H_9, Y_1, ..., Y_5, R_1, ..., R_5, W_1, ..., W_3}.
As shown in fig. 5, the flow of the reward value generation algorithm based on the network connected domain maximization in step S4 is as follows:
s4-1: counting the pixel proportion of the color histogram in the feature map layer obtained in the step S2; respectively executing the steps S4-2 to S4-4 to four m multiplied by n characteristic image layers representing the own-party information network connected domain, the own-party fire network connected domain, the enemy information network connected domain and the enemy fire network connected domain;
s4-2: color quantization; setting the color interval of the map layer as range, wherein the range comprises the color value range of the information perception domain of the own combat entity
Figure BDA0002807170270000151
The fire striking domain color value of the own combat entity is
Figure BDA0002807170270000152
And the color value of the enemy combat entity is crNamely, the following conditions are satisfied:
Figure BDA0002807170270000153
randomly dividing range into N color intervals bini=[ci1,ci2]Each bin is called a bin of the color histogram, as follows:
range=bin1∪bin2U…∪binN
in the formula, ci1Is the color interval biniLower boundary of ci2Is the color interval biniThe upper bound of (c);
S4-3: perform color detection pixel by pixel and count the number of pixels whose color falls in each interval to obtain the color histogram, expressed as:

H = {h_1, h_2, …, h_N},  h_i = (1/(m·n)) · Σ_{p=0}^{m−1} Σ_{q=0}^{n−1} δ(c_pq − c_i)

where h_i denotes the proportion of pixels whose color falls in the interval bin_i = [c_i1, c_i2]; c_pq is the color value of the pixel at position (p, q); c_i is the center color value of the color sub-interval bin_i, c_i = 0.5 × (c_i1 + c_i2); and δ(c_pq − c_i) is a color decision function of the specific form:

δ(c_pq − c_i) = 1, if c_pq ∈ bin_i;  δ(c_pq − c_i) = 0, otherwise;
S4-4: count the total pixel proportion of non-zero color values:

h_T = Σ_{i: c_i ≠ 0} h_i

where h_T denotes the total pixel proportion of non-zero color values;
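As a concrete illustration of steps S4-2 to S4-4, the following Python sketch computes the color histogram of one m×n single-color feature layer and its total non-zero-color pixel proportion h_T; it assumes the layer is supplied as a 2-D array of scalar color codes (an RGB layer would first be mapped to such codes), and the default bin count is an illustrative assumption.

import numpy as np

def histogram_and_nonzero_ratio(layer: np.ndarray, n_bins: int = 16):
    """Steps S4-2..S4-4 (sketch): quantize the colors of one m x n feature layer
    into n_bins intervals, build the histogram h_i, and return the total
    proportion h_T of pixels whose color value is non-zero."""
    values = layer.reshape(-1).astype(float)      # flatten the m x n layer
    total = values.size
    hist, _ = np.histogram(values, bins=n_bins)   # pixel counts per color interval bin_i
    h = hist / total                              # h_i: per-bin pixel proportion
    h_T = np.count_nonzero(values) / total        # total proportion of non-zero colors
    return h, h_T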
S4-5: executing steps S4-2 to S4-4 four times yields the total non-zero-color pixel proportions h_T(1), h_T(2), h_T(3), h_T(4) of the four monochromatic feature layers, corresponding respectively to the own information network situation characteristic parameter I_b = h_T(1), the own fire network situation characteristic parameter A_b = h_T(2), the enemy information network situation characteristic parameter I_r = h_T(3) and the enemy fire network situation characteristic parameter A_r = h_T(4); the situation characteristic parameters of both sides obtained through color histogram statistics are shown in fig. 6, in which (a) shows the four monochromatic feature layers representing the combat network situation and (b) shows the situation characteristic parameters obtained by color quantization of the four monochromatic feature layers;
S4-6: based on the result of step S4-5, the quantified combat advantage parameters of the two sides are obtained by weighted integration, with P_b denoting the combat advantage of the own side in the system confrontation and P_r denoting the combat advantage of the enemy side in the system confrontation; the combat advantages of the two sides are:
P_b = ω_1·I_b + ω_2·A_b
P_r = ω_1·I_r + ω_2·A_r
where ω_1 represents the weight of the information network advantage in the comprehensive combat advantage and ω_2 represents the weight of the fire network advantage in the comprehensive combat advantage; the weight values are adjusted within (0, 1) and satisfy ω_1 + ω_2 = 1;
S4-7: design a formalized reward function based on the comparison of the combat advantages of the two sides; the core idea of the proposed reward mechanism is: a single behavior decision made under the current situation is evaluated against the comprehensive combat advantages of the two sides formed after interaction with the battlefield environment, yielding a reward value based on the current situation and decision; specifically, if the decision gives the agent a comprehensive combat advantage over the enemy, the reward is positive, and the greater the advantage, the greater the absolute value of the reward; if the decision leaves the agent at a comprehensive combat disadvantage relative to the enemy, the reward is negative, and the greater the disadvantage, the greater the absolute value of the reward; meanwhile, the reward parameter needs to be normalized;
Corresponding to this reward mechanism, the reward function is expressed as follows: the proportion of one agent's combat advantage relative to the total combat advantage of the two agents serves as the main reward value, combined with a very small value δ to introduce positive and negative numerical characteristics, as shown in the following formula:
R = (P_b − P_r) / (P_b + P_r + δ)
where R is the reward value based on the current situation and decision; δ is a very small value in the range (10^−4, 10^−3), whose significance is to avoid division by zero while giving the normalized reward value positive and negative numerical characteristics.
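A compact sketch of steps S4-5 to S4-7 follows; the closed-form reward R = (P_b − P_r)/(P_b + P_r + δ) reconstructed above, as well as the example weights ω_1 = ω_2 = 0.5 and δ = 10^−3, are assumptions consistent with the stated properties (normalized value, sign determined by the advantage comparison), not a verbatim transcription of the patent formula.

def combat_advantage(I: float, A: float, w1: float = 0.5, w2: float = 0.5) -> float:
    """Step S4-6 (sketch): weighted combination of the information-network and
    fire-network situation parameters; w1 + w2 = 1 (example values, adjustable in (0, 1))."""
    return w1 * I + w2 * A

def reward(I_b: float, A_b: float, I_r: float, A_r: float, delta: float = 1e-3) -> float:
    """Step S4-7 (sketch): normalized reward from the advantage comparison;
    the exact closed form is an assumption, and delta avoids division by zero."""
    P_b = combat_advantage(I_b, A_b)   # own comprehensive combat advantage
    P_r = combat_advantage(I_r, A_r)   # enemy comprehensive combat advantage
    return (P_b - P_r) / (P_b + P_r + delta)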
As shown in fig. 7 to 8, the specific process of step S5 is:
S5-1: the transition of the combat situation is described probabilistically; the transition probability between states is

T(s, a, s′) = P(s′ | s, a)

meaning the probability of reaching state s′ after executing behavior a in state s; all the transition probabilities form a matrix, called the environment transition matrix and denoted T;
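As an illustrative sketch only (the state and behavior encodings are assumptions), the environment transition matrix T of step S5-1 can be held as a mapping from a (state, behavior) pair to a probability distribution over successor states:

import random
from collections import defaultdict

# T[(s, a)][s_next] = P(s_next | s, a); the probabilities for each (s, a) sum to 1.
T = defaultdict(dict)
T[("s0", "J1")] = {"s1": 0.7, "s2": 0.3}   # example entry; values are illustrative

def sample_next_state(s: str, a: str) -> str:
    """Draw a successor state s' with probability T(s, a, s')."""
    states, probs = zip(*T[(s, a)].items())
    return random.choices(states, weights=probs, k=1)[0]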
s5-2: after the own party selects the behavior a, the change of the fighting situation is completely expressed by the state transition matrix, and the air combat process conforms to the first-order Markov decision process, namely the transition probability is only related to the current state;
S5-3: combining the probabilities in the state transition model, in each state s a behavior a is selected with a certain probability by following the policy π, forming a 'state-behavior' pair (s, a); the value of the 'state-behavior' pair is given by the Q function and denoted Q^π(s, a);
S5-4: in behavior selection, a random-selection component is added on top of the greedy strategy to form the behavior selection strategy μ, so that in each state a behavior is chosen from the behavior space and the system transfers to the next state with a certain probability; the construction of the behavior selection strategy μ first sets an exploration constant τ ∈ (0, 1); each time a behavior is selected, a random number ρ in the interval [0, 1] is generated, and:

a_t = a behavior drawn at random from the behavior space A, if ρ ≤ τ;
a_t = argmax_{a∈A} Q(s_t, a), if ρ > τ;

taking τ = 0.2, there is a 20% probability of freely choosing a behavior.
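The behavior selection strategy μ of step S5-4 is essentially an ε-greedy rule with exploration constant τ; a minimal Python sketch (with τ = 0.2 as above, and the Q values supplied as a plain list) is:

import random

def select_behavior(q_values: list, tau: float = 0.2) -> int:
    """Step S5-4 (sketch): with probability tau pick a random behavior index,
    otherwise pick the behavior with the largest Q value (greedy choice)."""
    rho = random.random()                      # random number in [0, 1]
    if rho <= tau:                             # exploration branch
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda j: q_values[j])  # greedy branch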
As shown in fig. 9, the specific process of step S6 is:
S6-1: the situation perception information obtained in step S2, the remaining weapon/ammunition percentage obtained from the simulation platform and the aviation soldier formation battle loss ratio are used to form the state space vector, with s denoting the specific state space vector at a given moment; a GRBF neural network consisting of an input layer, a discrete layer, a hidden layer and an output layer is built to discretize the Q function values of 'state-behavior' pairs, so as to partition the continuous state space and obtain the 'state-behavior' pair values corresponding to discrete states; the network input is the state space vector, and the output is the set of values of all 'state-behavior' pairs obtained by selecting different behaviors in the state corresponding to that vector; the network input layer and the discrete layer have the same dimension as the state space vector; the hidden layer of the network has m nodes in total, and the output layer has the same dimension as the behavior space; for an aviation soldier Agent, there are 30 selectable behaviors in the behavior space in each state, and the calculation formula is as follows:
Q(s, a_j) = Σ_{i=1}^{m} w_ij · b̄_i(s)

where Q(s, a_j) is the Q function value for executing the j-th behavior in state s, w_ij is the connection weight between the i-th node of the hidden layer and the j-th node of the output layer, and b̄_i(s) is the normalized output of the i-th node of the hidden layer:

b̄_i(s) = b_i(s) / Σ_{l=1}^{m} b_l(s)

where the radial basis function b_i(s) is calculated as:

b_i(s) = exp( −‖s − d_i‖² / (2σ_i²) )
where d_i is the center of the i-th basis function, with the same dimension as s, σ_i is the width of the i-th basis function, and ‖s − d_i‖ is the Euclidean distance between the input state and the center of the basis function; after the number of hidden-layer nodes is set manually, d_i and σ_i are all determined by the k-means clustering algorithm;
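The GRBF value network of step S6-1 can be sketched as follows: the centers d_i and widths σ_i come from k-means clustering over sampled state vectors, the hidden layer uses normalized Gaussian radial basis functions, and the output layer is a linear map to the 30 behaviors; the use of scikit-learn's KMeans, the width heuristic and all default values are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

class GRBFQNetwork:
    """Sketch of the GRBF value network of step S6-1: normalized Gaussian RBF
    hidden layer, linear output layer with one node per behavior."""

    def __init__(self, state_samples: np.ndarray, n_hidden: int, n_actions: int = 30):
        km = KMeans(n_clusters=n_hidden, n_init=10).fit(state_samples)
        self.centers = km.cluster_centers_            # d_i, one center per hidden node
        # Width sigma_i: mean distance from each center to the other centers (heuristic assumption).
        dists = np.linalg.norm(self.centers[:, None] - self.centers[None, :], axis=-1)
        self.sigmas = dists.mean(axis=1) + 1e-6
        self.w = np.zeros((n_hidden, n_actions))      # hidden-to-output weights w_ij

    def basis(self, s: np.ndarray) -> np.ndarray:
        """Normalized radial basis activations b_i(s) / sum_l b_l(s)."""
        b = np.exp(-np.linalg.norm(s - self.centers, axis=1) ** 2 / (2 * self.sigmas ** 2))
        return b / (b.sum() + 1e-12)

    def q_values(self, s: np.ndarray) -> np.ndarray:
        """Q(s, a_j) = sum_i w_ij * normalized basis_i(s), for every behavior j."""
        return self.basis(s) @ self.w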
S6-2: iterative learning training of the force Agent is carried out based on the framework of step S6-1; the learning process is counted in cycles, and the completion of one round of combat is regarded as the completion of one learning cycle; the decision process of the intelligent aviation soldier combat system is described in steps S6-3 to S6-10;
S6-3: initialize the GRBF neural network of the aviation soldier Agent, set the centers and widths of the GRBF through k-means clustering, set the maximum number of learning cycles K, and let k = 1;
S6-4: start the learning of the k-th iteration cycle and start the confrontation simulation; let t be the current time, with t = 0 and s_t = s_0, where s_0 is the initial state;
S6-5: in the k-th iteration cycle, execute behavior a_t in state s_t following the policy μ; then, on the basis of the immediate reward r_t obtained as in step S4, transition to a new state s_{t+1} and continue by selecting behavior a_{t+1} following the policy μ; calculate the GRBF network output corresponding to s_t, and update the hidden-to-output layer weights with the temporal-difference algorithm according to the following formulas:

δ_t = r_t + Σ_{l=1}^{m} w_{l,id(a_{t+1})}^{k−1} · b̄_l(s_{t+1}) − Σ_{l=1}^{m} w_{l,id(a_t)}^{k−1} · b̄_l(s_t)

w_{i,id(a_t)}^{k} = w_{i,id(a_t)}^{k−1} + α · δ_t · b̄_i(s_t)
where w_{i,id(a_t)}^{k} is the connection weight, obtained by iteration in the k-th learning cycle, between the i-th node of the hidden layer and the id(a_t)-th node of the output layer of the GRBF neural network; w_{i,id(a_t)}^{k−1} is the connection weight between the i-th node of the hidden layer and the id(a_t)-th node of the output layer of the GRBF neural network in the (k−1)-th learning cycle; w_{i,id(a_{t+1})}^{k−1} is the connection weight between the i-th node of the hidden layer and the id(a_{t+1})-th node of the output layer of the GRBF neural network in the (k−1)-th learning cycle; b_i(s_t) is the radial basis function of state s_t described in S6-1; b_i(s_{t+1}) is the radial basis function of state s_{t+1}; id(a_t) is the index of behavior a_t; id(a_{t+1}) is the index of behavior a_{t+1}; and α denotes the learning rate, with value range (0, 1);
S6-6: let t = t + 1 and repeatedly execute step S6-5 until the confrontation simulation reaches a win-or-lose outcome, i.e., the terminal state of the current iteration cycle;
S6-7: let k = k + 1 and repeatedly perform steps S6-4 to S6-6 until k > K.
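Putting steps S6-3 to S6-7 together, the per-step update can be sketched as a SARSA-style temporal-difference rule applied to the GRBF output weights; the environment interface (env.reset / env.step), the undiscounted TD error and the default learning rate are assumptions, and GRBFQNetwork and select_behavior refer to the earlier sketches.

def train(net, env, n_cycles: int, alpha: float = 0.1, tau: float = 0.2):
    """Sketch of the iterative learning of steps S6-3..S6-7 (SARSA-style TD update)."""
    for k in range(1, n_cycles + 1):               # learning cycles k = 1..K (S6-3, S6-7)
        s = env.reset()                            # S6-4: start the confrontation simulation
        a = select_behavior(net.q_values(s), tau)  # behavior chosen following policy mu
        done = False
        while not done:                            # S6-5/S6-6: step until the terminal state
            s_next, r, done = env.step(a)          # immediate reward r_t from step S4
            a_next = select_behavior(net.q_values(s_next), tau)
            td = r + net.q_values(s_next)[a_next] - net.q_values(s)[a]  # TD error (undiscounted)
            net.w[:, a] += alpha * td * net.basis(s)   # update hidden-to-output weights for a_t
            s, a = s_next, a_next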
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. An aviation soldier system intelligent behavior modeling method based on global situation information is characterized by comprising the following steps:
S1: according to the air combat characteristics of the aviation soldiers and the importance of the relevant factors influencing the aviation soldier air combat result, key elements are extracted to construct an environment state space vector that effectively represents the aviation soldier air combat battlefield situation; the own fire network connected domain ratio A_b, the own information network connected domain ratio I_b, the enemy fire network connected domain ratio A_r, the enemy information network connected domain ratio I_r, the remaining weapon/ammunition percentage ξ and the aviation soldier battle loss ratio ε are selected to form the environment state space vector S = <A_b, I_b, A_r, I_r, ξ, ε> describing the aviation soldier battlefield situation;
S2: the environment situation information that cannot be directly acquired in the environment state space vector, namely A_b, I_b, A_r and I_r, is acquired using a situation feature extraction and perception algorithm based on a two-dimensional GIS situation map; on the basis of the two-dimensional GIS situation map with graphic features, i.e., the own information detection range, the own fire striking range and the positions of detected enemy force entities, obvious color distinction and image feature extraction are carried out to obtain the monochromatic feature layers of the own and enemy information network connected domains and fire network connected domains, realizing the perception of situation environment information by reinforcement learning;
in the image feature extraction part, a color-based feature extraction method is adopted to extract the image color features contained in the two-dimensional GIS situation map, and the color values of non-feature pixels are removed to obtain the information network connected domain and the fire network connected domain composed of the own aviation soldier system combat entities, serving as two monochromatic feature layers reflecting the combat situation of the own side, i.e., the intelligent blue party; the enemy combat entities are located by the feature extraction method, the corresponding entity information is retrieved from the own weapon equipment rule base, the information network connected domain and the fire network connected domain formed by the enemy, i.e., the intelligent red party, combat entities are simulated, and two monochromatic feature layers reflecting the enemy combat situation are generated;
s3: designing a combat behavior space of an army system of an aviation soldier; dividing an executable task set of the airplane formation according to the combat characteristics of the armed force formation in the aviation soldier system; the executable tasks of the airplane formation in the aviation soldier system are integrated to form a combat action space of the aviation soldier system;
s4: generating an effective real-time reward mechanism by adopting a reward value generation algorithm based on network connected domain maximization;
color histogram statistics are performed on the feature layers obtained in step S2, and the proportion of colored pixels of the information network connected domain and the fire network connected domain in each monochromatic feature layer is calculated to obtain numerical parameters representing the own and enemy situation characteristics; the quantified combat advantage parameters of the two sides are obtained through weighted integration, and a reward function based on the comparison of combat advantages is designed, so as to provide clear positive or negative real-time reward feedback for each behavior decision of the agent and drive the agent to make continuously optimized behavior decisions through the reward mechanism;
s5: constructing a state transition model and designing an action selection strategy;
the transition of the aviation soldier air combat situation conforms to a first-order Markov decision process, i.e., the state transition probability depends only on the current state; the behavior selection strategy is designed with a greedy-random algorithm, which selects the behavior with the maximum utility in a given state while adding randomness to the behavior selection, allowing the aviation soldier formation to 'explore' the state space;
S6: based on the temporal-difference algorithm, the aviation soldier state space vector, the state transition model and the behavior selection strategy algorithm formed in steps S1-S5 are fused with the reward value generation algorithm based on network connected domain maximization to form an improved reinforcement learning framework for aviation soldier combat confrontation, and iterative learning training of the force Agent is carried out based on this framework.
2. The aviation soldier system intelligent behavior modeling method based on global situation information as claimed in claim 1, wherein in step S2, the situation feature extraction and perception algorithm flow based on the two-dimensional GIS situation map is as follows:
S2-1: image feature extraction; a frame of the m×n two-dimensional GIS situation map based on the RGB color space is abstracted into a matrix

C = [c_ij]_{m×n}

each element of which is the three-dimensional vector of RGB color values of the pixel at the corresponding position, as shown in the following formula:

c_ij = [r, g, b]_ij

where c_ij ∈ [RGB], 0 ≤ i < m, 0 ≤ j < n, is the three-dimensional vector of RGB color values of the pixel at the corresponding position; r represents the red component of the color, g represents the green component and b represents the blue component, and the value ranges of the three elements are all 0-255;
S2-2: let the color value of the information perception domain of the own combat entity be c_b^I, the color value of the fire striking domain of the own combat entity be c_b^A and the color value of the enemy combat entity be c_r; execute steps S2-3 to S2-8 on each frame of the two-dimensional GIS situation map;
S2-3: copy the two-dimensional GIS situation map; on copy layer I, judge pixel by pixel whether the current pixel belongs to the own information perception area; if so, keep its color value, and if not, assign its color value to 0, as in the following formula:

c_ij^(I) = c_ij, if the pixel belongs to the own information perception area (c_ij = c_b^I);  c_ij^(I) = 0, otherwise;
S2-4: copy the two-dimensional GIS situation map; on copy layer II, judge pixel by pixel whether the current pixel belongs to the own fire striking area; if so, keep its color value, and if not, assign its color value to 0, as in the following formula:

c_ij^(II) = c_ij, if the pixel belongs to the own fire striking area (c_ij = c_b^A);  c_ij^(II) = 0, otherwise;
S2-5: copy the two-dimensional GIS situation map; on copy layer III, judge pixel by pixel whether the current pixel belongs to an enemy combat entity; if so, keep its color value, and if not, assign its color value to 0, as in the following formula:

c_ij^(III) = c_ij, if the pixel belongs to an enemy combat entity (c_ij = c_r);  c_ij^(III) = 0, otherwise;
S2-6: let the enemy combat entities be e_1, e_2, …, e_p; retrieve from the weapon equipment rule base the corresponding information perception ranges r_1^I, r_2^I, …, r_p^I and the corresponding fire striking ranges r_1^A, r_2^A, …, r_p^A; execute steps S2-7 and S2-8 on the layer obtained after the processing of step S2-6;
S2-7: copy the layer obtained after the processing of step S2-6; on copy layer IV, judge pixel by pixel whether the current pixel belongs to an enemy combat entity; if so, assign the color value c_r^I to all pixels within the circle centered on the current pixel whose radius is the information perception range of the corresponding combat entity; if not, keep the color value of the current pixel; that is, for each enemy combat entity e_k detected at pixel (i, j) (c_ij = c_r), the color of every pixel within distance r_k^I of (i, j) is set to c_r^I, where c_r is the color value of the enemy combat entity and c_r^I is the color value corresponding to the red information perception range;
S2-8: copy the layer obtained after the processing of step S2-6; on copy layer V, judge pixel by pixel whether the current pixel belongs to an enemy combat entity; if so, assign the color value c_r^A to all pixels within the circle centered on the current pixel whose radius is the fire striking range of the corresponding combat entity; if not, keep the color value of the current pixel; that is, for each enemy combat entity e_k detected at pixel (i, j) (c_ij = c_r), the color of every pixel within distance r_k^A of (i, j) is set to c_r^A, where c_r is the color value of the enemy combat entity and c_r^A is the color value corresponding to the red fire striking range;
therefore, the intelligent aviation soldier system obtains four feature layers respectively reflecting own party and enemy information network connected domains and fire network connected domains from the two-dimensional GIS situation map, namely, the layer I, the layer II, the layer IV and the layer V, and completes situation feature extraction and perception.
3. The method for modeling the intelligent behavior of the aviation soldier system based on the global situation information as claimed in claim 1, wherein the aviation soldier forces comprise fighter formation, bomber formation, early warning machine formation, unmanned reconnaissance machine formation and electronic interference machine formation, and the operation behavior space of the intelligent aviation soldier force system is as follows:
S3-1: the combat aircraft comprise fighters and bombers; according to the combat characteristics of the fighter, the executable tasks of the fighter formation comprise: area patrol J1, take-off area patrol J2, route patrol J3, take-off route patrol J4, protection J5, take-off protection J6, air interception J7 and return J8;
S3-2: according to the combat characteristics of the bomber, the executable tasks of the bomber formation comprise: area patrol H1, take-off area patrol H2, route patrol H3, take-off route patrol H4, area assault H5, take-off area assault H6, target assault H7, take-off target assault H8 and return H9;
S3-3: according to the combat characteristics of the early warning aircraft, its executable tasks comprise: area patrol detection Y1, route patrol detection Y2, setting the early warning aircraft detection mode Y3, early warning aircraft radar switch-on/switch-off Y4 and cancellation of the detection task Y5;
S3-4: according to the combat characteristics of the electronic jammer, its executable tasks comprise: area interference R1, route interference R2, setting an interference pattern R3, switching off interference R4 and ending interference R5;
S3-5: according to the combat characteristics of the unmanned reconnaissance aircraft, its executable tasks comprise: area patrol reconnaissance W1, route patrol reconnaissance W2 and cancellation of the reconnaissance task W3;
S3-6: collating the executable tasks of the different aviation soldier formations described in steps S3-1 to S3-5 yields the combat action space of the decision-making action model of the whole intelligent aviation soldier force system: A = {J1, …, J8, H1, …, H9, Y1, …, Y5, R1, …, R5, W1, …, W3}.
4. The aviation soldier system intelligent behavior modeling method based on global situation information as claimed in claim 1, wherein in step S4, the reward value generation algorithm flow based on network connected domain maximization is as follows:
S4-1: perform color histogram pixel-proportion statistics on the feature layers obtained in step S2; execute steps S4-2 to S4-4 on each of the four m×n feature layers representing the own information network connected domain, the own fire network connected domain, the enemy information network connected domain and the enemy fire network connected domain;
S4-2: color quantization; let the color interval of the layer be range, where range contains the color value c_b^I of the information perception domain of the own combat entity, the color value c_b^A of the fire striking domain of the own combat entity and the color value c_r of the enemy combat entity, i.e., the following condition is satisfied:

{c_b^I, c_b^A, c_r} ⊆ range
Divide range into N color intervals bin_i = [c_i1, c_i2], each of which is called a bin of the color histogram, as follows:

range = bin_1 ∪ bin_2 ∪ … ∪ bin_N

where c_i1 is the lower bound of the color interval bin_i and c_i2 is the upper bound of the color interval bin_i;
S4-3: perform color detection pixel by pixel and count the number of pixels whose color falls in each interval to obtain the color histogram, expressed as:

H = {h_1, h_2, …, h_N},  h_i = (1/(m·n)) · Σ_{p=0}^{m−1} Σ_{q=0}^{n−1} δ(c_pq − c_i)

where h_i denotes the proportion of pixels whose color falls in the interval bin_i = [c_i1, c_i2]; c_pq is the color value of the pixel at position (p, q); c_i is the center color value of the color sub-interval bin_i, c_i = 0.5 × (c_i1 + c_i2); and δ(c_pq − c_i) is a color decision function of the specific form:

δ(c_pq − c_i) = 1, if c_pq ∈ bin_i;  δ(c_pq − c_i) = 0, otherwise;
S4-4: count the total pixel proportion of non-zero color values:

h_T = Σ_{i: c_i ≠ 0} h_i

where h_T denotes the total pixel proportion of non-zero color values;
S4-5: executing steps S4-2 to S4-4 four times yields the total non-zero-color pixel proportions h_T(1), h_T(2), h_T(3), h_T(4) of the four monochromatic feature layers, corresponding respectively to the own information network situation characteristic parameter I_b = h_T(1), the own fire network situation characteristic parameter A_b = h_T(2), the enemy information network situation characteristic parameter I_r = h_T(3) and the enemy fire network situation characteristic parameter A_r = h_T(4);
S4-6: based on the result of step S4-5, the quantified combat advantage parameters of the two sides are obtained by weighted integration, with P_b denoting the combat advantage of the own side in the system confrontation and P_r denoting the combat advantage of the enemy side in the system confrontation; the combat advantages of the two sides are:
P_b = ω_1·I_b + ω_2·A_b
P_r = ω_1·I_r + ω_2·A_r
where ω_1 represents the weight of the information network advantage in the comprehensive combat advantage and ω_2 represents the weight of the fire network advantage in the comprehensive combat advantage; the weight values are adjusted within (0, 1) and satisfy ω_1 + ω_2 = 1;
S4-7: design a formalized reward function based on the comparison of the combat advantages of the two sides; the core idea of the proposed reward mechanism is: a single behavior decision made under the current situation is evaluated against the comprehensive combat advantages of the two sides formed after interaction with the battlefield environment, yielding a reward value based on the current situation and decision; specifically, if the decision gives the agent a comprehensive combat advantage over the enemy, the reward is positive, and the greater the advantage, the greater the absolute value of the reward; if the decision leaves the agent at a comprehensive combat disadvantage relative to the enemy, the reward is negative, and the greater the disadvantage, the greater the absolute value of the reward; meanwhile, the reward parameter needs to be normalized;
the reward function is expressed as follows: the proportion of one agent's combat advantage relative to the total combat advantage of the two agents serves as the main reward value, combined with a very small value δ to introduce positive and negative numerical characteristics, as shown in the following formula:
R = (P_b − P_r) / (P_b + P_r + δ)
where R is the reward value based on the current situation and decision; δ is a very small value in the range (10^−4, 10^−3), whose significance is to avoid division by zero while giving the normalized reward value positive and negative numerical characteristics.
5. The aviation soldier system intelligent behavior modeling method based on global situation information as claimed in claim 1, wherein the specific process of step S5 is as follows:
S5-1: the transition of the combat situation is described probabilistically; the transition probability between states is

T(s, a, s′) = P(s′ | s, a)

meaning the probability of reaching state s′ after executing behavior a in state s; all the transition probabilities form a matrix, called the environment transition matrix and denoted T;
s5-2: after the own party selects the behavior a, the change of the fighting situation is completely expressed by the state transition matrix, and the air combat process conforms to the first-order Markov decision process, namely the transition probability is only related to the current state;
S5-3: combining the probabilities in the state transition model, in each state s a behavior a is selected with a certain probability by following the policy π, forming a 'state-behavior' pair (s, a); the value of the 'state-behavior' pair is given by the Q function and denoted Q^π(s, a);
S5-4: in behavior selection, a random-selection component is added on top of the greedy strategy to form the behavior selection strategy μ, so that in each state a behavior is chosen from the behavior space and the system transfers to the next state with a certain probability; the construction of the behavior selection strategy μ first sets an exploration constant τ ∈ (0, 1); each time a behavior is selected, a random number ρ in the interval [0, 1] is generated, and:

a_t = a behavior drawn at random from the behavior space A, if ρ ≤ τ;
a_t = argmax_{a∈A} Q(s_t, a), if ρ > τ;

taking τ = 0.2, there is a 20% probability of freely choosing a behavior.
6. The aviation soldier system intelligent behavior modeling method based on global situation information as claimed in claim 1, wherein the specific process of step S6 is as follows:
S6-1: the situation perception information obtained in step S2, the remaining weapon/ammunition percentage obtained from the simulation platform and the aviation soldier formation battle loss ratio are used to form the state space vector, with s denoting the specific state space vector at a given moment; a GRBF neural network consisting of an input layer, a discrete layer, a hidden layer and an output layer is built to discretize the Q function values of 'state-behavior' pairs, so as to partition the continuous state space and obtain the 'state-behavior' pair values corresponding to discrete states; the network input is the state space vector, and the output is the set of values of all 'state-behavior' pairs obtained by selecting different behaviors in the state corresponding to that vector; the network input layer and the discrete layer have the same dimension as the state space vector; the hidden layer of the network has m nodes in total, and the output layer has the same dimension as the behavior space; for an aviation soldier Agent, there are 30 selectable behaviors in the behavior space in each state, and the calculation formula is as follows:
Q(s, a_j) = Σ_{i=1}^{m} w_ij · b̄_i(s)

where Q(s, a_j) is the Q function value for executing the j-th behavior in state s, w_ij is the connection weight between the i-th node of the hidden layer and the j-th node of the output layer, and b̄_i(s) is the normalized output of the i-th node of the hidden layer:

b̄_i(s) = b_i(s) / Σ_{l=1}^{m} b_l(s)

where the radial basis function b_i(s) is calculated as:

b_i(s) = exp( −‖s − d_i‖² / (2σ_i²) )
where d_i is the center of the i-th basis function, with the same dimension as s, σ_i is the width of the i-th basis function, and ‖s − d_i‖ is the Euclidean distance between the input state and the center of the basis function; after the number of hidden-layer nodes is set manually, d_i and σ_i are all determined by the k-means clustering algorithm;
S6-2: iterative learning training of the force Agent is carried out based on the framework of step S6-1; the learning process is counted in cycles, and the completion of one round of combat is regarded as the completion of one learning cycle; the decision process of the intelligent aviation soldier combat system is described in steps S6-3 to S6-10;
S6-3: initialize the GRBF neural network of the aviation soldier Agent, set the centers and widths of the GRBF through k-means clustering, set the maximum number of learning cycles K, and let k = 1;
S6-4: start the learning of the k-th iteration cycle and start the confrontation simulation; let t be the current time, with t = 0 and s_t = s_0, where s_0 is the initial state;
S6-5: in the k-th iteration cycle, execute behavior a_t in state s_t following the policy μ; then, on the basis of the immediate reward r_t obtained as in step S4, transition to a new state s_{t+1} and continue by selecting behavior a_{t+1} following the policy μ; calculate the GRBF network output corresponding to s_t, and update the hidden-to-output layer weights with the temporal-difference algorithm according to the following formulas:

δ_t = r_t + Σ_{l=1}^{m} w_{l,id(a_{t+1})}^{k−1} · b̄_l(s_{t+1}) − Σ_{l=1}^{m} w_{l,id(a_t)}^{k−1} · b̄_l(s_t)

w_{i,id(a_t)}^{k} = w_{i,id(a_t)}^{k−1} + α · δ_t · b̄_i(s_t)
where w_{i,id(a_t)}^{k} is the connection weight, obtained by iteration in the k-th learning cycle, between the i-th node of the hidden layer and the id(a_t)-th node of the output layer of the GRBF neural network; w_{i,id(a_t)}^{k−1} is the connection weight between the i-th node of the hidden layer and the id(a_t)-th node of the output layer of the GRBF neural network in the (k−1)-th learning cycle; w_{i,id(a_{t+1})}^{k−1} is the connection weight between the i-th node of the hidden layer and the id(a_{t+1})-th node of the output layer of the GRBF neural network in the (k−1)-th learning cycle; b_i(s_t) is the radial basis function of state s_t described in S6-1; b_i(s_{t+1}) is the radial basis function of state s_{t+1}; id(a_t) is the index of behavior a_t; id(a_{t+1}) is the index of behavior a_{t+1}; and α denotes the learning rate, with value range (0, 1);
S6-6: let t = t + 1 and repeatedly execute step S6-5 until the confrontation simulation reaches a win-or-lose outcome, i.e., the terminal state of the current iteration cycle;
S6-7: let k = k + 1 and repeatedly perform steps S6-4 to S6-6 until k > K.
CN202011375776.3A 2020-11-30 2020-11-30 Aviation soldier system intelligent behavior modeling method based on global situation information Active CN112560332B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011375776.3A CN112560332B (en) 2020-11-30 2020-11-30 Aviation soldier system intelligent behavior modeling method based on global situation information


Publications (2)

Publication Number Publication Date
CN112560332A true CN112560332A (en) 2021-03-26
CN112560332B CN112560332B (en) 2022-08-02

Family

ID=75045501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011375776.3A Active CN112560332B (en) 2020-11-30 2020-11-30 Aviation soldier system intelligent behavior modeling method based on global situation information

Country Status (1)

Country Link
CN (1) CN112560332B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8879426B1 (en) * 2009-09-03 2014-11-04 Lockheed Martin Corporation Opportunistic connectivity edge detection
CN101964019A (en) * 2010-09-10 2011-02-02 北京航空航天大学 Against behavior modeling simulation platform and method based on Agent technology
CN108021754A (en) * 2017-12-06 2018-05-11 北京航空航天大学 A kind of unmanned plane Autonomous Air Combat Decision frame and method
CN111488992A (en) * 2020-03-03 2020-08-04 中国电子科技集团公司第五十二研究所 Simulator adversary reinforcing device based on artificial intelligence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEIREN KONG et al.: "UAV autonomous aerial combat maneuver strategy generation with observation error based on state-adversarial deep deterministic policy gradient and inverse reinforcement learning", Electronics *
YAN Xu et al.: "Agent-based simulation evaluation method for combat unit task sustainability" (基于agent的作战单元任务持续性仿真评估方法), Journal of System Simulation (系统仿真学报) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990452A (en) * 2021-05-06 2021-06-18 中国科学院自动化研究所 Man-machine confrontation knowledge driving type decision-making method and device and electronic equipment
CN113268865A (en) * 2021-05-12 2021-08-17 中国人民解放军军事科学院评估论证研究中心 Aircraft behavior modeling construction method based on regular flow chain
CN113268865B (en) * 2021-05-12 2022-02-22 中国人民解放军军事科学院评估论证研究中心 Aircraft behavior modeling construction method based on regular flow chain
CN113283110A (en) * 2021-06-11 2021-08-20 中国人民解放军国防科技大学 Situation perception method for intelligent confrontation simulation deduction
CN113283110B (en) * 2021-06-11 2022-05-27 中国人民解放军国防科技大学 Situation perception method for intelligent confrontation simulation deduction
CN113505538A (en) * 2021-07-28 2021-10-15 哈尔滨工业大学 Unmanned aerial vehicle autonomous combat system based on computer generated force
CN113505538B (en) * 2021-07-28 2022-04-12 哈尔滨工业大学 Unmanned aerial vehicle autonomous combat system based on computer generated force
CN114330093A (en) * 2021-10-26 2022-04-12 北京航空航天大学 Multi-platform collaborative intelligent confrontation decision-making method for aviation soldiers based on DQN
CN115909027A (en) * 2022-11-14 2023-04-04 中国人民解放军32370部队 Situation estimation method and device
CN115909027B (en) * 2022-11-14 2023-06-09 中国人民解放军32370部队 Situation estimation method and device
CN117852319A (en) * 2024-03-07 2024-04-09 中国人民解放军国防科技大学 Space target visibility judging method for space foundation situation awareness system
CN117852319B (en) * 2024-03-07 2024-05-17 中国人民解放军国防科技大学 Space target visibility judging method for space foundation situation awareness system

Also Published As

Publication number Publication date
CN112560332B (en) 2022-08-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant