CN105844068A - Distribution method oriented to simulation Q learning attack targets
- Publication number: CN105844068A (application CN201610427869.3A)
- Authority: CN (China)
- Prior art keywords: red, state, aircraft, learning, action
- Legal status: Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/30—Circuit design
- G06F30/36—Circuit design at the analogue level
- G06F30/367—Design verification, e.g. using simulation, simulation program with integrated circuit emphasis [SPICE], direct methods or relaxation methods
Abstract
The invention discloses a simulation-oriented Q-learning attack target assignment method. The method comprises the following steps: (1) determine the initial state and acquire the air combat situation information of both the red side and the blue side, including the number of aircraft in each formation and the relevant parameters of the formation aircraft, providing input for the red side's target assignment and for the air combat model calculation; (2) determine the action set that the red formation can perform, strictly specify the complete set of state-action pairs, determine a suitable probability value ε, and select the red side's actions with an ε-greedy strategy; (3) specify the reward function, terminal state and convergence condition of the Q-learning algorithm, and iterate the red side's attack target assignment with the Q-learning algorithm until the convergence condition is met. The method removes the dependence on prior knowledge; the introduction of the ε-greedy strategy avoids the local optimum trap; and by setting the parameter ε, a balance can be sought between algorithm efficiency and the local optimum problem.
Description
Technical field
The present invention relates to the technical field of war simulation, and in particular to a simulation-oriented Q-learning attack target assignment method.
Background art
In the field of air combat simulation, traditional target assignment methods search, over two pre-given sets of elements (tasks and execution units), for a pairing scheme whose matching produces the maximum value (or the minimum cost); however, because they rely on gradient functions, they easily fall into the local optimum trap. The ant colony algorithm has good optimization ability for target assignment, but its computation takes a long time and it is inefficient, especially for complex systems. Compared with the ant colony algorithm, the particle swarm algorithm offers fast search speed and a simple structure for target assignment, but it is still prone to local optima when handling discrete problems. The genetic algorithm, when solving such problems, makes good use of features such as self-organization, adaptivity, parallelism and uncertainty, but fails to overcome its poor local search ability, so convergence is slow, search efficiency suffers greatly, and premature convergence easily occurs. For this reason, an attack target assignment method that avoids falling into local optima and is more efficient needs to be proposed for air combat simulation.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes a simulation-oriented Q-learning attack target assignment method for obtaining optimized results of red fighter formation target assignment in air combat simulation.
The technical scheme of the present invention is as follows:
A simulation-oriented Q-learning attack target assignment method comprises the following steps:
(1) determining the initial state, and acquiring the air combat situation information of both the red and blue sides, the information including the number of aircraft in the formations of both sides and the relevant parameters of the formation aircraft, providing input for the red side's target assignment and for the air combat model calculation;
(2) determining the action set that the red formation can perform, and strictly specifying the complete set of state-action pairs; determining a suitable probability value ε and selecting the red side's actions using an ε-greedy strategy;
(3) specifying the reward function, terminal state and convergence condition of the Q-learning algorithm, and applying the Q-learning algorithm to iterate the red side's attack target assignment until the convergence condition is met.
In step (1) of the present invention, the formation aircraft relevant parameters used for the air combat model calculation include the number of aircraft on each side, the command factor, aircraft vulnerability, multi-target attack capability, sortie ratio, allowed engagement ratio, the radar cross section of the combat aircraft, the radar cross section reduction coefficient of the combat aircraft, the maximum detection range of the airborne radar, the maximum effective range of the airborne air-to-air missile, the maximum detection range and detection probability against targets, the effective target detection probability of the early warning aircraft, the number of missiles, and the missile score. The initial state for the red side's target assignment learning is the number of aircraft in each formation of both sides before the engagement. The target assignment learning also needs to determine the reward value based on the air combat model calculation results, as well as the relevant parameters of the Q-learning algorithm, such as the discount factor and the learning step size.
In the present invention, the actions that the red formation can perform are the different target assignment schemes, represented by Arabic numerals; the aircraft types and numbers in the formations of both sides participating in the air combat form the perceived state, represented by a matrix, as sketched below.
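As an illustration of this representation, a minimal Python sketch follows (the concrete counts, the 2×2 layout and the number of assignment schemes are assumptions of this sketch, not part of the original disclosure; Python is used throughout for these sketches):

```python
# Perceived state: aircraft types and counts of both formations,
# represented as a matrix (here a nested tuple so it can key a Q-table).
# Rows: red, blue; columns: the two aircraft types of each side.
initial_state = (
    (1, 4),  # red: 1 early warning aircraft, 4 patrol aircraft (assumed counts)
    (2, 4),  # blue: 2 jammers, 4 assault aircraft (assumed counts)
)

# Actions: the different target assignment schemes, numbered with
# Arabic numerals; 0, 1, 2, ... index the enumerated schemes.
actions = list(range(8))  # e.g. 8 candidate assignment schemes (assumed)
```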
Red action selection adopts the ε-greedy strategy: with the larger probability (1−ε), the action that maximizes the current Q function is chosen, while with probability ε other actions are tried.
In step (3) of the present invention, the terminal state is that the number of aircraft in each formation of one of the two combatant sides is 0; the convergence condition is that the variation of the discounted reward sum of every state-action pair is smaller than a given threshold. When the terminal state is reached, the algorithm ends one iteration cycle and starts a new iteration cycle from the initial state, until learning ends.
The present invention is general-purpose, and performing attack target assignment based on the Q-learning algorithm has the following advantages:
(1) it removes the dependence on prior knowledge;
(2) the introduction of the ε-greedy strategy avoids falling into the local optimum trap;
(3) by setting the parameter ε, a balance can be sought between algorithm efficiency and the local optimum problem.
Brief description of the drawings
Fig. 1 is a schematic diagram of attack target assignment.
Fig. 2 is the flow chart of the present invention.
Detailed description of the invention
The present invention is described in detail below in conjunction with the accompanying drawings.
Attack target assignment in air combat refers to formulating the combat intent according to the combat mission and the surveillance of the theater air situation, and scheduling and controlling the aircraft resources of a whole group, in units of fighter formations, to carry out the corresponding formation assignment; the concept is shown in Fig. 1. The present invention proposes a simulation-oriented Q-learning attack target assignment method for obtaining optimized results of red fighter formation target assignment in air combat simulation.
The steps of the present invention include:
(1) Determine the initial state, and acquire the air combat situation information of both sides, including the number of aircraft in the formations of both sides and the relevant parameters of the formation aircraft, providing input for the red side's target assignment and for the air combat model calculation.
(2) Determine the action set that the red formation can perform, and strictly specify the complete set of state-action pairs; determine a suitable ε value and select the red side's actions using an ε-greedy strategy.
(3) Specify the reward function, terminal state and convergence condition of the Q-learning algorithm, and apply the Q-learning algorithm to iterate the red side's attack target assignment until the convergence condition is met.
Referring to Fig. 2, which is the flow chart of the simulation-oriented Q-learning attack target assignment method of the present invention: Q(s_t, a_t) denotes the discounted reward sum obtained by the red side taking action a_t in state s_t; γ ∈ [0, 1] is the discount factor, which balances the importance of the immediate return against the long-term return; α is the learning step size, which controls the learning efficiency of the algorithm; r is the reward function, with r_{t+1} denoting the immediate return of selecting action a_t in state s_t; A is the set of all feasible actions; ε is the greedy strategy parameter, introduced to avoid the local optimum trap; max_{a∈A} Q(s_{t+1}, a) denotes the maximum reward over all actions in state s_{t+1}, reflecting the effect of the long-term return.
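The flow of Fig. 2 can be summarized by the following minimal tabular Q-learning sketch; the callables `step` (the air combat model calculation), `reward` (the reward function of step (3)) and `is_terminal` (the terminal-state test) are assumed placeholders rather than the patent's actual model:

```python
import random
from collections import defaultdict

def q_learning(initial_state, actions, step, reward, is_terminal,
               gamma=0.9, alpha=0.1, epsilon=0.1, episodes=1000):
    """Tabular Q-learning following the flow of Fig. 2.

    States must be hashable (e.g. the nested tuples sketched above).
    `step`, `reward` and `is_terminal` stand in for the air combat
    model and the rules of step (3); they are assumptions of this sketch.
    """
    Q = defaultdict(float)  # Q(s_t, a_t), zero-initialized
    for _ in range(episodes):  # one episode: initial -> terminal state
        s = initial_state
        while not is_terminal(s):
            # epsilon-greedy selection: explore with probability epsilon
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next = step(s, a)        # air combat model calculation
            r = reward(s, a, s_next)   # immediate return r_{t+1}
            # update: Q <- Q + alpha * (r + gamma * max_a' Q(s', a') - Q)
            best_next = max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```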
A specific embodiment is described below, taking as background the target assignment under the command of the red early warning aircraft after red patrol aircraft encounter blue assault aircraft and jammers; the implementation process of the present invention is as follows:
(1) Determine the initial state, in the form of Table 1 below, where the red early warning aircraft has learning ability, does not directly participate in combat, and is assumed not to be shot down. In addition, the action reward values of the red side are all 0 in the initial state, and the aforementioned relevant parameters of the air combat formations of both sides used for the air combat model calculation are stored in table form.
Table 1 State representation form
At the same time, determine the discount factor and the learning step size of the Q-learning algorithm: set the discount factor γ = 0.9 and the learning step size α = 0.1.
(2) Determine the action set that the red formation can perform, and strictly specify the complete set of state-action pairs, in the form shown in Table 2.
Table 2 State-action pair representation form
At the same time, introduce the ε-greedy strategy and derive a suitable ε value from the simulation data. In the general case, ε = 0.1 meets the demand; its meaning is that when the red side selects an action, with probability 0.9 it chooses the action that maximizes the current Q function, and with probability 0.1 it tries other actions.
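A minimal sketch of this selection rule, under the same assumptions as the sketches above:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability 1 - epsilon (0.9 here) choose the action that
    maximizes the current Q function; with probability epsilon (0.1)
    try one of the other actions instead."""
    greedy = max(actions, key=lambda a: Q[(state, a)])
    if random.random() < epsilon and len(actions) > 1:
        return random.choice([a for a in actions if a != greedy])
    return greedy
```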
(3) The red side's reward function is defined as follows: after the air combat model computes the air combat result, if a blue assault aircraft or jammer is judged to be shot down, the red side obtains a reward of +1; if a red patrol aircraft is judged to be shot down, the red side obtains a return of -1; when the red side chooses self-defensive evasion, i.e. temporarily abandons control of the air, the red side is given a punishment of -10.
After the air combat model computes the air combat result and the above reward function is evaluated, the immediate return of the action executed by the red side is obtained, as sketched below.
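Under the state encoding assumed in the earlier sketch, the reward rule could be written as follows (the tuple layout and the `red_retreated` flag are assumptions of this sketch):

```python
def red_reward(prev, curr, red_retreated=False):
    """Reward rule of step (3): +1 per blue jammer or assault aircraft
    shot down, -1 per red patrol aircraft shot down, -10 if red chooses
    self-defensive evasion and abandons control of the air.

    `prev` and `curr` are ((red_awacs, red_patrol), (blue_jammer,
    blue_assault)) count matrices before and after one air combat
    model calculation; this encoding is an assumption."""
    if red_retreated:
        return -10
    blue_losses = (prev[1][0] - curr[1][0]) + (prev[1][1] - curr[1][1])
    red_losses = prev[0][1] - curr[0][1]  # red patrol aircraft only
    return blue_losses - red_losses
```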
The update equation of the red side's discounted reward value for each state-action pair is as follows:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t) ]
where r_{t+1} is the immediate return of the action executed by the red side, and max_{a∈A} Q(s_{t+1}, a) denotes the maximum return over all red actions in state s_{t+1}.
The terminal state is defined as the number of aircraft of one of the two engaged sides being 0; each process that goes from an arbitrary initial state to the terminal state is called one episode (also called a scene).
Each time the terminal state is reached, judge whether the red side's matrix of discounted reward values for state-action pairs has converged. If it has not converged, return to the initial state and iterate again, until the red side's matrix of discounted reward values converges.
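A minimal sketch of this convergence test (the threshold value is an assumption; the patent only requires it to be a given threshold):

```python
def has_converged(Q_prev, Q_curr, threshold=1e-3):
    """True when the change of the discounted reward value of every
    state-action pair is smaller than the given threshold."""
    keys = set(Q_prev) | set(Q_curr)
    return all(abs(Q_curr.get(k, 0.0) - Q_prev.get(k, 0.0)) < threshold
               for k in keys)
```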
Performing data analysis on the nonzero entries of the red side's matrix of discounted reward values for state-action pairs, the red side's optimal strategy in a given state is known to be the action combination whose discounted reward value is maximal in that state.
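Extracting the optimal strategy from the learned value matrix can then be sketched as follows (same assumptions as above; `Q` is the defaultdict returned by the earlier `q_learning` sketch):

```python
def optimal_policy(Q, actions):
    """For each state seen during learning, the red side's optimal
    strategy is the action with the maximum discounted reward value."""
    states = {s for (s, _) in Q}
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```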
Claims (6)
1. A simulation-oriented Q-learning attack target assignment method, characterized by comprising the following steps:
(1) determining the initial state, and acquiring the air combat situation information of both the red and blue sides, the information including the number of aircraft in the formations of both sides and the relevant parameters of the formation aircraft, providing input for the red side's target assignment and for the air combat model calculation;
(2) determining the action set that the red formation can perform, and strictly specifying the complete set of state-action pairs; determining a suitable probability value ε and selecting the red side's actions using an ε-greedy strategy;
(3) specifying the reward function, terminal state and convergence condition of the Q-learning algorithm, and applying the Q-learning algorithm to iterate the red side's attack target assignment until the convergence condition is met.
2. The simulation-oriented Q-learning attack target assignment method according to claim 1, characterized in that in step (1), the formation aircraft relevant parameters include the number of aircraft on each side, the command factor, aircraft vulnerability, multi-target attack capability, sortie ratio, allowed engagement ratio, the radar cross section of the combat aircraft, the radar cross section reduction coefficient of the combat aircraft, the maximum detection range of the airborne radar, the maximum effective range of the airborne air-to-air missile, the maximum detection range and detection probability against targets, the effective target detection probability of the early warning aircraft, the number of missiles, and the missile score.
3. The simulation-oriented Q-learning attack target assignment method according to claim 1, characterized in that in step (3), the terminal state is that the number of aircraft in each formation of one of the two combatant sides is 0; the convergence condition is that the variation of the discounted reward sum of every state-action pair is smaller than a given threshold; when the terminal state is reached, the algorithm ends one iteration cycle and starts a new iteration cycle from the initial state, until learning ends.
4. The simulation-oriented Q-learning attack target assignment method according to claim 1, characterized in that:
in step (1), the initial state is determined, wherein the red side has s11 early warning aircraft and s12 patrol aircraft, and the blue side has s21 jammers and s22 assault aircraft; the red early warning aircraft has learning ability, does not directly participate in combat, and is assumed not to be shot down;
in step (2), the action set that the red formation can perform is determined, and the complete set of state-action pairs is strictly specified, in the form shown in Table 2:
Table 2 State-action pair representation form
ε = 0.1 is chosen, i.e. when the red side selects an action, with probability 0.9 it chooses the action that maximizes the current Q function, and with probability 0.1 it tries other actions;
in step (3), the red side's reward function is defined as follows: after the air combat model computes the air combat result, if a blue assault aircraft or jammer is judged to be shot down, the red side obtains a reward of +1; if a red patrol aircraft is judged to be shot down, the red side obtains a return of -1; when the red side chooses self-defensive evasion, i.e. abandons control of the air, the red side is given a punishment of -10;
the update equation of the red side's discounted reward value for each state-action pair is as follows:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t) ]
where r_{t+1} is the immediate return of the action executed by the red side, and max_{a∈A} Q(s_{t+1}, a) denotes the maximum return over all red actions in state s_{t+1};
the terminal state is defined as the number of aircraft of one of the two engaged sides being 0, and each process that goes from an arbitrary initial state to the terminal state is called one episode;
each time the terminal state is reached, whether the red side's matrix of discounted reward values for state-action pairs has converged is judged; if it has not converged, the process returns to the initial state and iterates again, until the red side's matrix of discounted reward values converges.
5. The simulation-oriented Q-learning attack target assignment method according to claim 4, characterized in that the discount factor and the learning step size of the Q-learning algorithm are determined by setting the discount factor γ = 0.9 and the learning step size α = 0.1.
6. The simulation-oriented Q-learning attack target assignment method according to claim 4, characterized in that data analysis is performed on the nonzero entries of the red side's matrix of discounted reward values for state-action pairs, whereby the red side's optimal strategy in a given state is the action combination whose discounted reward value is maximal in that state.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201610427869.3A | 2016-06-16 | 2016-06-16 | A kind of Q learning attack target assignment methods of Simulation-Oriented
Publications (2)

Publication Number | Publication Date
---|---
CN105844068A | 2016-08-10
CN105844068B | 2018-11-13
Family
ID=56576161

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201610427869.3A (granted as CN105844068B, active) | A kind of Q learning attack target assignment methods of Simulation-Oriented | 2016-06-16 | 2016-06-16

Country Status (1)

Country | Link
---|---
CN | CN105844068B (en)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105302153A (en) * | 2015-10-19 | 2016-02-03 | 南京航空航天大学 | Heterogeneous multi-UAV (Unmanned Aerial Vehicle) cooperative scouting and striking task planning method |
CN105678030A (en) * | 2016-03-03 | 2016-06-15 | 黄安祥 | Air-combat tactic team simulating method based on expert system and tactic-military-strategy fractalization |
Non-Patent Citations (3)

Title |
---|
Eduardo et al., "Dynamic Analysis of Multiagent Q-learning with ε-greedy Exploration", ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning |
Stefan Wender et al., "Applying Reinforcement Learning to Small Scale Combat in the Real-Time Strategy Game StarCraft: Broodwar", 2012 IEEE Conference on Computational Intelligence and Games (CIG) |
Liu Bo et al., "Cooperative multi-target attack air combat decision-making based on swarm intelligence" (in Chinese), Acta Aeronautica et Astronautica Sinica |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106411749B (en) * | 2016-10-12 | 2019-07-30 | 国网江苏省电力公司苏州供电公司 | A kind of routing resource for software defined network based on Q study |
CN106411749A (en) * | 2016-10-12 | 2017-02-15 | 国网江苏省电力公司苏州供电公司 | Path selection method for software defined network based on Q learning |
CN106595671A (en) * | 2017-02-22 | 2017-04-26 | 南方科技大学 | Method and apparatus for planning route of unmanned aerial vehicle based on reinforcement learning |
CN106991500A (en) * | 2017-04-10 | 2017-07-28 | 哈尔滨理工大学 | Inventory allocation method based on multi-Agent network for distributed sales model |
CN108966330A (en) * | 2018-09-21 | 2018-12-07 | 西北大学 | A kind of mobile terminal music player dynamic regulation energy consumption optimization method based on Q-learning |
CN110007688A (en) * | 2019-04-25 | 2019-07-12 | 西安电子科技大学 | A kind of cluster distributed formation method of unmanned plane based on intensified learning |
CN110007688B (en) * | 2019-04-25 | 2021-06-01 | 西安电子科技大学 | Unmanned aerial vehicle cluster distributed formation method based on reinforcement learning |
CN110213262B (en) * | 2019-05-30 | 2022-01-28 | 华北电力大学 | Full-automatic advanced escape technology detection method based on deep Q network |
CN110213262A (en) * | 2019-05-30 | 2019-09-06 | 华北电力大学 | A kind of full-automatic advanced escape technical testing method based on depth Q network |
CN110554707A (en) * | 2019-10-17 | 2019-12-10 | 陕西师范大学 | Q learning automatic parameter adjusting method for aircraft attitude control loop |
CN111046497A (en) * | 2019-12-24 | 2020-04-21 | 中国航空工业集团公司沈阳飞机设计研究所 | Rapid assessment device for high-altitude high-speed airplane penetration viability |
CN111046497B (en) * | 2019-12-24 | 2023-04-07 | 中国航空工业集团公司沈阳飞机设计研究所 | Rapid assessment device for high-altitude high-speed airplane penetration viability |
CN113238579A (en) * | 2021-05-18 | 2021-08-10 | 西安电子科技大学 | Multi-unmanned aerial vehicle cluster formation obstacle avoidance method based on Oc-ACO algorithm |
CN115494831A (en) * | 2021-06-17 | 2022-12-20 | 中国科学院沈阳自动化研究所 | Man-machine autonomous intelligent cooperative tracking method |
CN115494831B (en) * | 2021-06-17 | 2024-04-16 | 中国科学院沈阳自动化研究所 | Tracking method for autonomous intelligent collaboration of human and machine |
CN113641191A (en) * | 2021-10-14 | 2021-11-12 | 中国人民解放军空军预警学院 | Airspace configuration method and equipment for cooperative operation of early warning machine and jammer |
CN113641191B (en) * | 2021-10-14 | 2022-01-14 | 中国人民解放军空军预警学院 | Airspace configuration method and device for cooperative operation of early warning machine and jammer |
Also Published As
Publication number | Publication date |
---|---|
CN105844068B (en) | 2018-11-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105844068A (en) | Distribution method oriented to simulation Q learning attack targets | |
CN110083971B (en) | Self-explosion unmanned aerial vehicle cluster combat force distribution method based on combat deduction | |
Xin et al. | An efficient rule-based constructive heuristic to solve dynamic weapon-target assignment problem | |
CN107886184B (en) | Multi-type air defense weapon mixed-programming fire group target distribution optimization method | |
CN112507622B (en) | Anti-unmanned aerial vehicle task allocation method based on reinforcement learning | |
Xin et al. | Efficient decision makings for dynamic weapon-target assignment by virtual permutation and tabu search heuristics | |
CN108319132B (en) | Decision-making system and method for unmanned aerial vehicle air countermeasure | |
CN106779210B (en) | Algorithm of Firepower Allocation based on ant group algorithm | |
CN111859541B (en) | PMADDPG multi-unmanned aerial vehicle task decision method based on transfer learning improvement | |
CN109408877B (en) | Intelligent shooting decision-making method for anti-tank missile teams | |
CN115222271A (en) | Weapon target distribution method based on neural network | |
CN111797966B (en) | Multi-machine collaborative global target distribution method based on improved flock algorithm | |
CN110986680B (en) | Composite interception method for low-speed small targets in urban environment | |
Ha et al. | A stochastic game-based approach for multiple beyond-visual-range air combat | |
Pan et al. | Radar jamming strategy allocation algorithm based on improved chaos genetic algorithm | |
CN110782062A (en) | Many-to-many packet interception target distribution method and system for air defense system | |
CN112464549B (en) | Dynamic allocation method of countermeasure unit | |
CN114548674A (en) | Multi-agent confrontation scene-oriented threat situation assessment method, device and equipment | |
CN114675262A (en) | Hypersonic aircraft searching method based on guide information | |
CN108871080A (en) | Air defence weapon system target interception ordering designs method | |
CN114047761A (en) | Elastic killer network construction method and device based on formation cross-platform resource scheduling | |
CN110930054A (en) | Data-driven battle system key parameter rapid optimization method | |
CN112464548B (en) | Dynamic allocation device for countermeasure unit | |
Xu et al. | Approach for combat capability requirement generation and evaluation of new naval gun | |
CN116976188A (en) | Improved intelligent algorithm for solving weapon-target distribution problem |
Legal Events

Date | Code | Title | Description
---|---|---|---
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |