CN111047014B - Multi-agent air countermeasure distributed sampling training method and equipment - Google Patents


Info

Publication number
CN111047014B
CN111047014B (application CN201911266811.5A)
Authority
CN
China
Prior art keywords
sampling
node
learning
agent air
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911266811.5A
Other languages
Chinese (zh)
Other versions
CN111047014A (en)
Inventor
孙智孝
彭宣淇
朴海音
杨晟琦
孙阳
李思凝
杜冲
刘仲
葛俊
杨芳
詹光
王言伟
张少卿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Aircraft Design and Research Institute Aviation Industry of China AVIC
Original Assignee
Shenyang Aircraft Design and Research Institute Aviation Industry of China AVIC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Aircraft Design and Research Institute Aviation Industry of China AVIC filed Critical Shenyang Aircraft Design and Research Institute Aviation Industry of China AVIC
Priority to CN201911266811.5A priority Critical patent/CN111047014B/en
Publication of CN111047014A publication Critical patent/CN111047014A/en
Application granted granted Critical
Publication of CN111047014B publication Critical patent/CN111047014B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06N3/045 Combinations of networks (G06N3/04 Architecture, e.g. interconnection topology; G06N3/02 Neural networks; G06N3/00 Computing arrangements based on biological models; G06N Computing arrangements based on specific computational models; G06 Computing, calculating or counting; G Physics)
    • G06N3/08 Learning methods (G06N3/02 Neural networks; G06N3/00 Computing arrangements based on biological models)

Abstract

The application belongs to the field of multi-agent air countermeasure games, and in particular relates to a multi-agent air countermeasure distributed sampling training method and device. The method comprises the following steps. Step one: acquire a learning node and a sampling node, establish a connection between them, and initialize the multi-agent air countermeasure network. Step two: the learning node sends a sampling instruction to the sampling node; the sampling node receives the instruction, starts sampling, and sends the samples to the learning node once collected. Step three: the learning node trains on the samples, then updates and stores the multi-agent air countermeasure network. The method and device can perform distributed sampling and training of the multi-agent air countermeasure network, improving the system's sample collection and transmission efficiency and the training efficiency of the countermeasure network.

Description

Multi-agent air countermeasure distributed sampling training method and equipment
Technical Field
The application belongs to the field of multi-agent air countermeasure games, and in particular relates to a multi-agent air countermeasure distributed sampling training method and device.
Background
Reinforcement learning is currently an important approach to sequential decision problems and has achieved excellent results in many fields. The multi-agent air countermeasure problem is a typical sequential decision problem, characterized by high-dimensional state and action spaces, so training the multi-agent air countermeasure neural network requires a large number of samples.
When stand-alone computing resources are limited, a multi-machine distributed sampling method is needed to increase sample collection efficiency and the overall training efficiency of the countermeasure network. Using reinforcement learning for multi-agent air countermeasure network distributed sampling training mainly faces the following difficulties: a. the countermeasure network consists of several neural networks, such as a state value network, a maneuver strategy network, a target distribution network, and a missile firing decision network, and different nodes need to share and continuously update their parameters; b. the learning node must control when the sampling nodes start and stop sampling, and must efficiently retrieve the samples they collect; c. because the sampling and training process takes a long time, the learning and sampling nodes require redundancy and fault-tolerance design to ensure the training system runs stably without manual intervention; d. the sampling nodes transmit samples over the network, so network occupation and congestion must be addressed when there are many sampling nodes, and hard-disk read/write operations should be minimized to reduce computation time. Owing to these difficulties, the prior art generally suffers from poor stability and low efficiency in multi-agent air countermeasure network distributed sampling training.
It is therefore desirable to have a solution that overcomes or at least alleviates at least one of the above-mentioned drawbacks of the prior art.
Disclosure of Invention
The purpose of the application is to provide a multi-agent air countermeasure distributed sampling training method and equipment, so as to solve at least one problem existing in the prior art.
The technical scheme of the application is as follows:
a first aspect of the present application provides a multi-agent air countermeasure distributed sampling training method, comprising:
step one: acquiring a learning node and a sampling node, establishing a connection between the learning node and the sampling node, and initializing a multi-agent air countermeasure network;
step two: the learning node sends a sampling instruction to the sampling node, the sampling node receives the sampling instruction and starts sampling, and the sampling node sends a sample to the learning node after collecting the sample;
step three: and training by the learning node through the sample, and updating and storing the multi-agent air countermeasure network.
Optionally, in step one, establishing the connection between the learning node and the sampling node includes:
assigning the computer network addresses of the sampling nodes to the learning node;
the learning node queries the number of available sampling nodes through the gRPC service and records the network locations of the available sampling nodes in its memory.
Optionally, in step two, the learning node sending a sampling instruction to the sampling node, the sampling node receiving the instruction and starting sampling, and the sampling node sending the samples to the learning node after collection includes:
S21, the learning node sends a sampling instruction to the sampling node, and the sampling node receives the instruction and starts sampling;
S22, after collecting the samples, the sampling node serializes them and sends the serialized samples to the redis server of the learning node;
S23, the learning node reads the samples from the redis server, deserializes them, and stores them in memory; after a given number of samples has been collected, the learning node stops sending sampling instructions to the sampling node, and sampling stops.
Optionally, in step S21, the learning node sending a sampling instruction and the sampling node receiving it and starting sampling is specifically:
S211, the learning node serializes a sampling flag bit set to 1 together with the multi-agent air countermeasure network and sends them to the sampling node;
S212, the sampling node receives and deserializes the sampling flag bit 1 and the multi-agent air countermeasure network, and starts sampling.
Optionally, in step S211, the learning node sends the serialized multi-agent air countermeasure network to the sampling node via the gRPC service using the proto3 protocol.
Optionally, in step S23, the learning node reads the samples in the redis server by a blocking pop method.
Optionally, in step S23, the stopping of sampling after a given number of samples has been collected is specifically:
after the required number of samples has been collected, the learning node changes the sampling flag bit from 1 to 0, and the sampling node stops sampling upon receiving flag bit 0.
Optionally, the method further comprises step four: iterating steps two and three to continuously update the multi-agent air countermeasure network.
A second aspect of the present application provides a multi-agent air countermeasure distributed sampling training device, based on the multi-agent air countermeasure distributed sampling training method as described above, comprising:
the system comprises an initialization module, a sampling module and a multi-agent air countermeasure network, wherein the initialization module is used for acquiring a learning node and a sampling node, establishing a connection between the learning node and the sampling node and initializing the multi-agent air countermeasure network;
the sampling module is used for sending a sampling instruction to the sampling node by the learning node, the sampling node receives the sampling instruction and starts sampling, and the sampling node sends a sample to the learning node after collecting the sample;
and the training module is used for training the learning node through the sample, updating and storing the multi-agent air countermeasure network.
The invention has at least the following beneficial technical effects:
the multi-agent air countermeasure distributed sampling training method can complete distributed sampling and training of the multi-agent air countermeasure network, and improves sample collection and transmission efficiency and countermeasure network training efficiency of the system.
Drawings
FIG. 1 is a flow chart of a multi-agent air countermeasure distributed sampling training method in accordance with one embodiment of the present application;
FIG. 2 is a graph showing how the sampling time varies with the number of sampling nodes when the same number of samples is collected;
FIG. 3 is a graph of peak network traffic versus single-transmission sample size for a learning node of the present application.
Detailed Description
In order to make the purposes, technical solutions and advantages of the implementation of the present application more clear, the technical solutions in the embodiments of the present application will be described in more detail below with reference to the accompanying drawings in the embodiments of the present application. In the drawings, the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The described embodiments are some, but not all, of the embodiments of the present application. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application. Embodiments of the present application are described in detail below with reference to the accompanying drawings.
In the description of the present application, it should be understood that the terms "center," "longitudinal," "lateral," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientations or positional relationships illustrated in the drawings, merely to facilitate description of the present application and simplify the description, and do not indicate or imply that the device or element being referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the scope of protection of the present application.
The present application is described in further detail below in conjunction with fig. 1-3.
A first aspect of the present application provides a multi-agent air countermeasure distributed sampling training method, comprising:
step one: acquiring a learning node and a sampling node, establishing a connection between the learning node and the sampling node, and initializing a multi-agent air countermeasure network;
step two: the learning node sends a sampling instruction to the sampling node, the sampling node receives the sampling instruction and starts sampling, and the sampling node sends a sample to the learning node after collecting the sample;
step three: the learning node trains through the samples, updates and stores the multi-agent air countermeasure network.
Specifically, in one embodiment of the present application, in step one, establishing the connection between the learning node and the sampling node includes:
designating the computer network addresses of the sampling nodes for the learning node;
the learning node queries the number of available sampling nodes through the gRPC service and records the network locations of the available sampling nodes in its memory.
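The node bookkeeping described above can be sketched as below, with a hypothetical in-memory registry standing in for the actual gRPC availability query; the class and method names are illustrative, not from the patent:

```python
# Sketch of the learning node's view of its sampling nodes: it is given
# their network addresses, probes each one, and records the available
# ones in memory. probe() stands in for the gRPC availability check.
class SamplingNodeRegistry:
    def __init__(self, addresses):
        self.addresses = list(addresses)  # addresses assigned to the learning node
        self.available = []               # network locations recorded in memory

    def refresh(self, probe):
        """probe(addr) -> bool; returns the number of available nodes."""
        self.available = [a for a in self.addresses if probe(a)]
        return len(self.available)
```

In the patent's flow, `refresh` would be called before each sampling cycle, and sampling starts only when at least one node is available.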
In step two, the learning node sending a sampling instruction, the sampling node receiving it and starting sampling, and the sampling node sending the samples to the learning node after collection includes:
S21, the learning node sends a sampling instruction to the sampling node, and the sampling node receives the instruction and starts sampling. Specifically: S211, the learning node serializes a sampling flag bit set to 1 together with the multi-agent air countermeasure network and sends them to the sampling node; S212, the sampling node receives and deserializes the sampling flag bit 1 and the multi-agent air countermeasure network, and starts sampling. In this embodiment, the learning node sends the serialized multi-agent air countermeasure network to the sampling node via the gRPC service using the proto3 protocol.
S22, after collecting the samples, the sampling node serializes them and sends the serialized samples to the redis server of the learning node;
S23, the learning node reads the samples from the redis server, deserializes them, and stores them in memory; after a given number of samples has been collected, the learning node stops sending sampling instructions to the sampling node, and sampling stops. In this embodiment, the learning node reads the samples in the redis server by a blocking pop method. The stopping of sampling works as follows: once the required number of samples has been collected, the learning node changes the sampling flag bit from 1 to 0, and the sampling node stops sampling upon receiving flag bit 0.
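Steps S21 through S23 can be sketched as below, assuming `pickle` as a stand-in for the proto3 serialization and `queue.Queue` as a stand-in for the learning node's redis server; all names are illustrative:

```python
import pickle
import queue

def pack_control_message(sampling_flag, network_params):
    """S211: serialize the sampling flag bit together with the network."""
    return pickle.dumps({"flag": sampling_flag, "network": network_params})

def unpack_control_message(payload):
    """S212: the sampling node deserializes the flag and the network."""
    msg = pickle.loads(payload)
    return msg["flag"], msg["network"]

def collect_samples(sample_queue, needed, timeout=1.0):
    """S23: blocking-pop serialized samples and keep them in memory."""
    memory = []
    while len(memory) < needed:
        payload = sample_queue.get(timeout=timeout)  # like a redis blocking pop
        memory.append(pickle.loads(payload))
    return memory
```

Once `collect_samples` returns, the learning node would send a control message with the flag set to 0 to stop the sampling nodes.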
The multi-agent air countermeasure distributed sampling training method further comprises step four: iterating steps two and three, continuously updating the multi-agent air countermeasure network, until the required number of training cycles is reached or the user stops training manually.
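The iterate-and-update cycle of step four can be sketched with hypothetical stand-in functions for the sampling and training steps:

```python
def training_loop(sample_fn, train_fn, max_cycles):
    """Repeat step two (distributed sampling) and step three (training)
    for a fixed number of cycles; sample_fn and train_fn are illustrative
    stand-ins for the patent's sampling and training procedures."""
    results = []
    for _ in range(max_cycles):
        samples = sample_fn()              # step two: collect samples
        results.append(train_fn(samples))  # step three: update the network
    return results
```

In practice the loop would also check a manual-stop condition, as the text describes.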
Based on the above-mentioned multi-agent air countermeasure distributed sampling training method, a second aspect of the present application provides a multi-agent air countermeasure distributed sampling training device, including:
the initialization module is used for acquiring the learning node and the sampling node, establishing the connection between the learning node and the sampling node, and initializing the multi-agent air countermeasure network;
the sampling module is used for the learning node to send a sampling instruction to the sampling node; the sampling node receives the instruction, starts sampling, and sends the samples to the learning node after collection;
and the training module is used for training the learning nodes through the samples, updating and storing the multi-agent air countermeasure network.
Specifically, in the initialization module, the distributed sampling training program is first deployed on different computers; one computer is designated as the learning node, the network addresses of the sampling nodes are given to the learning node, and the countermeasure network is initialized or loaded.
In the sampling module, the learning node is started and uses gRPC to detect available sampling nodes; if any are available, sampling begins, otherwise the node waits in a loop. The learning node sends the start-sampling instruction: it serializes the countermeasure network and sends the result to the sampling nodes; each sampling node receives the parameters, deserializes the countermeasure network, and starts its sampling program, serializing the collected samples and storing them in the learning node's redis. The learning node reads the sample data from redis and keeps the deserialized result in memory. Once enough samples have been collected, the learning node changes the sampling flag bit, and the sampling nodes stop sampling and wait for the next call. In this embodiment, the learning node is the center of the distributed sampling training framework: it can start independently, raises an event when no sampling node is available, loops waiting for sampling nodes to join, and begins sampling as soon as one sampling node becomes available. System monitoring and timeout monitoring can be added to the sampling module: if the learning node or its program crashes, the program restarts, automatically recovers the initial state, loads the stored network, and sampling continues; if a sampling node or its program crashes, the current sampling cycle is carried on by the other sampling nodes, and the crashed node automatically rejoins the next cycle after restarting. In this embodiment, samples are serialized before transmission, and the serialized files are read and written as in-memory virtual files to avoid excessive hard-disk read/write operations.
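The timeout-monitoring idea can be sketched as below; the thread-based wrapper is an illustrative assumption, not the patent's actual mechanism:

```python
import threading

def call_with_timeout(fn, timeout):
    """Run fn in a worker thread; None means it missed the deadline
    (or returned None), which we treat as a crashed/slow node."""
    result = {}
    t = threading.Thread(target=lambda: result.update(value=fn()))
    t.daemon = True
    t.start()
    t.join(timeout)
    return result.get("value")

def sample_from_nodes(node_fns, timeout=0.5):
    """Collect samples from each node; nodes that time out are skipped
    so the current cycle is completed by the remaining nodes."""
    samples = []
    for fn in node_fns:
        out = call_with_timeout(fn, timeout)
        if out is not None:  # skip crashed or slow nodes this cycle
            samples.extend(out)
    return samples
```

A crashed node would simply be probed again on the next cycle, matching the rejoin behavior described above.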
In the training module, the learning node trains on the collected samples, updates the network parameters, and stores the network. The countermeasure network comprises a state value network V_net, a maneuver strategy network A_net, a target distribution network target_net, and a missile firing decision network shoot_net, each with its own update formula:
[The four update formulas appear only as image references in the source document and are not reproduced here.]
where S_i is the current multi-agent air countermeasure state quantity, r_i is the reward given by the environment at step i, and γ is the discount factor; p(a_i|S_i), p(target_i|S_i) and p(shoot_i|S_i) are, respectively, the probabilities that in state S_i the aircraft performs the current maneuver, selects the current target, and makes a firing decision.
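Since the patent's exact update formulas are not legible in this text, the following is a hedged sketch of a standard actor-critic-style update consistent with the quantities defined above (state S_i, reward r_i, discount factor γ, and the three decision probabilities); it is an assumption, not the patent's actual formulas:

```python
import math

def td_advantage(r, v_s, v_s_next, gamma=0.99):
    """One-step TD advantage: A_i = r_i + gamma * V(S_{i+1}) - V(S_i)."""
    return r + gamma * v_s_next - v_s

def value_loss(r, v_s, v_s_next, gamma=0.99):
    """Squared TD error, the usual objective for a state value network
    such as V_net."""
    return td_advantage(r, v_s, v_s_next, gamma) ** 2

def policy_loss(prob, advantage):
    """Policy-gradient-style loss -log p(.|S_i) * A_i, which would be
    applied alike to the maneuver, target-selection, and firing
    probabilities p(a_i|S_i), p(target_i|S_i), p(shoot_i|S_i)."""
    return -math.log(prob) * advantage
```

The three decision heads sharing one advantage estimate is a common pattern for multi-head policies, but whether the patent does exactly this cannot be confirmed from the text.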
Advantageously, in this embodiment, the training module serializes the trained network and stores it on the learning node's hard disk during the training process, so that the training result can be inspected without interrupting training.
Advantageously, the multi-agent over-the-air challenge distributed sampling training device of the present application supports the addition of sampling nodes without interrupting the training process.
The multi-agent air countermeasure distributed sampling training method and device have the following advantages:
1) During sampling, multi-machine, multi-process sampling can be realized, increasing sample collection efficiency; this shortens the sampling-training iteration and accelerates training and convergence of the neural network. Figure 2 shows how the single-cycle sampling time varies with the number of sampling nodes.
2) Thanks to fault-tolerance mechanisms such as the availability query and timeout protection, the current cycle still executes correctly when some sampling nodes cannot be reached, and the learning node can restart itself.
3) The transmitted data, namely the samples and the neural network, require no hard-disk I/O operations, which helps the program run quickly and efficiently.
4) For the high-dimensional, high-volume multi-agent air countermeasure sample data, a segmented transmission method is adopted, and the optimal number of samples per transmission is found experimentally, reducing network load and increasing network bandwidth utilization. At a given network transmission speed, more sampling nodes can be attached and sampling efficiency increases; Figure 3 shows the relation between peak network traffic and the single-transmission sample size.
5) While the learning node waits for samples, it can itself sample using spare CPU resources, further increasing sampling efficiency.
6) Blocking data reads and flag-bit-controlled sampling stop reduce the waiting time of the sampling nodes and the learning node, and a crash of some sampling nodes' programs does not prevent the current sampling cycle from completing.
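The segmented transmission of point 4 can be sketched as below; the chunk size would be tuned experimentally as the text describes, and the value used here is purely illustrative:

```python
def chunked(samples, chunk_size):
    """Split a batch of samples into consecutive chunks of at most
    chunk_size, so that no single send saturates the network link."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [samples[i:i + chunk_size] for i in range((0), len(samples), chunk_size)]
```

Each chunk would then be serialized and pushed to the learning node's redis independently, keeping per-transfer traffic bounded.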
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily conceivable by those skilled in the art within the technical scope of the present application should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. A multi-agent air countermeasure distributed sampling training method, comprising:
step one: acquiring a learning node and a sampling node, establishing a connection between the learning node and the sampling node, and initializing a multi-agent air countermeasure network;
step two: the learning node sends a sampling instruction to the sampling node, the sampling node receives the sampling instruction and starts sampling, and the sampling node sends a sample to the learning node after collecting the sample;
step three: the learning node trains through samples, updates and stores the multi-agent air countermeasure network;
in step one, the establishing of a connection between the learning node and the sampling node includes:
assigning the computer network addresses of the sampling nodes to the learning node;
the learning node queries the number of available sampling nodes through the gRPC service and records the network locations of the available sampling nodes in its memory;
the learning node sending a sampling instruction to the sampling node, the sampling node receiving the instruction and starting sampling, and the sampling node sending the samples to the learning node after collection comprises:
S21, the learning node sends a sampling instruction to the sampling node, and the sampling node receives the instruction and starts sampling;
S22, after collecting the samples, the sampling node serializes them and sends the serialized samples to the redis server of the learning node;
S23, the learning node reads the samples from the redis server, deserializes them, and stores them in memory; after a given number of samples has been collected, the learning node stops sending sampling instructions to the sampling node, and sampling stops.
2. The multi-agent air countermeasure distributed sampling training method according to claim 1, wherein in step S21, the learning node sending a sampling instruction and the sampling node receiving it and starting sampling is specifically:
S211, the learning node serializes a sampling flag bit set to 1 together with the multi-agent air countermeasure network and sends them to the sampling node;
S212, the sampling node receives and deserializes the sampling flag bit 1 and the multi-agent air countermeasure network, and starts sampling.
3. The multi-agent air countermeasure distributed sampling training method according to claim 2, wherein in step S211, the learning node sends the serialized multi-agent air countermeasure network to the sampling node via the gRPC service using the proto3 protocol.
4. The multi-agent air countermeasure distributed sampling training method as claimed in claim 3, wherein in step S23, the learning node reads the samples in the redis server by a blocking pop method.
5. The multi-agent air countermeasure distributed sampling training method according to claim 4, wherein in step S23, the stopping of sampling after a given number of samples has been collected is specifically:
after the required number of samples has been collected, the learning node changes the sampling flag bit from 1 to 0, and the sampling node stops sampling upon receiving flag bit 0.
6. The multi-agent air countermeasure distributed sampling training method of claim 5, further comprising step four: iterating steps two and three to continuously update the multi-agent air countermeasure network.
7. A multi-agent air countermeasure distributed sampling training apparatus, based on the multi-agent air countermeasure distributed sampling training method as claimed in any one of claims 1 to 6, comprising:
an initialization module, configured to acquire a learning node and a sampling node, establish a connection between them, and initialize the multi-agent air countermeasure network;
a sampling module, configured to have the learning node send a sampling instruction to the sampling node, the sampling node receive the instruction and start sampling, and the sampling node send the samples to the learning node after collection;
and a training module, configured to have the learning node train on the samples and update and store the multi-agent air countermeasure network.
CN201911266811.5A 2019-12-11 2019-12-11 Multi-agent air countermeasure distributed sampling training method and equipment Active CN111047014B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911266811.5A CN111047014B (en) 2019-12-11 2019-12-11 Multi-agent air countermeasure distributed sampling training method and equipment


Publications (2)

Publication Number Publication Date
CN111047014A CN111047014A (en) 2020-04-21
CN111047014B true CN111047014B (en) 2023-06-23

Family

ID=70235634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911266811.5A Active CN111047014B (en) 2019-12-11 2019-12-11 Multi-agent air countermeasure distributed sampling training method and equipment

Country Status (1)

Country Link
CN (1) CN111047014B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108629422A (en) * 2018-05-10 2018-10-09 浙江大学 A kind of intelligent body learning method of knowledge based guidance-tactics perception
WO2018206504A1 (en) * 2017-05-10 2018-11-15 Telefonaktiebolaget Lm Ericsson (Publ) Pre-training system for self-learning agent in virtualized environment
CN109635917A (en) * 2018-10-17 2019-04-16 北京大学 A kind of multiple agent Cooperation Decision-making and training method
CN109829500A (en) * 2019-01-31 2019-05-31 华南理工大学 A kind of position composition and automatic clustering method
CN109947567A (en) * 2019-03-14 2019-06-28 深圳先进技术研究院 A kind of multiple agent intensified learning dispatching method, system and electronic equipment
CN110309268A (en) * 2019-07-12 2019-10-08 中电科大数据研究院有限公司 A kind of cross-language information retrieval method based on concept map

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9082082B2 (en) * 2011-12-06 2015-07-14 The Trustees Of Columbia University In The City Of New York Network information methods devices and systems
US11301774B2 (en) * 2017-02-28 2022-04-12 Nec Corporation System and method for multi-modal graph-based personalization
US11562287B2 (en) * 2017-10-27 2023-01-24 Salesforce.Com, Inc. Hierarchical and interpretable skill acquisition in multi-task reinforcement learning
US10984905B2 (en) * 2017-11-03 2021-04-20 Siemens Healthcare Gmbh Artificial intelligence for physiological quantification in medical imaging
US10715395B2 (en) * 2017-11-27 2020-07-14 Massachusetts Institute Of Technology Methods and apparatus for communication network
US10726025B2 (en) * 2018-02-19 2020-07-28 Microsoft Technology Licensing, Llc Standardized entity representation learning for smart suggestions
US20190347933A1 (en) * 2018-05-11 2019-11-14 Virtual Traffic Lights, LLC Method of implementing an intelligent traffic control apparatus having a reinforcement learning based partial traffic detection control system, and an intelligent traffic control apparatus implemented thereby


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Hierarchical Framework of Cloud Resource Allocation and Power Management Using Deep Reinforcement Learning; Ning et al.; IEEE; pp. 372-382 *
Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks; Foerster et al.; arXiv:1602.02672; pp. 1-10 *
Advances in Deep Reinforcement Learning: From AlphaGo to AlphaGo Zero; Tang Zhentao et al.; Control Theory & Applications; Vol. 34, No. 12; pp. 1529-1546 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant