CN117666589A - Unmanned ship missile interception and avoidance algorithm based on reinforcement learning, interception and avoidance system and readable storage medium - Google Patents


Info

Publication number
CN117666589A
CN117666589A (application CN202311681927.1A)
Authority
CN
China
Prior art keywords
missile
unmanned ship
interception
avoidance
unmanned
Prior art date
Legal status
Pending
Application number
CN202311681927.1A
Other languages
Chinese (zh)
Inventor
刘军
孔凤杰
肖翰文
丁晓蕾
尹小丹
冯亿喆
王颖
程士豪
Current Assignee
Jining University
Original Assignee
Jining University
Priority date
Filing date
Publication date
Application filed by Jining University filed Critical Jining University
Priority to CN202311681927.1A
Publication of CN117666589A
Legal status: Pending


Abstract

The invention belongs to the technical field of unmanned ships, and specifically discloses an unmanned ship missile interception and avoidance algorithm based on reinforcement learning, comprising the following steps: first, environmental information is acquired with the unmanned ship's sensors and it is judged whether a tracking missile is present; then, the coordinates, speed, course angle and other values of the unmanned ship, the tracking missile and the interceptor missile are obtained as the reinforcement learning state quantity; the state quantity is input into the reinforcement learning network model to obtain an action quantity, which is then executed; these steps are repeated until the unmanned ship has fully avoided or intercepted the tracking missile. By addressing both avoidance and interception, the invention uses reinforcement learning to control the unmanned ship against missile threats; it is adaptive, efficient and timely, and has broad application prospects.

Description

Unmanned ship missile interception and avoidance algorithm based on reinforcement learning, interception and avoidance system and readable storage medium
Technical Field
The invention belongs to the technical field of unmanned ships, and particularly relates to an unmanned ship missile interception and avoidance algorithm based on reinforcement learning, an interception and avoidance system and a readable storage medium.
Background
With the application of unmanned ships in military, security, emergency rescue and other fields, the missile threats they face keep increasing, and missile interception and avoidance have become key to their normal navigation. For the problem of unmanned ship missile interception and avoidance, traditional methods mainly rely on rules and pre-designed control algorithms; they struggle with the complexity and uncertainty of missile threats, are limited when dealing with complex and changeable threats, and adapt poorly to different environments and tactical situations. In addition, existing algorithms have inherent limitations in adapting to different environments and adversary strategies and cannot perform autonomous learning and optimization, so the accuracy and efficiency of missile interception and avoidance are hard to guarantee, affecting the safety of unmanned ships working at sea.
Disclosure of Invention
Based on the above problems, the invention provides an unmanned ship missile interception and avoidance algorithm based on reinforcement learning, an interception and avoidance system and a readable storage medium.
Based on the above purpose, the invention is realized through the following technical scheme:
A first aspect of the invention provides an unmanned ship missile interception and avoidance algorithm based on reinforcement learning, comprising the following steps:
S11, acquiring environmental information at the current moment by using sensors such as sonar, camera and radar mounted on the unmanned ship, processing and analyzing the environmental information, and judging whether a tracking missile exists; if yes, executing step S22; if not, repeatedly executing step S11.
S22, obtaining the speed and the course angle of the unmanned ship and the tracking missile; taking the center of the unmanned ship as the origin of the coordinate system, obtaining the X-Y-Z coordinates of the tracking missile at the current moment and the relative distance between the tracking missile and the unmanned ship, and calculating the elevation angle of the tracking missile relative to the unmanned ship; then obtaining the coordinates, the speed and the course angle of the interceptor missile.
S33, inputting all variables in S22 as the state quantity into the unmanned ship missile interception and avoidance network model, obtaining the action quantity for controlling the unmanned ship propulsion and the interceptor missile launch, and controlling the unmanned ship's motion and the interceptor launch accordingly; the unmanned ship missile interception and avoidance network model is the optimal network model obtained after training the reinforcement learning-based unmanned ship missile interception and avoidance algorithm.
S44, repeating steps S11-S33 so that the unmanned ship evades the tracking missile while intercepting it, achieving the purpose of unmanned ship missile interception and avoidance.
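By way of illustration only, a minimal Python sketch of this deployed S11-S44 loop follows; every interface name here (get_observation, detect_tracking_missile, build_state, act, apply_action, threat_cleared) is a hypothetical placeholder, not from the patent:

```python
def interception_avoidance_loop(policy, ship, dt=0.1):
    """Run the trained policy until the tracking missile is avoided or intercepted."""
    while True:
        obs = ship.get_observation()              # S11: fused sonar/camera/radar data
        if not ship.detect_tracking_missile(obs):
            continue                              # no threat detected: keep monitoring
        s_t = ship.build_state(obs)               # S22: coordinates, speeds, course angles
        a_t = policy.act(s_t)                     # S33: propulsion + interceptor launch command
        ship.apply_action(a_t, dt)
        if ship.threat_cleared():                 # S44: missile avoided or intercepted
            break
```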
Preferably, the method comprises the following steps:
s1, setting an unmanned ship missile interception and avoidance virtual environment, wherein the unmanned ship missile interception and avoidance virtual environment comprises a marine environment, an unmanned ship model, an unmanned ship and missile motion model, a missile model and an obstacle model.
S2, randomly setting a tracking missile at one point in the virtual environment for attacking the unmanned ship.
S3, acquiring environmental information at the current moment by using sensors such as sonar, camera and radar mounted on the unmanned ship, processing and analyzing the environmental information, and judging whether a tracking missile exists; if so, executing step S4; if not, repeatedly executing step S3.
S4, obtaining the speed and the course angle of the unmanned ship and the tracking missile; taking the center of the unmanned ship as the origin of the coordinate system, obtaining the X-Y-Z coordinates of the tracking missile at the current moment and the relative distance between the tracking missile and the unmanned ship, and calculating the elevation angle of the tracking missile relative to the unmanned ship; then obtaining the coordinates, the speed and the course angle of the interceptor missile.
S5, taking each variable in S4 as the state quantity s_t and inputting it into the unmanned ship missile interception and avoidance network model to obtain the action quantity a_t for controlling the unmanned ship propulsion and the interceptor missile launch, and controlling the unmanned ship's motion and the interceptor launch accordingly; meanwhile, obtaining the current local reward function r_st(s_t) based on the state quantity information. The unmanned ship missile interception and avoidance network model is an improved proximal policy optimization (Proximal Policy Optimization, hereinafter PPO) algorithm neural network.
S6, repeating steps S3-S5 to obtain the state quantity s_t′ at the next moment, and taking the current state quantity s_t, action a_t, reward r_st and next-moment state quantity s_t′ as a set of experience tuples (s_t, a_t, r_st, s_t′).
S7, repeatedly executing steps S3-S6 to obtain multiple sets of experience tuples and storing them in an experience buffer pool. When the unmanned ship's interceptor missile successfully intercepts the tracking missile, or the unmanned ship successfully evades it, the current episode's training ends. From the training result, the global reward function r_t(s_1, a_1, s_2, a_2, ..., s_T) is obtained; according to the global reward function and the number of steps in the episode, the local reward function r_st(s_t) of each step in the episode is updated to obtain the global-local reward function r_t(s_i), and each corresponding set of experience tuples is updated to (s_t, a_t, r_t, s_t′).
S8, repeatedly executing S2-S7 while counting the number of completed episodes; if the episode count reaches the number E required for a parameter update, executing step S9, clearing the count and counting again; if the count is smaller than E, continuing to repeat S2-S7 until E episodes are reached.
S9, screening n sets of experience tuples from the experience buffer pool by a prioritized experience replay algorithm, and sequentially updating the parameters of the unmanned ship missile interception and avoidance network model with the screened n sets of experience tuples.
S10, judging whether the total episode count reaches the preset number of cycles N; if so, ending the loop and concluding the unmanned ship missile interception and avoidance network model training to obtain the optimal network model.
Preferably, in step S5, the state quantity s_t consists of the variables defined below (the explicit tuple expression is omitted here), and the action quantity is a_t = (v_u′, v_l′, α_u′, α_l′, B_l);
wherein (x_d, y_d, z_d) represents the coordinates of the tracking missile in the unmanned ship coordinate system; (x_l, y_l, z_l) represents the coordinates of the interceptor missile in the unmanned ship coordinate system; v_u, v_d, v_l respectively represent the speeds of the unmanned ship, the tracking missile and the interceptor missile; the elevation angle of the tracking missile relative to the unmanned ship is written φ here; d represents the straight-line distance between the tracking missile and the unmanned ship; α_u, α_d, α_l respectively represent the course angles of the unmanned ship, the tracking missile and the interceptor missile; B_l is a Boolean variable that is 1 when the interceptor missile has been launched and 0 otherwise, and all interceptor-related parameters are 0 while the interceptor missile has not been launched.
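As a sketch only, the state and action quantities can be assembled as flat vectors; the ordering inside s_t is an assumption (the source renders the tuple as an image), and the function names are illustrative:

```python
import numpy as np

def build_state(x_d, y_d, z_d, x_l, y_l, z_l, v_u, v_d, v_l,
                phi, d, a_u, a_d, a_l, B_l):
    """Assumed layout of s_t; interceptor-related entries stay 0 while B_l == 0."""
    return np.array([x_d, y_d, z_d, x_l, y_l, z_l, v_u, v_d, v_l,
                     phi, d, a_u, a_d, a_l, float(B_l)], dtype=np.float32)

def build_action(v_u_new, v_l_new, a_u_new, a_l_new, launch):
    """a_t = (v_u', v_l', alpha_u', alpha_l', B_l), as given in the text."""
    return np.array([v_u_new, v_l_new, a_u_new, a_l_new,
                     1.0 if launch else 0.0], dtype=np.float32)
```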
Preferably, in step S5, the local reward function r_st(s_t) is a weighted function of the following quantities (the explicit formula is omitted here):
wherein ω_1, ω_2, ω_3, ω_4 represent hyperparameters, adjusted according to the actual situation; d represents the straight-line distance between the tracking missile and the unmanned ship; τ represents a supplementary factor preventing d = 0 from causing a gradient explosion, typically τ = 0.05; t and s respectively represent the time elapsed since the start of the episode and the number of steps currently performed.
Preferably, in step S7, the global reward function r_t(s_1, a_1, s_2, a_2, ..., s_T) is composed of an interception reward R_l and an evasion reward R_b, each defined piecewise over the episode outcome (the explicit formulas are omitted here).
preferably, in step S7, the global rewards function r is used t (s 1 ,a 1 ,s 2 ,a 2 ,...,s T ) And a local rewards function r st (s t ) Obtaining global local rewarding function r t (s i ) The method comprises the following steps:
preferably, in step S5, the unmanned ship missile interception and avoidance network model includes an Actor network and a Critic network; the Actor network is a strategy network, and the Critic network is a value network; the input to the Actor network is the state quantity s t Output is the action quantity a t The method comprises the steps of carrying out a first treatment on the surface of the The input of the Critic network is the state quantity s at the current moment t Output is state quantity s t Is a state value evaluation value of (a).
Preferably, the Actor network comprises an input layer, four linear hidden layers and an output layer: the input layer receives the state quantity s_t, the four linear hidden layers have 1024, 512, 256 and 128 nodes respectively, and the output value of the final output layer is the action quantity a_t. The Critic network comprises an input layer, two linear hidden layers and an output layer: the input layer likewise receives the state quantity s_t, the two linear hidden layers have 256 and 128 nodes, and the output layer yields the state value estimate.
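A minimal PyTorch sketch of the two networks follows; the input/output dimensions and the ReLU activations are assumptions, since the text specifies only the hidden-layer node counts:

```python
import torch.nn as nn

STATE_DIM, ACTION_DIM = 15, 5  # assumed from the variable lists above

class Actor(nn.Module):
    """Policy network: 1024-512-256-128 linear hidden layers, outputs a_t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, ACTION_DIM),
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Value network: 256-128 linear hidden layers, outputs a scalar state value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, s):
        return self.net(s)
```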
Preferably, in step S9, the experience data are screened from the experience buffer pool by the prioritized experience replay algorithm according to the sampling probability (reconstructed here in the standard prioritized-sampling form; the source renders the formula as an image):

P(c) = p_c^α / Σ_k p_k^α,

wherein P(c) represents the probability that the c-th set of experience data (s_t, a_t, r_st, s_t′) is selected from the experience buffer pool; c represents the serial number of the extracted experience data; p_k represents the priority of the k-th set of experience data; and α represents a preset parameter for adjusting the degree of preferential sampling of the data samples.
Preferably, in step S9, the parameters of the unmanned ship missile interception and avoidance network model are updated and optimized with the PPO algorithm, in the following steps:
(1) Defining variables.
The parameters of the policy network are denoted by θ, and π_θ(a_t|s_t) denotes the probability of the policy outputting action a_t in state s_t; the parameters of the value network are denoted here by φ, and V_φ(s_t) denotes the output of the value network, estimating the value of state s_t.
(2) Defining the advantage function.
The advantage function A(s_t, a_t) denotes the advantage of executing action a_t in state s_t, computed as the difference between the return of the current state and a baseline value:

A(s_t, a_t) = Q(s_t, a_t) − V_φ(s_t),

where Q(s_t, a_t) is the cumulative return after executing action a_t, here the discounted sum

Q(s_t, a_t) = Σ_{t′=t}^{T} γ^{t′−t} · r_{t′},

and V_φ(s_t) is the value network's estimate of state s_t; t represents the current time step, T represents the number of steps in the episode, γ represents the discount factor expressing how much future rewards are discounted, and r_{t′} represents the reward obtained at time t′.
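As an illustration, these two quantities can be computed as follows; γ = 0.99 is purely an example value:

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Q(s_t, a_t): discounted sum of rewards from step t to the episode end T."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return torch.tensor(list(reversed(returns)), dtype=torch.float32)

def advantage(returns, values):
    """A(s_t, a_t) = Q(s_t, a_t) - V_phi(s_t)."""
    return returns - values.detach()
```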
(3) The objective function.
The objective function consists of three parts: the policy network's ratio term, the value-function loss term and an entropy regularization term. Reconstructed here in the standard PPO-clip form (the source renders the formula as an image):

L_PPO2(θ) = Ê_t[ min(ρ_t A_t, clip(ρ_t, 1−ε, 1+ε) A_t) ] − α·L_value(φ) + β·H(π_θ(·|s_t)),

where Ê_t denotes the expectation over time steps t; ρ_t = π_θ(a_t|s_t) / π_θold(a_t|s_t) is the probability ratio, and clip(ρ_t, 1−ε, 1+ε) is a clipping function limiting the ratio to the interval [1−ε, 1+ε]; α and β are hyperparameters weighting the value-loss term and the entropy term, respectively; H(π_θ(·|s_t)) is the entropy of the policy network, used to encourage exploration.
(4) Updating the policy network.
The parameter θ of the policy network is updated by maximizing the objective function L_PPO2(θ):

θ_new = θ_old + argmax_θ L_PPO2(θ).
(5) Updating the value network.
The value network is updated by minimizing the mean square error

L(φ) = Ê_t[ (V_φ(s_t) − R_t)² ],

where R_t is the computed advantage-based return estimate.
Through mechanisms such as proximal policy optimization, GAE, the clipped surrogate loss and entropy regularization, the PPO algorithm effectively balances stability and exploration in training and improves the performance of the policy network.
A second aspect of the invention provides a reinforcement learning-based unmanned ship missile interception and avoidance system, comprising a monitoring module, a communication module and a main control module. The monitoring module is a hardware device carrying sensors such as sonar, camera and radar; it has a communication function and acquires the environmental information at the current moment. The main control module is a hardware device carrying a microprocessor with a wireless communication function; it records system history data, processes and analyzes environmental information, and outputs missile interception and avoidance instructions. The monitoring module collects surrounding environmental information in real time; the microprocessor processes and analyzes it and inputs the processed data into the unmanned ship missile interception and avoidance network model, which is trained with the prioritized experience replay algorithm to obtain the optimal network model and executes the interception and avoidance algorithm according to the first aspect of the invention.
A third aspect of the invention provides a computer readable storage medium having a computer program stored thereon which when executed by a computer processor implements any of the steps of the algorithm described in the first aspect of the invention.
In addition, the unmanned ship missile interception and avoidance algorithm based on reinforcement learning provided by the first aspect of the invention comprises the following steps: first, environmental information is acquired with the unmanned ship's sensors and it is judged whether a tracking missile is present; then, the coordinates, speed, course angle and other values of the unmanned ship, the tracking missile and the interceptor missile are obtained as the reinforcement learning state quantity; the state quantity is input into the reinforcement learning network model to obtain an action quantity, which is executed; these steps are repeated until the unmanned ship has fully avoided or intercepted the tracking missile. By addressing both avoidance and interception, the invention uses reinforcement learning to control the unmanned ship against missile threats; it is adaptive, efficient and timely, and has broad application prospects.
Compared with the prior art, the invention has the following beneficial effects:
1. For complex sea conditions, the invention designs a virtual environment for simulation training. The simulation can be completed on a server platform with little dependence on real scenes, which reduces the constraints on unmanned ship missile interception and avoidance training and avoids affecting the surrounding environment during training. The virtual-environment simulation training has strong autonomous learning ability and achieves network convergence quickly and efficiently, effectively saving training cost, shortening the training time and cycle, and rapidly improving the training effect.
2. Compared with a traditional single sensor, the combined sonar, camera and radar sensing allows the unmanned ship to rapidly identify tracking missiles in complex scenes such as heavy fog, heavy rain, underwater conditions and dim light at night, greatly improving the unmanned ship's recognition efficiency and thereby the universality and real-time performance of the algorithm.
3. While the unmanned ship evades the missile, the algorithm simultaneously controls the interceptor missile to intercept the tracking missile, realizing the double guarantee of evading and intercepting it. This provides double insurance for the unmanned ship to escape the missile threat, greatly reduces the probability of the unmanned ship being endangered by the tracking missile, and significantly improves the safety of the unmanned ship's work at sea.
4. The invention adopts prioritized experience replay during network model training: experience data are screened from the experience buffer pool, and the network parameters of the reinforcement learning unmanned ship missile interception and avoidance network model are updated from the screened experience tuples. The prioritized experience replay algorithm selects higher-priority experience data according to the current state and uses them efficiently for updates, which greatly improves the precision and efficiency of the network parameters, markedly raises the success rate of unmanned ship missile interception and avoidance, and safeguards the unmanned ship's work at sea.
5. Through autonomous learning and optimization, the algorithm enables the unmanned ship to respond more intelligently to missile attacks, improving its survivability in dangerous environments and its task execution.
Drawings
FIG. 1 is a schematic flow chart of the present invention in example 1;
FIG. 2 is a schematic flow chart of the present invention in example 2.
Detailed Description
The present invention will be described in further detail by way of the following specific examples, which are not intended to limit the scope of the present invention.
Example 1
An unmanned ship missile interception and avoidance algorithm based on reinforcement learning, as shown in fig. 1, comprises the following steps:
s1, setting an unmanned ship missile interception and avoidance virtual environment, wherein the virtual environment comprises a marine environment, an unmanned ship model, unmanned ships and missile motion models, a missile model, an obstacle model and the like.
S2, randomly setting a tracking missile at one point in the virtual environment for attacking the unmanned ship.
S3, acquiring environmental information at the current moment by using sensors such as sonar, camera and radar mounted on the unmanned ship, processing and analyzing the environmental information, and judging whether a tracking missile exists; if so, executing step S4; if not, repeatedly executing step S3.
S4, obtaining the speed and the course angle of the unmanned ship and the tracking missile; taking the center of the unmanned ship as the origin of the coordinate system, obtaining the X-Y-Z coordinates of the tracking missile at the current moment and the relative distance between the tracking missile and the unmanned ship, and calculating the elevation angle of the tracking missile relative to the unmanned ship; then obtaining the coordinates, the speed and the course angle of the interceptor missile.
S5, taking each variable in S4 as the state quantity s_t, inputting it into the unmanned ship missile interception and avoidance network model to obtain the action quantity a_t for controlling the unmanned ship propulsion and the interceptor missile launch, and controlling the unmanned ship's motion and the interceptor launch accordingly. The state quantity s_t consists of the variables below (the explicit tuple expression is omitted here), and the action quantity is a_t = (v_u′, v_l′, α_u′, α_l′, B_l), where (x_d, y_d, z_d) represents the coordinates of the tracking missile in the unmanned ship coordinate system; (x_l, y_l, z_l) represents the coordinates of the interceptor missile in the unmanned ship coordinate system; v_u, v_d, v_l respectively represent the speeds of the unmanned ship, the tracking missile and the interceptor missile; φ represents the elevation angle of the tracking missile relative to the unmanned ship; d represents the straight-line distance between the tracking missile and the unmanned ship; α_u, α_d, α_l respectively represent the course angles of the unmanned ship, the tracking missile and the interceptor missile; B_l is a Boolean variable that is 1 when the interceptor missile has been launched and 0 otherwise, and all interceptor-related parameters are 0 while the interceptor missile has not been launched.
Meanwhile, the current local reward function r_st(s_t) is calculated from the state quantity and related information.
The local reward function r_st(s_t) is a weighted function of the following quantities (the explicit formula is omitted here):
ω_1, ω_2, ω_3, ω_4 represent hyperparameters, adjusted according to the actual situation; d represents the straight-line distance between the tracking missile and the unmanned ship; τ represents a supplementary factor preventing d = 0 from causing a gradient explosion, typically τ = 0.05; t and s respectively represent the time elapsed since the start of the episode and the number of steps currently performed.
The unmanned ship missile interception and avoidance network model is an improved proximal policy optimization (Proximal Policy Optimization, hereinafter PPO) algorithm neural network.
The unmanned ship missile interception and avoidance network model comprises an Actor network and a Critic network; the Actor network is the policy network and the Critic network is the value network. The input of the Actor network is the state quantity s_t and its output is the action quantity a_t; the input of the Critic network is the current state quantity s_t and its output is the state value estimate of s_t.
The Actor network comprises an input layer, four linear hidden layers and an output layer: the input layer receives the state quantity s_t, the four linear hidden layers have 1024, 512, 256 and 128 nodes respectively, and the output value of the final output layer is the action quantity a_t. The Critic network comprises an input layer, two linear hidden layers and an output layer: the input layer likewise receives the state quantity s_t, the two linear hidden layers have 256 and 128 nodes, and the output layer yields the state value estimate.
S6, repeating steps S3-S5 to obtain the state quantity s_t′ at the next moment, and taking the current state quantity s_t, action a_t, reward r_st and next-moment state quantity s_t′ as a set of experience tuples (s_t, a_t, r_st, s_t′).
S7, repeatedly executing steps S3-S6 to obtain multiple sets of experience tuples and storing them in the experience buffer pool. When the unmanned ship's interceptor missile successfully intercepts the tracking missile, or the unmanned ship successfully evades it (the tracking missile explodes on an obstacle or plunges into the sea and explodes), the current episode's training ends. From the training result, the global reward function r_t(s_1, a_1, s_2, a_2, ..., s_T) is obtained; according to the global reward function and the number of steps in the episode, the local reward function r_st(s_t) of each step is updated to obtain the global-local reward function r_t(s_i), and each corresponding set of experience tuples is updated to (s_t, a_t, r_t, s_t′).
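The per-episode bookkeeping of S6-S7 can be sketched as below; because the exact formula combining the global reward with each step's local reward is not reproduced in the source, the uniform spreading used in finalize is an assumption for illustration only:

```python
class EpisodeBuffer:
    """Collects (s_t, a_t, r_st, s_t') and rewrites rewards at episode end."""

    def __init__(self):
        self.steps = []

    def add(self, s, a, r_local, s_next):
        self.steps.append((s, a, r_local, s_next))

    def finalize(self, global_reward):
        """Return tuples updated to (s_t, a_t, r_t, s_t'); uniform spread assumed."""
        T = max(len(self.steps), 1)
        return [(s, a, r + global_reward / T, s_next)
                for (s, a, r, s_next) in self.steps]
```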
The global reward function r_t(s_1, a_1, s_2, a_2, ..., s_T) is composed of an interception reward R_l and an evasion reward R_b, each defined piecewise over the episode outcome (the explicit formulas are omitted here).
in step S7, according to the global rewards function r t (s 1 ,a 1 ,s 2 ,a 2 ,...,s T ) And a local rewards function r st (s t ) Obtaining global local rewarding function r t (s i ) The method comprises the following steps:
s8, repeatedly executing S2-S7, counting the number of repeated bureaus, executing the step S9 if the number of bureaus reaches the number E required by parameter updating, temporarily clearing the count, counting again, and continuously repeatedly executing S2-S7 if the number of bureaus is smaller than E until the number of bureaus reaches E bureaus.
S9, screening n sets of experience tuples from the experience buffer pool by the prioritized experience replay algorithm, and sequentially updating the parameters of the unmanned ship missile interception and avoidance network model with the screened n sets of experience tuples.
The experience data are screened from the experience buffer pool by the prioritized experience replay algorithm according to the sampling probability P(c) = p_c^α / Σ_k p_k^α given above, wherein P(c) represents the probability that the c-th set of experience data (s_t, a_t, r_st, s_t′) is selected from the experience buffer pool; c represents the serial number of the extracted experience data; p_k represents the priority of the k-th set of experience data; and α represents a preset parameter for adjusting the degree of preferential sampling of the data samples.
The parameters of the unmanned ship missile interception and avoidance network model are updated and optimized with the PPO algorithm, in the following steps:
(1) Defining variables.
The parameters of the policy network are denoted by θ, and π_θ(a_t|s_t) denotes the probability of the policy outputting action a_t in state s_t; the parameters of the value network are denoted here by φ, and V_φ(s_t) denotes the output of the value network, estimating the value of state s_t.
(2) Defining the advantage function.
The advantage function A(s_t, a_t) denotes the advantage of executing action a_t in state s_t, computed as the difference between the return of the current state and a baseline value:

A(s_t, a_t) = Q(s_t, a_t) − V_φ(s_t),

where Q(s_t, a_t) is the cumulative return after executing action a_t, here the discounted sum

Q(s_t, a_t) = Σ_{t′=t}^{T} γ^{t′−t} · r_{t′},

and V_φ(s_t) is the value network's estimate of state s_t; t represents the current time step, T represents the number of steps in the episode, γ represents the discount factor expressing how much future rewards are discounted, and r_{t′} represents the reward obtained at time t′.
(3) The objective function.
The objective function consists of three parts: the policy network's ratio term, the value-function loss term and an entropy regularization term. As above, in the standard PPO-clip form:

L_PPO2(θ) = Ê_t[ min(ρ_t A_t, clip(ρ_t, 1−ε, 1+ε) A_t) ] − α·L_value(φ) + β·H(π_θ(·|s_t)),

where Ê_t denotes the expectation over time steps t; ρ_t = π_θ(a_t|s_t) / π_θold(a_t|s_t) is the probability ratio, and clip(ρ_t, 1−ε, 1+ε) limits the ratio to the interval [1−ε, 1+ε]; α and β are hyperparameters weighting the value-loss term and the entropy term, respectively; H(π_θ(·|s_t)) is the entropy of the policy network, used to encourage exploration.
(4) Updating the policy network.
The parameter θ of the policy network is updated by maximizing the objective function L_PPO2(θ):

θ_new = θ_old + argmax_θ L_PPO2(θ).
(5) Updating the value network.
The value network is updated by minimizing the mean square error

L(φ) = Ê_t[ (V_φ(s_t) − R_t)² ],

where R_t is the computed advantage-based return estimate.
Through mechanisms such as proximal policy optimization, GAE, the clipped surrogate loss and entropy regularization, the PPO algorithm effectively balances stability and exploration in training and improves the performance of the policy network.
S10, judging whether the total episode count reaches the preset number of cycles N; if so, ending the loop and concluding the unmanned ship missile interception and avoidance network model training to obtain the optimal network model.
Example 2
An unmanned ship missile interception and avoidance algorithm based on reinforcement learning, as shown in fig. 2, comprises the following steps:
s11, acquiring environmental information at the current moment by using sensors such as sonar, camera and radar mounted on the unmanned ship, processing and analyzing the environmental information, judging whether a tracking missile exists or not, if yes, executing the step S22, and if not, repeatedly executing the step S11.
S22, obtaining the speed and the course angle of the unmanned ship and the tracking missile; taking the center of the unmanned ship as the origin of the coordinate system, obtaining the X-Y-Z coordinates of the tracking missile at the current moment and the relative distance between the tracking missile and the unmanned ship, and calculating the elevation angle of the tracking missile relative to the unmanned ship; then obtaining the coordinates, the speed and the course angle of the interceptor missile.
S33, inputting all variables in S22 as state quantities into the unmanned ship missile interception and avoidance network model, obtaining the action quantity for controlling the unmanned ship propulsion and the interceptor missile launch, and controlling the unmanned ship's motion and the interceptor launch accordingly; the unmanned ship missile interception and avoidance network model here is the optimal network model obtained after the unmanned ship missile interception and avoidance training of the first aspect.
S44, repeating steps S11-S33 so that the unmanned ship evades the tracking missile while intercepting it, achieving the purpose of unmanned ship missile interception and avoidance.
Example 3
An unmanned ship missile interception and avoidance system based on reinforcement learning comprises a monitoring module, a communication module and a main control module.
The monitoring module is hardware equipment carrying sensors such as sonar, camera and radar, has a communication function and is used for acquiring environmental information at the current moment.
The main control module is a hardware device carrying a microprocessor with a wireless communication function, used for recording system history data, processing and analyzing environmental information, and outputting missile interception and avoidance instructions.
The monitoring module collects surrounding environmental information in real time; the microprocessor processes and analyzes it and inputs the processed data into the unmanned ship missile interception and avoidance network model, which is trained with the prioritized experience replay algorithm to obtain the optimal network model and executes the interception and avoidance algorithm according to the first aspect of the invention.
The specific flow by which the unmanned ship missile interception and avoidance system realizes the interception and avoidance algorithm of embodiments 1-2 is as follows:
(1) The monitoring module monitors and collects surrounding training environment information in real time and communicates with the main control module through the communication module; the main control module processes the environment information and judges whether a tracking missile exists.
(2) If a tracking missile exists, the information of the tracking missile is calculated with the center of the unmanned ship as the origin of the coordinate system; each variable is input as a state quantity into the unmanned ship missile interception and avoidance network model in the main control module, and the rewards of the model are set according to the training results.
(3) Training is repeated to obtain multiple groups of experience tuples, which are stored in the experience buffer pool, and the rewards of the unmanned ship missile interception and avoidance network model are modified according to the interception and avoidance training results; training is repeated until the set number of episodes is reached.
(4) Experience tuples are screened from the experience buffer pool by the prioritized experience replay algorithm, and the unmanned ship missile interception and avoidance network model is updated.
(5) The above steps are cycled until the preset number of cycles is reached, yielding the optimal network model for unmanned ship missile interception and avoidance control and realizing the goals of the unmanned ship evading the tracking missile and the interceptor missile intercepting it.
Example 4
A computer readable storage medium having stored thereon a computer program which, when executed by a computer processor, performs the steps of the reinforcement learning-based unmanned ship missile interception and avoidance algorithm described in embodiments 1-2.
The computer readable medium described herein may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
In conclusion, the invention effectively overcomes the defects in the prior art and has high industrial utilization value. The above-described embodiments are provided to illustrate the gist of the present invention, but are not intended to limit the scope of the present invention. It will be understood by those skilled in the art that various modifications and equivalent substitutions may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. An unmanned ship missile interception and avoidance algorithm based on reinforcement learning, characterized by comprising the following steps:
S11, acquiring environmental information at the current moment by using the sonar, camera and radar carried on the unmanned ship, processing and analyzing the environmental information, and judging whether a tracking missile exists; if yes, executing step S22; if not, repeatedly executing step S11;
S22, obtaining the speed and the course angle of the unmanned ship and the tracking missile; taking the center of the unmanned ship as the origin of the coordinate system, obtaining the X-Y-Z coordinates of the tracking missile at the current moment and the relative distance between the tracking missile and the unmanned ship, and calculating the elevation angle of the tracking missile relative to the unmanned ship; then obtaining the coordinates, the speed and the course angle of the interceptor missile;
S33, inputting each variable in S22 as a state quantity into the unmanned ship missile interception and avoidance network model to obtain an action quantity for controlling the unmanned ship propulsion and the interceptor missile launch; the unmanned ship missile interception and avoidance network model being the optimal network model obtained after training the reinforcement learning-based unmanned ship missile interception and avoidance algorithm;
S44, repeating steps S11-S33 so that the unmanned ship achieves the purpose of missile interception and avoidance.
2. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm of claim 1, comprising the steps of:
S1, setting up an unmanned ship missile interception and avoidance virtual environment, the virtual environment comprising a marine environment, an unmanned ship model, unmanned ship and missile motion models, a missile model and an obstacle model;
S2, randomly setting a tracking missile at one point in the virtual environment for attacking the unmanned ship;
S3, acquiring environmental information at the current moment by using the sonar, camera and radar carried on the unmanned ship, processing and analyzing the environmental information, and judging whether a tracking missile exists; if so, executing step S4; if not, repeatedly executing step S3;
S4, obtaining the speed and the course angle of the unmanned ship and the tracking missile; taking the center of the unmanned ship as the origin of the coordinate system, obtaining the X-Y-Z coordinates of the tracking missile at the current moment and the relative distance between the tracking missile and the unmanned ship, and calculating the elevation angle of the tracking missile relative to the unmanned ship; obtaining the coordinates, the speed and the course angle of the interceptor missile;
S5, taking each variable in S4 as the state quantity s_t and inputting it into the unmanned ship missile interception and avoidance network model to obtain the action quantity a_t for controlling the unmanned ship propulsion and the interceptor missile launch; meanwhile, obtaining the current local reward function r_st(s_t) based on the state quantity information; the unmanned ship missile interception and avoidance network model being an improved proximal policy optimization algorithm neural network;
S6, repeating steps S3-S5 to obtain the state quantity s_t′ at the next moment, and taking the current state quantity s_t, action a_t, reward r_st and next-moment state quantity s_t′ as a set of experience tuples (s_t, a_t, r_st, s_t′);
S7, repeatedly executing steps S3-S6 to obtain multiple sets of experience tuples and storing them in an experience buffer pool; when the unmanned ship's interceptor missile successfully intercepts the tracking missile or the unmanned ship successfully evades it, ending the current episode's training; obtaining the global reward function r_t(s_1, a_1, s_2, a_2, ..., s_T) according to the training result; updating the local reward function r_st(s_t) of each step in the episode according to the global reward function and the number of steps in the episode to obtain the global-local reward function r_t(s_i); and updating each corresponding set of experience tuples to (s_t, a_t, r_t, s_t′);
S8, repeatedly executing S2-S7 while counting the number of completed episodes; if the episode count reaches the number E required for a parameter update, executing step S9, clearing the count and counting again; if the count is smaller than E, continuing to repeat S2-S7 until E episodes are reached;
S9, screening n sets of experience tuples from the experience buffer pool by a prioritized experience replay algorithm, and sequentially updating the parameters of the unmanned ship missile interception and avoidance network model with the screened n sets of experience tuples;
S10, judging whether the total episode count reaches the preset number of cycles N; if so, ending the loop and concluding the unmanned ship missile interception and avoidance network model training to obtain the optimal network model.
3. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm according to claim 2, wherein in step S5 the state quantity s_t consists of the variables defined below and the action quantity is a_t = (v_u′, v_l′, α_u′, α_l′, B_l);
wherein (x_d, y_d, z_d) represents the coordinates of the tracking missile in the unmanned ship coordinate system; (x_l, y_l, z_l) represents the coordinates of the interceptor missile in the unmanned ship coordinate system; v_u, v_d, v_l respectively represent the speeds of the unmanned ship, the tracking missile and the interceptor missile; φ represents the elevation angle of the tracking missile relative to the unmanned ship; d represents the straight-line distance between the tracking missile and the unmanned ship; α_u, α_d, α_l respectively represent the course angles of the unmanned ship, the tracking missile and the interceptor missile; B_l is a Boolean variable that is 1 when the interceptor missile is launched and 0 otherwise, all interceptor-related parameters being 0 when the interceptor missile has not been launched.
4. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm according to claim 2, wherein in step S5 the local reward function r_st(s_t) is a weighted function of the following quantities (the explicit formula is omitted here):
wherein ω_1, ω_2, ω_3, ω_4 represent hyperparameters; d represents the straight-line distance between the tracking missile and the unmanned ship; τ represents a supplementary coefficient; t and s respectively represent the time elapsed since the start of the episode and the number of steps currently performed.
5. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm according to claim 2, wherein in step S7 the global reward function r_t(s_1, a_1, s_2, a_2, ..., s_T) is composed of an interception reward R_l and an evasion reward R_b, each defined piecewise over the episode outcome (the explicit formulas are omitted here).
6. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm according to claim 2, wherein in step S7 the global-local reward function r_t(s_i) is obtained from the global reward function and the local reward function (the explicit combination formula is omitted here).
7. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm according to claim 2, wherein in step S5 the unmanned ship missile interception and avoidance network model comprises an Actor network and a Critic network; the input of the Actor network is the state quantity s_t and its output is the action quantity a_t; the input of the Critic network is the current state quantity s_t and its output is the state value estimate of s_t.
8. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm according to claim 2, wherein in step S9 the experience data are screened from the experience buffer pool by a prioritized experience replay algorithm according to P(c) = p_c^α / Σ_k p_k^α,
wherein P(c) represents the probability that the c-th set of experience data (s_t, a_t, r_st, s_t′) is selected from the experience buffer pool; c represents the serial number of the extracted experience data; p_k represents the priority of the k-th set of experience data; and α represents a preset parameter for adjusting the degree of preferential sampling of the data samples.
9. An unmanned ship missile interception and avoidance system based on reinforcement learning, characterized by comprising a monitoring module, a communication module and a main control module; the monitoring module is a hardware device carrying sonar, a camera and radar, has a communication function and is used for acquiring environmental information at the current moment; the main control module is a hardware device carrying a microprocessor, has a wireless communication function, and is used for recording system history data, processing and analyzing environmental information and outputting missile interception and avoidance instructions; the monitoring module collects surrounding environmental information in real time, the microprocessor processes and analyzes it, the processed data are input into the unmanned ship missile interception and avoidance network model, the prioritized experience replay algorithm is adopted to obtain the optimal unmanned ship missile interception and avoidance network model, and the interception and avoidance algorithm according to any one of claims 1-8 is executed.
10. A computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when executed by a computer processor, the computer program implements the reinforcement learning-based unmanned ship missile interception and avoidance algorithm of any one of claims 1-8.
CN202311681927.1A 2023-12-08 2023-12-08 Unmanned ship missile interception and avoidance algorithm based on reinforcement learning, interception and avoidance system and readable storage medium Pending CN117666589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311681927.1A CN117666589A (en) 2023-12-08 2023-12-08 Unmanned ship missile interception and avoidance algorithm based on reinforcement learning, interception and avoidance system and readable storage medium


Publications (1)

Publication Number Publication Date
CN117666589A 2024-03-08

Family

ID=90082299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311681927.1A Pending CN117666589A (en) 2023-12-08 2023-12-08 Unmanned ship missile interception and avoidance algorithm based on reinforcement learning, interception and avoidance system and readable storage medium

Country Status (1)

Country Link
CN (1) CN117666589A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination