CN117666589A - Unmanned ship missile interception and avoidance algorithm based on reinforcement learning, interception and avoidance system and readable storage medium - Google Patents


Info

Publication number
CN117666589A
CN117666589A (application CN202311681927.1A)
Authority
CN
China
Prior art keywords
missile
unmanned ship
interception
avoidance
unmanned
Prior art date
Legal status
Pending
Application number
CN202311681927.1A
Other languages
Chinese (zh)
Inventor
刘军
孔凤杰
肖翰文
丁晓蕾
尹小丹
冯亿喆
王颖
程士豪
Current Assignee
Jining University
Original Assignee
Jining University
Priority date
Filing date
Publication date
Application filed by Jining University filed Critical Jining University
Priority to CN202311681927.1A
Publication of CN117666589A
Legal status: Pending


Abstract

The invention belongs to the technical field of unmanned ships, and specifically discloses an unmanned ship missile interception and avoidance algorithm based on reinforcement learning, comprising the following steps: first, environmental information is acquired with the unmanned ship's sensors and it is judged whether a tracking missile is present; then, the coordinates, speed, course angle and other values of the unmanned ship, the tracking missile and the interceptor missile are obtained as the reinforcement learning state quantity; the state quantity is input into the reinforcement learning network model to obtain an action quantity, which is then executed; these steps are repeated until the unmanned ship has fully avoided or intercepted the tracking missile. By addressing both avoidance and interception, the invention uses reinforcement learning to control the unmanned ship against missile threats; it is adaptive, efficient and timely, and has broad application prospects.

Description

Unmanned ship missile interception and avoidance algorithm based on reinforcement learning, interception and avoidance system and readable storage medium
Technical Field
The invention belongs to the technical field of unmanned ships, and particularly relates to an unmanned ship missile interception and avoidance algorithm based on reinforcement learning, an interception and avoidance system and a readable storage medium.
Background
With the application of unmanned ships in military, security, emergency rescue and other fields, the missile threats they face keep increasing, and missile interception and avoidance have become key to their normal navigation. For the problem of unmanned ship missile interception and avoidance, traditional methods mainly rely on rules and pre-designed control algorithms; they struggle with the complexity and uncertainty of missile threats, are limited when dealing with complex and changeable threats, and adapt poorly to different environments and tactical situations. In addition, existing algorithms have inherent limitations in adapting to different environments and adversary strategies and cannot perform autonomous learning and optimization, so the accuracy and efficiency of missile interception and avoidance are hard to guarantee, affecting the safety of unmanned ships working at sea.
Disclosure of Invention
Based on the above problems, the invention provides an unmanned ship missile interception and avoidance algorithm based on reinforcement learning, an interception and avoidance system and a readable storage medium.
Based on the above purpose, the invention is realized through the following technical scheme:
A first aspect of the invention provides an unmanned ship missile interception and avoidance algorithm based on reinforcement learning, comprising the following steps:
S11, acquiring environmental information at the current moment by using sensors such as sonar, camera and radar mounted on the unmanned ship, processing and analyzing the environmental information, and judging whether a tracking missile exists; if yes, executing step S22; if not, repeatedly executing step S11.
S22, obtaining the speed and the course angle of the unmanned ship and the tracking missile; taking the center of the unmanned ship as the origin of the coordinate system, obtaining the X-Y-Z coordinates of the tracking missile at the current moment and the relative distance between the tracking missile and the unmanned ship, and calculating the elevation angle of the tracking missile relative to the unmanned ship; then obtaining the coordinates, the speed and the course angle of the interceptor missile.
S33, inputting all variables in S22 as the state quantity into the unmanned ship missile interception and avoidance network model, obtaining the action quantity for controlling the unmanned ship propulsion and the interceptor missile launch, and controlling the unmanned ship's motion and the interceptor launch accordingly; the unmanned ship missile interception and avoidance network model is the optimal network model obtained after training the reinforcement learning-based unmanned ship missile interception and avoidance algorithm.
S44, repeating steps S11-S33 so that the unmanned ship evades the tracking missile while intercepting it, achieving the purpose of unmanned ship missile interception and avoidance.
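By way of illustration only, a minimal Python sketch of this deployed S11-S44 loop follows; every interface name here (get_observation, detect_tracking_missile, build_state, act, apply_action, threat_cleared) is a hypothetical placeholder, not from the patent:

```python
def interception_avoidance_loop(policy, ship, dt=0.1):
    """Run the trained policy until the tracking missile is avoided or intercepted."""
    while True:
        obs = ship.get_observation()              # S11: fused sonar/camera/radar data
        if not ship.detect_tracking_missile(obs):
            continue                              # no threat detected: keep monitoring
        s_t = ship.build_state(obs)               # S22: coordinates, speeds, course angles
        a_t = policy.act(s_t)                     # S33: propulsion + interceptor launch command
        ship.apply_action(a_t, dt)
        if ship.threat_cleared():                 # S44: missile avoided or intercepted
            break
```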
Preferably, the method comprises the following steps:
s1, setting an unmanned ship missile interception and avoidance virtual environment, wherein the unmanned ship missile interception and avoidance virtual environment comprises a marine environment, an unmanned ship model, an unmanned ship and missile motion model, a missile model and an obstacle model.
S2, randomly setting a tracking missile at one point in the virtual environment for attacking the unmanned ship.
S3, acquiring environmental information at the current moment by using sensors such as sonar, camera and radar mounted on the unmanned ship, processing and analyzing the environmental information, and judging whether a tracking missile exists; if so, executing step S4; if not, repeatedly executing step S3.
S4, obtaining the speed and the course angle of the unmanned ship and the tracking missile; taking the center of the unmanned ship as the origin of the coordinate system, obtaining the X-Y-Z coordinates of the tracking missile at the current moment and the relative distance between the tracking missile and the unmanned ship, and calculating the elevation angle of the tracking missile relative to the unmanned ship; then obtaining the coordinates, the speed and the course angle of the interceptor missile.
S5, taking each variable in S4 as the state quantity s_t and inputting it into the unmanned ship missile interception and avoidance network model to obtain the action quantity a_t for controlling the unmanned ship propulsion and the interceptor missile launch, and controlling the unmanned ship's motion and the interceptor launch accordingly; meanwhile, obtaining the current local reward function r_st(s_t) based on the state quantity information. The unmanned ship missile interception and avoidance network model is an improved proximal policy optimization (Proximal Policy Optimization, hereinafter PPO) algorithm neural network.
S6, repeating steps S3-S5 to obtain the state quantity s_t′ at the next moment, and taking the current state quantity s_t, action a_t, reward r_st and next-moment state quantity s_t′ as a set of experience tuples (s_t, a_t, r_st, s_t′).
S7, repeatedly executing steps S3-S6 to obtain multiple sets of experience tuples and storing them in an experience buffer pool. When the unmanned ship's interceptor missile successfully intercepts the tracking missile, or the unmanned ship successfully evades it, the current episode's training ends. From the training result, the global reward function r_t(s_1, a_1, s_2, a_2, ..., s_T) is obtained; according to the global reward function and the number of steps in the episode, the local reward function r_st(s_t) of each step in the episode is updated to obtain the global-local reward function r_t(s_i), and each corresponding set of experience tuples is updated to (s_t, a_t, r_t, s_t′).
S8, repeatedly executing S2-S7 while counting the number of completed episodes; if the episode count reaches the number E required for a parameter update, executing step S9, clearing the count and counting again; if the count is smaller than E, continuing to repeat S2-S7 until E episodes are reached.
S9, screening n sets of experience tuples from the experience buffer pool by a prioritized experience replay algorithm, and sequentially updating the parameters of the unmanned ship missile interception and avoidance network model with the screened n sets of experience tuples.
S10, judging whether the total episode count reaches the preset number of cycles N; if so, ending the loop and concluding the unmanned ship missile interception and avoidance network model training to obtain the optimal network model.
Preferably, in step S5, the state quantity s_t consists of the variables defined below (the explicit tuple expression is omitted here), and the action quantity is a_t = (v_u′, v_l′, α_u′, α_l′, B_l);
wherein (x_d, y_d, z_d) represents the coordinates of the tracking missile in the unmanned ship coordinate system; (x_l, y_l, z_l) represents the coordinates of the interceptor missile in the unmanned ship coordinate system; v_u, v_d, v_l respectively represent the speeds of the unmanned ship, the tracking missile and the interceptor missile; the elevation angle of the tracking missile relative to the unmanned ship is written φ here; d represents the straight-line distance between the tracking missile and the unmanned ship; α_u, α_d, α_l respectively represent the course angles of the unmanned ship, the tracking missile and the interceptor missile; B_l is a Boolean variable that is 1 when the interceptor missile has been launched and 0 otherwise, and all interceptor-related parameters are 0 while the interceptor missile has not been launched.
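As a sketch only, the state and action quantities can be assembled as flat vectors; the ordering inside s_t is an assumption (the source renders the tuple as an image), and the function names are illustrative:

```python
import numpy as np

def build_state(x_d, y_d, z_d, x_l, y_l, z_l, v_u, v_d, v_l,
                phi, d, a_u, a_d, a_l, B_l):
    """Assumed layout of s_t; interceptor-related entries stay 0 while B_l == 0."""
    return np.array([x_d, y_d, z_d, x_l, y_l, z_l, v_u, v_d, v_l,
                     phi, d, a_u, a_d, a_l, float(B_l)], dtype=np.float32)

def build_action(v_u_new, v_l_new, a_u_new, a_l_new, launch):
    """a_t = (v_u', v_l', alpha_u', alpha_l', B_l), as given in the text."""
    return np.array([v_u_new, v_l_new, a_u_new, a_l_new,
                     1.0 if launch else 0.0], dtype=np.float32)
```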
Preferably, in step S5, the local reward function r_st(s_t) is a weighted function of the following quantities (the explicit formula is omitted here):
wherein ω_1, ω_2, ω_3, ω_4 represent hyperparameters, adjusted according to the actual situation; d represents the straight-line distance between the tracking missile and the unmanned ship; τ represents a supplementary factor preventing d = 0 from causing a gradient explosion, typically τ = 0.05; t and s respectively represent the time elapsed since the start of the episode and the number of steps currently performed.
Preferably, in step S7, the global reward function r_t(s_1, a_1, s_2, a_2, ..., s_T) is composed of an interception reward R_l and an evasion reward R_b, each defined piecewise over the episode outcome (the explicit formulas are omitted here).
preferably, in step S7, the global rewards function r is used t (s 1 ,a 1 ,s 2 ,a 2 ,...,s T ) And a local rewards function r st (s t ) Obtaining global local rewarding function r t (s i ) The method comprises the following steps:
preferably, in step S5, the unmanned ship missile interception and avoidance network model includes an Actor network and a Critic network; the Actor network is a strategy network, and the Critic network is a value network; the input to the Actor network is the state quantity s t Output is the action quantity a t The method comprises the steps of carrying out a first treatment on the surface of the The input of the Critic network is the state quantity s at the current moment t Output is state quantity s t Is a state value evaluation value of (a).
Preferably, the Actor network comprises an input layer, four linear hidden layers and an output layer: the input layer receives the state quantity s_t, the four linear hidden layers have 1024, 512, 256 and 128 nodes respectively, and the output value of the final output layer is the action quantity a_t. The Critic network comprises an input layer, two linear hidden layers and an output layer: the input layer likewise receives the state quantity s_t, the two linear hidden layers have 256 and 128 nodes, and the output layer yields the state value estimate.
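A minimal PyTorch sketch of the two networks follows; the input/output dimensions and the ReLU activations are assumptions, since the text specifies only the hidden-layer node counts:

```python
import torch.nn as nn

STATE_DIM, ACTION_DIM = 15, 5  # assumed from the variable lists above

class Actor(nn.Module):
    """Policy network: 1024-512-256-128 linear hidden layers, outputs a_t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, ACTION_DIM),
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Value network: 256-128 linear hidden layers, outputs a scalar state value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, s):
        return self.net(s)
```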
Preferably, in step S9, the experience data are screened from the experience buffer pool by the prioritized experience replay algorithm according to the sampling probability (reconstructed here in the standard prioritized-sampling form; the source renders the formula as an image):

P(c) = p_c^α / Σ_k p_k^α,

wherein P(c) represents the probability that the c-th set of experience data (s_t, a_t, r_st, s_t′) is selected from the experience buffer pool; c represents the serial number of the extracted experience data; p_k represents the priority of the k-th set of experience data; and α represents a preset parameter for adjusting the degree of preferential sampling of the data samples.
Preferably, in step S9, the parameters of the unmanned ship missile interception and avoidance network model are updated and optimized with the PPO algorithm, in the following steps:
(1) Defining variables.
The parameters of the policy network are denoted by θ, and π_θ(a_t|s_t) denotes the probability of the policy outputting action a_t in state s_t; the parameters of the value network are denoted here by φ, and V_φ(s_t) denotes the output of the value network, estimating the value of state s_t.
(2) Defining the advantage function.
The advantage function A(s_t, a_t) denotes the advantage of executing action a_t in state s_t, computed as the difference between the return of the current state and a baseline value:

A(s_t, a_t) = Q(s_t, a_t) − V_φ(s_t),

where Q(s_t, a_t) is the cumulative return after executing action a_t, here the discounted sum

Q(s_t, a_t) = Σ_{t′=t}^{T} γ^{t′−t} · r_{t′},

and V_φ(s_t) is the value network's estimate of state s_t; t represents the current time step, T represents the number of steps in the episode, γ represents the discount factor expressing how much future rewards are discounted, and r_{t′} represents the reward obtained at time t′.
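As an illustration, these two quantities can be computed as follows; γ = 0.99 is purely an example value:

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Q(s_t, a_t): discounted sum of rewards from step t to the episode end T."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return torch.tensor(list(reversed(returns)), dtype=torch.float32)

def advantage(returns, values):
    """A(s_t, a_t) = Q(s_t, a_t) - V_phi(s_t)."""
    return returns - values.detach()
```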
(3) The objective function.
The objective function consists of three parts: the policy network's ratio term, the value-function loss term and an entropy regularization term. Reconstructed here in the standard PPO-clip form (the source renders the formula as an image):

L_PPO2(θ) = Ê_t[ min(ρ_t A_t, clip(ρ_t, 1−ε, 1+ε) A_t) ] − α·L_value(φ) + β·H(π_θ(·|s_t)),

where Ê_t denotes the expectation over time steps t; ρ_t = π_θ(a_t|s_t) / π_θold(a_t|s_t) is the probability ratio, and clip(ρ_t, 1−ε, 1+ε) is a clipping function limiting the ratio to the interval [1−ε, 1+ε]; α and β are hyperparameters weighting the value-loss term and the entropy term, respectively; H(π_θ(·|s_t)) is the entropy of the policy network, used to encourage exploration.
(4) Updating the policy network.
The parameter θ of the policy network is updated by maximizing the objective function L_PPO2(θ):

θ_new = θ_old + argmax_θ L_PPO2(θ).
(5) Updating the value network.
The value network is updated by minimizing the mean square error

L(φ) = Ê_t[ (V_φ(s_t) − R_t)² ],

where R_t is the computed advantage-based return estimate.
Through mechanisms such as proximal policy optimization, GAE, the clipped surrogate loss and entropy regularization, the PPO algorithm effectively balances stability and exploration in training and improves the performance of the policy network.
A second aspect of the invention provides a reinforcement learning-based unmanned ship missile interception and avoidance system, comprising a monitoring module, a communication module and a main control module. The monitoring module is a hardware device carrying sensors such as sonar, camera and radar; it has a communication function and acquires the environmental information at the current moment. The main control module is a hardware device carrying a microprocessor with a wireless communication function; it records system history data, processes and analyzes environmental information, and outputs missile interception and avoidance instructions. The monitoring module collects surrounding environmental information in real time; the microprocessor processes and analyzes it and inputs the processed data into the unmanned ship missile interception and avoidance network model, which is trained with the prioritized experience replay algorithm to obtain the optimal network model and executes the interception and avoidance algorithm according to the first aspect of the invention.
A third aspect of the invention provides a computer readable storage medium having a computer program stored thereon which when executed by a computer processor implements any of the steps of the algorithm described in the first aspect of the invention.
In addition, the unmanned ship missile interception and avoidance algorithm based on reinforcement learning provided by the first aspect of the invention comprises the following steps: first, environmental information is acquired with the unmanned ship's sensors and it is judged whether a tracking missile is present; then, the coordinates, speed, course angle and other values of the unmanned ship, the tracking missile and the interceptor missile are obtained as the reinforcement learning state quantity; the state quantity is input into the reinforcement learning network model to obtain an action quantity, which is executed; these steps are repeated until the unmanned ship has fully avoided or intercepted the tracking missile. By addressing both avoidance and interception, the invention uses reinforcement learning to control the unmanned ship against missile threats; it is adaptive, efficient and timely, and has broad application prospects.
Compared with the prior art, the invention has the following beneficial effects:
1. For complex sea conditions, the invention designs a virtual environment for simulation training. The simulation can be completed on a server platform with little dependence on real scenes, which reduces the constraints on unmanned ship missile interception and avoidance training and avoids affecting the surrounding environment during training. The virtual-environment simulation training has strong autonomous learning ability and achieves network convergence quickly and efficiently, effectively saving training cost, shortening the training time and cycle, and rapidly improving the training effect.
2. Compared with a traditional single sensor, the combined sonar, camera and radar sensing allows the unmanned ship to rapidly identify tracking missiles in complex scenes such as heavy fog, heavy rain, underwater conditions and dim light at night, greatly improving the unmanned ship's recognition efficiency and thereby the universality and real-time performance of the algorithm.
3. While the unmanned ship evades the missile, the algorithm simultaneously controls the interceptor missile to intercept the tracking missile, realizing the double guarantee of evading and intercepting it. This provides double insurance for the unmanned ship to escape the missile threat, greatly reduces the probability of the unmanned ship being endangered by the tracking missile, and significantly improves the safety of the unmanned ship's work at sea.
4. The invention adopts prioritized experience replay during network model training: experience data are screened from the experience buffer pool, and the network parameters of the reinforcement learning unmanned ship missile interception and avoidance network model are updated from the screened experience tuples. The prioritized experience replay algorithm selects higher-priority experience data according to the current state and uses them efficiently for updates, which greatly improves the precision and efficiency of the network parameters, markedly raises the success rate of unmanned ship missile interception and avoidance, and safeguards the unmanned ship's work at sea.
5. Through autonomous learning and optimization, the algorithm enables the unmanned ship to respond more intelligently to missile attacks, improving its survivability in dangerous environments and its task execution.
Drawings
FIG. 1 is a schematic flow chart of the present invention in example 1;
FIG. 2 is a schematic flow chart of the present invention in example 2.
Detailed Description
The present invention will be described in further detail by way of the following specific examples, which are not intended to limit the scope of the present invention.
Example 1
An unmanned ship missile interception and avoidance algorithm based on reinforcement learning, as shown in fig. 1, comprises the following steps:
s1, setting an unmanned ship missile interception and avoidance virtual environment, wherein the virtual environment comprises a marine environment, an unmanned ship model, unmanned ships and missile motion models, a missile model, an obstacle model and the like.
S2, randomly setting a tracking missile at one point in the virtual environment for attacking the unmanned ship.
S3, acquiring environmental information at the current moment by using sensors such as sonar, camera and radar mounted on the unmanned ship, processing and analyzing the environmental information, and judging whether a tracking missile exists; if so, executing step S4; if not, repeatedly executing step S3.
S4, obtaining the speed and the course angle of the unmanned ship and the tracking missile; taking the center of the unmanned ship as the origin of the coordinate system, obtaining the X-Y-Z coordinates of the tracking missile at the current moment and the relative distance between the tracking missile and the unmanned ship, and calculating the elevation angle of the tracking missile relative to the unmanned ship; then obtaining the coordinates, the speed and the course angle of the interceptor missile.
S5, taking each variable in S4 as the state quantity s_t, inputting it into the unmanned ship missile interception and avoidance network model to obtain the action quantity a_t for controlling the unmanned ship propulsion and the interceptor missile launch, and controlling the unmanned ship's motion and the interceptor launch accordingly. The state quantity s_t consists of the variables below (the explicit tuple expression is omitted here), and the action quantity is a_t = (v_u′, v_l′, α_u′, α_l′, B_l), where (x_d, y_d, z_d) represents the coordinates of the tracking missile in the unmanned ship coordinate system; (x_l, y_l, z_l) represents the coordinates of the interceptor missile in the unmanned ship coordinate system; v_u, v_d, v_l respectively represent the speeds of the unmanned ship, the tracking missile and the interceptor missile; φ represents the elevation angle of the tracking missile relative to the unmanned ship; d represents the straight-line distance between the tracking missile and the unmanned ship; α_u, α_d, α_l respectively represent the course angles of the unmanned ship, the tracking missile and the interceptor missile; B_l is a Boolean variable that is 1 when the interceptor missile has been launched and 0 otherwise, and all interceptor-related parameters are 0 while the interceptor missile has not been launched.
Meanwhile, the current local reward function r_st(s_t) is calculated from the state quantity and related information.
The local reward function r_st(s_t) is a weighted function of the following quantities (the explicit formula is omitted here):
ω_1, ω_2, ω_3, ω_4 represent hyperparameters, adjusted according to the actual situation; d represents the straight-line distance between the tracking missile and the unmanned ship; τ represents a supplementary factor preventing d = 0 from causing a gradient explosion, typically τ = 0.05; t and s respectively represent the time elapsed since the start of the episode and the number of steps currently performed.
The unmanned ship missile interception and avoidance network model is an improved proximal policy optimization (Proximal Policy Optimization, hereinafter PPO) algorithm neural network.
The unmanned ship missile interception and avoidance network model comprises an Actor network and a Critic network; the Actor network is the policy network and the Critic network is the value network. The input of the Actor network is the state quantity s_t and its output is the action quantity a_t; the input of the Critic network is the current state quantity s_t and its output is the state value estimate of s_t.
The Actor network comprises an input layer, four linear hidden layers and an output layer: the input layer receives the state quantity s_t, the four linear hidden layers have 1024, 512, 256 and 128 nodes respectively, and the output value of the final output layer is the action quantity a_t. The Critic network comprises an input layer, two linear hidden layers and an output layer: the input layer likewise receives the state quantity s_t, the two linear hidden layers have 256 and 128 nodes, and the output layer yields the state value estimate.
S6, repeating steps S3-S5 to obtain the state quantity s_t′ at the next moment, and taking the current state quantity s_t, action a_t, reward r_st and next-moment state quantity s_t′ as a set of experience tuples (s_t, a_t, r_st, s_t′).
S7, repeatedly executing steps S3-S6 to obtain multiple sets of experience tuples and storing them in the experience buffer pool. When the unmanned ship's interceptor missile successfully intercepts the tracking missile, or the unmanned ship successfully evades it (the tracking missile explodes on an obstacle or plunges into the sea and explodes), the current episode's training ends. From the training result, the global reward function r_t(s_1, a_1, s_2, a_2, ..., s_T) is obtained; according to the global reward function and the number of steps in the episode, the local reward function r_st(s_t) of each step is updated to obtain the global-local reward function r_t(s_i), and each corresponding set of experience tuples is updated to (s_t, a_t, r_t, s_t′).
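The per-episode bookkeeping of S6-S7 can be sketched as below; because the exact formula combining the global reward with each step's local reward is not reproduced in the source, the uniform spreading used in finalize is an assumption for illustration only:

```python
class EpisodeBuffer:
    """Collects (s_t, a_t, r_st, s_t') and rewrites rewards at episode end."""

    def __init__(self):
        self.steps = []

    def add(self, s, a, r_local, s_next):
        self.steps.append((s, a, r_local, s_next))

    def finalize(self, global_reward):
        """Return tuples updated to (s_t, a_t, r_t, s_t'); uniform spread assumed."""
        T = max(len(self.steps), 1)
        return [(s, a, r + global_reward / T, s_next)
                for (s, a, r, s_next) in self.steps]
```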
The global reward function r_t(s_1, a_1, s_2, a_2, ..., s_T) is composed of an interception reward R_l and an evasion reward R_b, each defined piecewise over the episode outcome (the explicit formulas are omitted here).
in step S7, according to the global rewards function r t (s 1 ,a 1 ,s 2 ,a 2 ,...,s T ) And a local rewards function r st (s t ) Obtaining global local rewarding function r t (s i ) The method comprises the following steps:
s8, repeatedly executing S2-S7, counting the number of repeated bureaus, executing the step S9 if the number of bureaus reaches the number E required by parameter updating, temporarily clearing the count, counting again, and continuously repeatedly executing S2-S7 if the number of bureaus is smaller than E until the number of bureaus reaches E bureaus.
S9, screening n sets of experience tuples from the experience buffer pool by the prioritized experience replay algorithm, and sequentially updating the parameters of the unmanned ship missile interception and avoidance network model with the screened n sets of experience tuples.
The experience data are screened from the experience buffer pool by the prioritized experience replay algorithm according to the sampling probability P(c) = p_c^α / Σ_k p_k^α given above, wherein P(c) represents the probability that the c-th set of experience data (s_t, a_t, r_st, s_t′) is selected from the experience buffer pool; c represents the serial number of the extracted experience data; p_k represents the priority of the k-th set of experience data; and α represents a preset parameter for adjusting the degree of preferential sampling of the data samples.
The parameters of the unmanned ship missile interception and avoidance network model are updated and optimized with the PPO algorithm, in the following steps:
(1) Defining variables.
The parameters of the policy network are denoted by θ, and π_θ(a_t|s_t) denotes the probability of the policy outputting action a_t in state s_t; the parameters of the value network are denoted here by φ, and V_φ(s_t) denotes the output of the value network, estimating the value of state s_t.
(2) Defining the advantage function.
The advantage function A(s_t, a_t) denotes the advantage of executing action a_t in state s_t, computed as the difference between the return of the current state and a baseline value:

A(s_t, a_t) = Q(s_t, a_t) − V_φ(s_t),

where Q(s_t, a_t) is the cumulative return after executing action a_t, here the discounted sum

Q(s_t, a_t) = Σ_{t′=t}^{T} γ^{t′−t} · r_{t′},

and V_φ(s_t) is the value network's estimate of state s_t; t represents the current time step, T represents the number of steps in the episode, γ represents the discount factor expressing how much future rewards are discounted, and r_{t′} represents the reward obtained at time t′.
(3) The objective function.
The objective function consists of three parts: the policy network's ratio term, the value-function loss term and an entropy regularization term. As above, in the standard PPO-clip form:

L_PPO2(θ) = Ê_t[ min(ρ_t A_t, clip(ρ_t, 1−ε, 1+ε) A_t) ] − α·L_value(φ) + β·H(π_θ(·|s_t)),

where Ê_t denotes the expectation over time steps t; ρ_t = π_θ(a_t|s_t) / π_θold(a_t|s_t) is the probability ratio, and clip(ρ_t, 1−ε, 1+ε) limits the ratio to the interval [1−ε, 1+ε]; α and β are hyperparameters weighting the value-loss term and the entropy term, respectively; H(π_θ(·|s_t)) is the entropy of the policy network, used to encourage exploration.
(4) Updating the policy network.
The parameter θ of the policy network is updated by maximizing the objective function L_PPO2(θ):

θ_new = θ_old + argmax_θ L_PPO2(θ).
(5) Updating the value network.
The value network is updated by minimizing the mean square error

L(φ) = Ê_t[ (V_φ(s_t) − R_t)² ],

where R_t is the computed advantage-based return estimate.
Through mechanisms such as proximal policy optimization, GAE, the clipped surrogate loss and entropy regularization, the PPO algorithm effectively balances stability and exploration in training and improves the performance of the policy network.
S10, judging whether the total episode count reaches the preset number of cycles N; if so, ending the loop and concluding the unmanned ship missile interception and avoidance network model training to obtain the optimal network model.
Example 2
An unmanned ship missile interception and avoidance algorithm based on reinforcement learning, as shown in fig. 2, comprises the following steps:
s11, acquiring environmental information at the current moment by using sensors such as sonar, camera and radar mounted on the unmanned ship, processing and analyzing the environmental information, judging whether a tracking missile exists or not, if yes, executing the step S22, and if not, repeatedly executing the step S11.
S22, obtaining the speed and the course angle of the unmanned ship and the tracking missile; taking the center of the unmanned ship as the origin of the coordinate system, obtaining the X-Y-Z coordinates of the tracking missile at the current moment and the relative distance between the tracking missile and the unmanned ship, and calculating the elevation angle of the tracking missile relative to the unmanned ship; then obtaining the coordinates, the speed and the course angle of the interceptor missile.
S33, inputting all variables in S22 as state quantities into the unmanned ship missile interception and avoidance network model, obtaining the action quantity for controlling the unmanned ship propulsion and the interceptor missile launch, and controlling the unmanned ship's motion and the interceptor launch accordingly; the unmanned ship missile interception and avoidance network model here is the optimal network model obtained after the unmanned ship missile interception and avoidance training of the first aspect.
S44, repeating steps S11-S33 so that the unmanned ship evades the tracking missile while intercepting it, achieving the purpose of unmanned ship missile interception and avoidance.
Example 3
An unmanned ship missile interception and avoidance system based on reinforcement learning comprises a monitoring module, a communication module and a main control module.
The monitoring module is hardware equipment carrying sensors such as sonar, camera and radar, has a communication function and is used for acquiring environmental information at the current moment.
The main control module is a hardware device carrying a microprocessor with a wireless communication function, used for recording system history data, processing and analyzing environmental information, and outputting missile interception and avoidance instructions.
The monitoring module collects surrounding environmental information in real time; the microprocessor processes and analyzes it and inputs the processed data into the unmanned ship missile interception and avoidance network model, which is trained with the prioritized experience replay algorithm to obtain the optimal network model and executes the interception and avoidance algorithm according to the first aspect of the invention.
The specific flow by which the unmanned ship missile interception and avoidance system realizes the interception and avoidance algorithm of embodiments 1-2 is as follows:
(1) The monitoring module monitors and collects surrounding training environment information in real time and communicates with the main control module through the communication module; the main control module processes the environment information and judges whether a tracking missile exists.
(2) If a tracking missile exists, the information of the tracking missile is calculated with the center of the unmanned ship as the origin of the coordinate system; each variable is input as a state quantity into the unmanned ship missile interception and avoidance network model in the main control module, and the rewards of the model are set according to the training results.
(3) Training is repeated to obtain multiple groups of experience tuples, which are stored in the experience buffer pool, and the rewards of the unmanned ship missile interception and avoidance network model are modified according to the interception and avoidance training results; training is repeated until the set number of episodes is reached.
(4) Experience tuples are screened from the experience buffer pool by the prioritized experience replay algorithm, and the unmanned ship missile interception and avoidance network model is updated.
(5) The above steps are cycled until the preset number of cycles is reached, yielding the optimal network model for unmanned ship missile interception and avoidance control and realizing the goals of the unmanned ship evading the tracking missile and the interceptor missile intercepting it.
Example 4
A computer readable storage medium having stored thereon a computer program which, when executed by a computer processor, performs the steps of the reinforcement learning-based unmanned ship missile interception and avoidance algorithm described in embodiments 1-2.
The computer readable medium described herein may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
In conclusion, the invention effectively overcomes the defects in the prior art and has high industrial utilization value. The above-described embodiments are provided to illustrate the gist of the present invention, but are not intended to limit the scope of the present invention. It will be understood by those skilled in the art that various modifications and equivalent substitutions may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. An unmanned ship missile interception and avoidance algorithm based on reinforcement learning, characterized by comprising the following steps:
S11, acquiring environmental information at the current moment by using the sonar, camera and radar carried on the unmanned ship, processing and analyzing the environmental information, and judging whether a tracking missile exists; if yes, executing step S22; if not, repeatedly executing step S11;
S22, obtaining the speed and the course angle of the unmanned ship and the tracking missile; taking the center of the unmanned ship as the origin of the coordinate system, obtaining the X-Y-Z coordinates of the tracking missile at the current moment and the relative distance between the tracking missile and the unmanned ship, and calculating the elevation angle of the tracking missile relative to the unmanned ship; then obtaining the coordinates, the speed and the course angle of the interceptor missile;
S33, inputting each variable in S22 as a state quantity into the unmanned ship missile interception and avoidance network model to obtain an action quantity for controlling the unmanned ship propulsion and the interceptor missile launch; the unmanned ship missile interception and avoidance network model being the optimal network model obtained after training the reinforcement learning-based unmanned ship missile interception and avoidance algorithm;
S44, repeating steps S11-S33 so that the unmanned ship achieves the purpose of missile interception and avoidance.
2. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm of claim 1, comprising the steps of:
S1, setting up an unmanned ship missile interception and avoidance virtual environment, the virtual environment comprising a marine environment, an unmanned ship model, unmanned ship and missile motion models, a missile model and an obstacle model;
S2, randomly setting a tracking missile at one point in the virtual environment for attacking the unmanned ship;
S3, acquiring environmental information at the current moment by using the sonar, camera and radar carried on the unmanned ship, processing and analyzing the environmental information, and judging whether a tracking missile exists; if so, executing step S4; if not, repeatedly executing step S3;
S4, obtaining the speed and the course angle of the unmanned ship and the tracking missile; taking the center of the unmanned ship as the origin of the coordinate system, obtaining the X-Y-Z coordinates of the tracking missile at the current moment and the relative distance between the tracking missile and the unmanned ship, and calculating the elevation angle of the tracking missile relative to the unmanned ship; obtaining the coordinates, the speed and the course angle of the interceptor missile;
S5, taking each variable in S4 as the state quantity s_t and inputting it into the unmanned ship missile interception and avoidance network model to obtain the action quantity a_t for controlling the unmanned ship propulsion and the interceptor missile launch; meanwhile, obtaining the current local reward function r_st(s_t) based on the state quantity information; the unmanned ship missile interception and avoidance network model being an improved proximal policy optimization algorithm neural network;
S6, repeating steps S3-S5 to obtain the state quantity s_t′ at the next moment, and taking the current state quantity s_t, action a_t, reward r_st and next-moment state quantity s_t′ as a set of experience tuples (s_t, a_t, r_st, s_t′);
S7, repeatedly executing steps S3-S6 to obtain multiple sets of experience tuples and storing them in an experience buffer pool; when the unmanned ship's interceptor missile successfully intercepts the tracking missile or the unmanned ship successfully evades it, ending the current episode's training; obtaining the global reward function r_t(s_1, a_1, s_2, a_2, ..., s_T) according to the training result; updating the local reward function r_st(s_t) of each step in the episode according to the global reward function and the number of steps in the episode to obtain the global-local reward function r_t(s_i); and updating each corresponding set of experience tuples to (s_t, a_t, r_t, s_t′);
S8, repeatedly executing S2-S7 while counting the number of completed episodes; if the episode count reaches the number E required for a parameter update, executing step S9, clearing the count and counting again; if the count is smaller than E, continuing to repeat S2-S7 until E episodes are reached;
S9, screening n sets of experience tuples from the experience buffer pool by a prioritized experience replay algorithm, and sequentially updating the parameters of the unmanned ship missile interception and avoidance network model with the screened n sets of experience tuples;
S10, judging whether the total episode count reaches the preset number of cycles N; if so, ending the loop and concluding the unmanned ship missile interception and avoidance network model training to obtain the optimal network model.
3. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm according to claim 2, wherein in step S5 the state quantity s_t consists of the variables defined below and the action quantity is a_t = (v_u′, v_l′, α_u′, α_l′, B_l);
wherein (x_d, y_d, z_d) represents the coordinates of the tracking missile in the unmanned ship coordinate system; (x_l, y_l, z_l) represents the coordinates of the interceptor missile in the unmanned ship coordinate system; v_u, v_d, v_l respectively represent the speeds of the unmanned ship, the tracking missile and the interceptor missile; φ represents the elevation angle of the tracking missile relative to the unmanned ship; d represents the straight-line distance between the tracking missile and the unmanned ship; α_u, α_d, α_l respectively represent the course angles of the unmanned ship, the tracking missile and the interceptor missile; B_l is a Boolean variable that is 1 when the interceptor missile is launched and 0 otherwise, all interceptor-related parameters being 0 when the interceptor missile has not been launched.
4. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm according to claim 2, wherein in step S5 the local reward function r_st(s_t) is a weighted function of the following quantities (the explicit formula is omitted here):
wherein ω_1, ω_2, ω_3, ω_4 represent hyperparameters; d represents the straight-line distance between the tracking missile and the unmanned ship; τ represents a supplementary coefficient; t and s respectively represent the time elapsed since the start of the episode and the number of steps currently performed.
5. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm according to claim 2, wherein in step S7 the global reward function r_t(s_1, a_1, s_2, a_2, ..., s_T) is composed of an interception reward R_l and an evasion reward R_b, each defined piecewise over the episode outcome (the explicit formulas are omitted here).
6. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm according to claim 2, wherein in step S7 the global-local reward function r_t(s_i) is obtained from the global reward function and the local reward function (the explicit combination formula is omitted here).
7. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm according to claim 2, wherein in step S5 the unmanned ship missile interception and avoidance network model comprises an Actor network and a Critic network; the input of the Actor network is the state quantity s_t and its output is the action quantity a_t; the input of the Critic network is the current state quantity s_t and its output is the state value estimate of s_t.
8. The reinforcement learning-based unmanned ship missile interception and avoidance algorithm according to claim 2, wherein in step S9 the experience data are screened from the experience buffer pool by a prioritized experience replay algorithm according to P(c) = p_c^α / Σ_k p_k^α,
wherein P(c) represents the probability that the c-th set of experience data (s_t, a_t, r_st, s_t′) is selected from the experience buffer pool; c represents the serial number of the extracted experience data; p_k represents the priority of the k-th set of experience data; and α represents a preset parameter for adjusting the degree of preferential sampling of the data samples.
9. An unmanned ship missile interception and avoidance system based on reinforcement learning, characterized by comprising a monitoring module, a communication module and a main control module; the monitoring module is a hardware device carrying sonar, a camera and radar, has a communication function and is used for acquiring environmental information at the current moment; the main control module is a hardware device carrying a microprocessor, has a wireless communication function, and is used for recording system history data, processing and analyzing environmental information and outputting missile interception and avoidance instructions; the monitoring module collects surrounding environmental information in real time, the microprocessor processes and analyzes it, the processed data are input into the unmanned ship missile interception and avoidance network model, the prioritized experience replay algorithm is adopted to obtain the optimal unmanned ship missile interception and avoidance network model, and the interception and avoidance algorithm according to any one of claims 1-8 is executed.
10. A computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when executed by a computer processor, the computer program implements the reinforcement learning-based unmanned ship missile interception and avoidance algorithm of any one of claims 1-8.
CN202311681927.1A 2023-12-08 2023-12-08 Unmanned ship missile interception and avoidance algorithm based on reinforcement learning, interception and avoidance system and readable storage medium Pending CN117666589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311681927.1A CN117666589A (en) 2023-12-08 2023-12-08 Unmanned ship missile interception and avoidance algorithm based on reinforcement learning, interception and avoidance system and readable storage medium


Publications (1)

Publication Number Publication Date
CN117666589A 2024-03-08

Family

ID=90082299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311681927.1A Pending CN117666589A (en) 2023-12-08 2023-12-08 Unmanned ship missile interception and avoidance algorithm based on reinforcement learning, interception and avoidance system and readable storage medium

Country Status (1)

Country Link
CN (1) CN117666589A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination