CN115951711A

CN115951711A - Unmanned cluster multi-target searching and catching method in high sea condition environment

Info

Publication number: CN115951711A
Application number: CN202310080412.XA
Authority: CN
Inventors: 李斌; 彭思聪
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2023-02-08
Filing date: 2023-02-08
Publication date: 2023-04-11

Abstract

The invention discloses a multi-target searching and catching method for an unmanned cluster in a high sea condition environment. The unmanned aerial vehicle device is used as a cluster eye to be responsible for a target searching task, the communication unmanned ship device is used as a cluster brain to be responsible for cluster control, data processing and target distribution, meanwhile, the target pursuit unmanned ship device is pursued to execute a target pursuit task, and the cluster searching and pursuit tasks are completed through mutual cooperation of different unmanned aerial vehicles. In addition, on the basis of the cooperative execution of the cluster, the invention considers the problem of maneuvering decision of unmanned clusters in high sea condition environment. On one hand, when the traditional communication technology is limited and the positioning technology is interfered in a high sea condition environment, the unmanned equipment can accurately position the target; on the other hand, in order to strengthen the information interaction of each unmanned device in the cluster, the pursuit task of each unmanned device is ensured to be completed, and meanwhile, the global benefit is maximized.

Description

Unmanned cluster multi-target searching and catching method in high sea condition environment

Technical Field

The invention belongs to the technical field of unmanned cluster collaborative searching and pursuing, and particularly relates to an unmanned cluster multi-target searching and pursuing method in a high sea condition environment.

Background

In recent years, along with the rapid development of unmanned equipment, unmanned systems play an important role in future civilians and wars. However, in the face of complex environments, it is increasingly difficult for a single unmanned platform to achieve efficient processing of tasks. The heterogeneous unmanned system cooperation technology becomes an effective means for improving the unmanned cluster intelligence and realizing the efficient task processing. Different kinds of agents carry out work division cooperation according to the characteristics of the agents, and the task processing efficiency can be effectively improved.

The unmanned ship has the sudden and violent development in the aspects of offshore sea surface search, water area exploration and the like. However, in a high sea state environment, the unmanned ship is difficult to accurately acquire the surrounding sea area environment and target information in a bumpy state, and the unmanned plane can utilize the advantage of flying in the air to ensure the search of a complex and variable environment. However, the unmanned aerial vehicle also has the disadvantages of poor flight endurance and small load. Therefore, two unmanned devices can be combined, the unmanned device is used as a cluster eye to be responsible for a target search task, the unmanned boat device is used as a cluster brain to be responsible for cluster control, data processing and target distribution, and meanwhile, the unmanned boat executes a target pursuit task. At present, the existing collaborative search and pursuit technology has little research on cross-domain collaboration of a heterogeneous unmanned system, most collaborative tasks only aim at unmanned aerial vehicle clusters or unmanned ship clusters, and in a high sea state environment, the traditional positioning technology is difficult to ensure that accurate positioning of cluster equipment to a target is realized in a complex and variable sea area. In addition, most of the existing research on cooperative search or pursuit of heterogeneous unmanned systems adopts a centralized algorithm, namely, a central server is used for distributing tasks for all members in a cluster, and the mode is not favorable for the robustness and high adaptability to the environment of the unmanned cluster.

Disclosure of Invention

In order to solve the technical problems mentioned in the background art, the invention provides an unmanned cluster multi-target searching and tracking method in a high sea state environment. The unmanned aerial vehicle device is used as a cluster eye to be responsible for a target searching task, the communication unmanned ship device is used as a cluster brain to be responsible for cluster control, data processing and target distribution, meanwhile, the pursuit unmanned ship device executes a target pursuit task, and the searching and pursuit tasks of the cluster are completed through mutual cooperation of different unmanned aerial vehicles. In addition, on the basis of the cooperative execution of the cluster, the invention considers the problem of maneuvering decision of unmanned clusters in high sea condition environment. On one hand, when the traditional communication technology is limited and the positioning technology is interfered in a high sea condition environment, the unmanned equipment can accurately position the target; on the other hand, in order to enhance the information interaction of each unmanned device in the cluster, ensure that each unmanned device completes the pursuit task and simultaneously realize the maximization of the global benefit, the unmanned cluster training method based on the distributed multi-agent reinforcement learning is provided.

In order to achieve the technical purpose, the technical scheme of the invention is as follows:

an unmanned cluster multi-target searching and catching method under a high sea condition environment comprises the following steps:

s1, discretizing a sea area to be searched by using a grid form, and modeling a search environment by using a grid method;

s2, regarding the unmanned aerial vehicles as particles moving on an aerial two-dimensional plane, optimizing a search path for each unmanned aerial vehicle based on a collaborative coverage search algorithm of an environmental stimulus function, optimizing a next optimal track point of the unmanned aerial vehicle according to state information of the unmanned aerial vehicle and the environmental stimulus function, updating a motion state of the unmanned aerial vehicle and moving to a corresponding position according to the optimal track point, searching grids in a range of the unmanned aerial vehicle in each time step, and sending perception information to the unmanned communication boat;

s3, tracking the target after the unmanned aerial vehicle searches the target, measuring and calculating the relative distance, the relative distance change rate and the relative speed between the unmanned aerial vehicle and the target in each time step, calculating the relative positioning estimation of the unmanned aerial vehicle and the target, and further calculating the relative positioning estimation of the pursuit unmanned ship and the target according to the positioning estimation of the unmanned aerial vehicle and the target;

s4, the unmanned aerial vehicle records target state information and transmits the target state information to the communication unmanned ship, a target assignment matrix is constructed, and unmanned ship target task allocation is carried out according to the current target state information and the existing unmanned ship pursuit state information;

s5, establishing a pursuit unmanned ship pursuit decision model and a decision learning model on the basis of the allocation of a target task to the pursuit unmanned ship; and after the pursuit unmanned boat is completed, the pursuit system is executed to pursue the pursuit task.

Preferably, the whole environment is regarded as a planar rectangular area in step S1, and the areaThe domain is divided into L _x ×L _y Discrete Grid, grid _(x,y) With the x-th row and y-th column grids representing rectangles, Δ x and Δ y represent the length and width of the unit grid, respectively, the entire search environment E can be formulated as follows:

E＝{Grid _(x,y) |m＝1,2,…,L _x ,n＝1,2,…,L _y }

at time t, grid _(x,y) The state of (c) is represented as:

s _(x,y) (t)＝[μ _(x,y) ,ζ _(x,y) ,η _(x,y) (t),c _(x,y) ]

in the formula, mu _(x,y) Represents Grid _(x,y) Coordinate of the center point, ζ _(x,y) E {0,1} represents Grid _(x,y) Whether or not a target is present, wherein _(x,y) =1 denotes Grid _(x,y) Zeta internal existence of search target _(x,y) =0 means Grid _(x,y) Absence of target therein, η _(x,y) (t) is an element of {0,1,2, \8230;, h } represents Grid _(x,y) Number of searches until time t, c _(x,y) For the search of the stimulus function, grid is represented _(x,y) The attraction degree to the unmanned aerial vehicle.

Preferably, in the step S2, based on the collaborative coverage search algorithm of the environmental stimulation function, the specific steps of optimizing the search path for each unmanned aerial vehicle are as follows:

s21, initializing the position and the state of the unmanned aerial vehicle, wherein the state information formula of the unmanned aerial vehicle i at the moment t is expressed as follows:

s _i (t)＝[λ _i (t),o _i (t)]

in the formula of lambda _i (t)＝(x _i (t),y _i (t)) position coordinates, o, of drone i in environment E at time t _i (t) represents the heading angle of drone i at time t;

s22, calculating a stimulation function c of each grid _(x,y) In the cluster search process, the stimulus function c _(x,y) The following calculation method is adopted for updating:

in the formula, c _(x,y) (0) Is Grid _(x,y) Initial stimulation value, alpha epsilon (0, 1) is attenuation coefficient, when Grid _(x,y) The more times of being searched, the smaller the search stimulus value;

drone i will select the grid with the largest search stimulus value as the next search point within the adjacent grid, the formula is as follows:

after the unmanned aerial vehicle i searches for the target, the unmanned aerial vehicle records and calculates a target state, and sends the target state to the communication unmanned ship in the same communication group, wherein the target state is expressed by a formula:

s _target,j (t)＝[λ _j (t),v _j (t),θ _j,i (t)]

in the formula, λ _j (t)＝(x _j (t),y _j (t)) represents the position coordinates of the target j in the environment E at time t, v _j (t) represents the target j speed, θ _j,i (t) represents the deviation angle of target j with respect to drone i.

Preferably, in step S3, after the unmanned aerial vehicle searches for the target, the unmanned aerial vehicle measures the speed by using ultra-wideband distance measurement and visual odometer, and measures the relative distance between the unmanned aerial vehicle and the unmanned surface vehicle at each time step in real time

And relative speed>

According to the measurement data, given the t time step, the relative positioning estimation formula between the unmanned aerial vehicle i and the target j is expressed as follows:

in the formula (I), the compound is shown in the specification,

represents the rate of change of relative distance, ε ^t 、/>

And &>

Respectively are relative speed>

Relative distance->

And the rate of change of relative distance->

The measurement errors at the t-th time step respectively; t is sampling period of the ultra-wideband sensor, and gamma belongs to R ⁺ Is a tunable constant gain;

obtaining the relative distance between the pursuit unmanned ship k and the unmanned aerial vehicle i according to the state information of the pursuit unmanned ship and the unmanned aerial vehicle given in the cluster

Relative speed>

And the rate of change of relative distance->

And then calculating the relative distance between each unmanned boat for pursuit and the target>

Relative speed->

And the rate of change of relative distance->

Calculating a relative positioning estimation formula of the pursuit unmanned ship k and the target j under the same time step length, wherein the relative positioning estimation formula is expressed as follows:

preferably, in step S4, a total of l unmanned boats for pursuit are pursued on p targets, where l ≧ p, and a target allocation matrix a = [ a ] =isset _ij ]When a is _ij =1, the target j is assigned to the pursuit unmanned boat i when a _ij =0, this means that the target j is not assigned to the pursuit unmanned ship i, and in the target assignment at least one pursuit unmanned ship should be assigned to each target, that is to say

Furthermore, all pursuit unmanned boats should eventually be subjected to a pursuit task, i.e. </or>

Establishing a target distribution model for a distribution target by minimizing the initial relative distance between the unmanned ship and the target, wherein the target distribution model is expressed as follows:

a _ij ∈{0,1}

in the formula (I), the compound is shown in the specification,

and the initial relative distance between the unmanned boat and the target is shown, the matching degree of each unmanned boat with the target is calculated by each unmanned boat, and the unmanned boat with the highest matching degree is subjected to the pursuit task.

Preferably, a target pursuit model for pursuing the unmanned ship is established in step S5, and the model is represented by tuples as follows:

where S represents the current chasing state space, which can be shared by all devices in the cluster, and A _i Represents the action space of the pursuit unmanned boat i, T: sxA ^l → S represents the deterministic transfer function of the environment, R _i :

Representing a reward function for pursuing the unmanned ship;

the global reward value of the pursuit unmanned ship formation is defined as the average value of the reward values of all the pursuit unmanned ships, and the formula is expressed as follows:

in the formula, r _t (s, a) represents the reward value obtained by chasing the unmanned boat formation at the time t under the state s;

the maximum strategy formulation is expressed as follows:

in which s' ≡ s ^t+1 Represents the state at time t + 1;

setting the reward value of each pursuing unmanned boat, wherein the formula is expressed as follows:

r _i (s,a)＝r _cap +r _help +r _step

in the formula, r _cap Represents a capturing reward obtained when the distance between the pursuit unmanned ship and the target is less than the pursuit distance, i.e., the pursuit unmanned ship pursues the target, r _help Indicating that when a plurality of unmanned boats for pursuing are available for pursuing the same target, the target is rewarded with assistance after being caught, r _step Expressed as a step size reward, r _step The formula is expressed as follows:

r _step ＝ω ₁ r ₁ +ω ₂ r ₂

ω ₁ +ω ₂ ＝1

in the formula, r ₁ Reward for pursuing distance r ₂ Awarding for collisions;

pursuit distance reward r ₁ The formula is expressed as follows:

in the formula (I), the compound is shown in the specification,

for remaining catch distance, k ₁ Is awarded r ₁ Adjusting the coefficient;

collision reward r ₂ The formula is expressed as follows:

in the formula (d) _min ＝mind _i,j Denotes the minimum travel distance between unmanned boats, r ₂ ∈(-1,0]，k ₂ Is awarded as r ₂ Adjusting the coefficient;

preferably, a multi-unmanned-boat pursuit maneuvering decision model is established in the step S5, an Actor-Critic structure is adopted, an Actor network and a Critic network of each pursuit unmanned boat are connected through a bidirectional recurrent neural network, a hidden layer in the Actor network and the Critic network of a single pursuit unmanned boat decision model is used as recurrent units of the bidirectional recurrent neural network, and the pursuit unmanned boats are expanded according to the number of the pursuit unmanned boats; wherein the content of the first and second substances,

the formula of the individual target function of the unmanned boat is expressed as follows:

in the formula (I), the compound is shown in the specification,

indicating that action a is taken under the state transition function T _θ The resulting status profile, <' >>

As desired;

the formula of the target function of the pursuit unmanned ship formation is expressed as follows:

the gradient formula of the policy network parameter θ is expressed as follows:

using parameterized critical functions Q ^ξ (s, a) to estimate the state-action function of the above equation

And training Critic using a sum of squares penalty function, Q ^ξ The gradient formula of (s, a) is expressed as follows: />

In the formula, xi is a Q network parameter;

and optimizing the Actor network and the Critic network by adopting a random gradient descent method, and updating network parameters through data obtained by trial and error in the interactive learning process to finish the optimization of collaborative search pursuit.

Preferably, the training and learning process of the multi-unmanned-boat cooperative target pursuit decision model comprises the following steps:

s51, initializing online network parameters of Actor and Critic, and allocating the online network parameters to corresponding target network parameters, namely theta '← theta and xi' ← xi, wherein theta 'and xi' are the target parameters of Actor and Critic respectively, and initializing empirical playback space

Storing the data obtained in the exploration;

s52, determining an initial state of training, and setting an initial position state and a speed state of a pursuing unmanned ship formation and a target;

s53, repeating the multi-set training according to the initial state, and simulating to execute the following operations:

each chasing unmanned boat is based on state s ^t And a random process epsilon generates an action

And executing;

after all actions have been performed, the state transitions to s ^t+1 Calculating the prize value

And will pass on the process variable

Store to empirical playback space >>

During learning, a batch of M pieces of experience data are randomly extracted

To calculate the target Q value of each pursuit unmanned boat, the formula is expressed as follows:

a gradient estimate for Critic is calculated, formulated as follows:

and updating online network parameters of Actor and Critic according to the obtained gradient estimation delta xi and delta theta, and then updating target network parameters, wherein a formula is expressed as follows:

wherein k ∈ (0, 1).

Adopt the beneficial effect that above-mentioned technical scheme brought:

(1) The invention uses the rasterization method to model the search environment of the unmanned cluster, thereby facilitating the description of the environment information and reducing the calculated amount;

(2) The invention designs a collaborative coverage search algorithm based on an environmental stimulus function, integrates the position of a target which possibly appears and the current state of the unmanned aerial vehicle, and optimizes the search path of the unmanned aerial vehicle;

(3) The method adopts a relative positioning method with persistent reward to measure the relative position and relative speed between the unmanned equipment and the target in real time, does not depend on external infrastructure, ensures accurate positioning in a rejection environment, can cope with the interference of a traditional positioning system in a high sea condition environment, and ensures accurate positioning of the unmanned aerial vehicle and the unmanned boat on the target;

(4) The unmanned cluster collaborative communication model is established, the unmanned cluster is divided into a plurality of communication groups, the unmanned cluster collaborative search target is realized on the sea surface lacking communication infrastructure and resources, the collision of the unmanned aerial vehicle or the unmanned ship is avoided, and the unmanned aerial vehicle searching group can quickly transmit the self state information, the environment information and the target state information to the base station deployed on the unmanned ship in the target searching process;

(5) In the unmanned cluster collaborative searching and pursuing process, the unmanned ship target tasks are distributed while the unmanned aerial vehicles in the cluster are searched, the task harmony between the target searching and the target pursuing is realized, and a target task distribution method is designed;

(6) According to the unmanned ship cooperative pursuit method, unmanned ship individual learning behaviors are organized into unmanned ship cluster group cooperation through a coordination mechanism, a distributed multi-agent reinforcement learning method based on equipment communication is designed, the pursuit task of each unmanned ship is guaranteed to be completed, the overall benefit maximization of cluster pursuit is realized, the high efficiency of unmanned cluster cooperative pursuit decision is realized under the complex and changeable high sea condition environment, and the stability and reliability of the unmanned cluster on target pursuit are guaranteed;

drawings

FIG. 1 is a model of the unmanned cluster based collaborative search and pursuit system of the present invention, the system including an unmanned cluster, a communication unmanned boat group, a pursuit unmanned boat group and an object to be searched;

FIG. 2 is a flow chart of the present invention;

fig. 3 is an unmanned boat pursuit maneuver model based on a bidirectional recurrent neural network.

Detailed Description

The technical scheme of the invention is explained in detail in the following with the accompanying drawings.

Fig. 1 shows a system model based on unmanned cluster collaborative search and pursuit, which includes an unmanned cluster, a communication unmanned ship group, a pursuit unmanned ship group and a target to be searched. The embodiment provides a method for searching and tracking multiple targets of an unmanned cluster in a high-sea-condition-oriented environment, a specific flow is shown in fig. 2, and the specific implementation method of the method for searching and tracking multiple targets of the unmanned cluster in the high-sea-condition-oriented environment is as follows:

1. the search environment is modeled using a grid method. The whole environment is regarded as a planar rectangular area, and the area is divided into L _x ×L _y A discrete grid. Grid _(x,y) By the x-th row and y-th column grid representing a rectangle, Δ x and Δ y represent the length and width of the unit grid, respectivelyAnd (4) degree. The entire search environment E may be represented by a grid set as:

E＝{Grid _(x,y) |m＝1,2,…,L _x ,n＝1,2,…,L _y }

at time t, grid _(x,y) The state of (a) is represented as:

s _(x,y) (t)＝[μ _(x,y) ,ζ _(x,y) ,η _(x,y) (t),c _(x,y) ]

wherein mu _(x,y) Represents Grid _(x,y) Coordinate of the center point, ζ _(x,y) E {0,1} represents Grid _(x,y) Whether or not a target is present, wherein _(x,y) =1 denotes Grid _(x,y) ζ is the search target _(x,y) =0 means Grid _(x,y) Absence of target therein, η _(x,y) (t) ∈ {0,1,2, \8230;, h } denotes Grid _(x,y) Number of searches until time t, c _(x,y) For the search of the stimulus function, grid is represented _(x,y) The attraction degree to the unmanned aerial vehicle.

2. Initializing the number of devices in the unmanned cluster, wherein the number of the devices comprises m unmanned planes, n communication unmanned boats, l pursuit unmanned boats and p targets. In the unmanned cluster, unmanned aerial vehicles and communication unmanned boats form communication groups, each communication group comprises one communication unmanned boat and a plurality of unmanned aerial vehicles, and communication of unmanned aerial vehicles in the group is divided into communication between A2S (UAV to UAV) and communication between A2A (UAV to UAV) according to difference of communication parties. In the packet backhaul uplink, drones share uplink spectrum resources and reuse resource blocks within the group.

3. Designing a collaborative coverage search algorithm based on an environmental stimulus function to optimize a search path of each unmanned aerial vehicle, wherein the method comprises the following specific steps:

1) Initializing the position and state of the drone, wherein the state information of drone i at time t may be represented as:

s _i (t)＝[λ _i (t),o _i (t)]

wherein λ is _i (t)＝(x _i (t),y _i (t)) represents the position coordinates, o, of drone i in environment E at time t _i (t) represents the heading angle of drone i at time t.

2) Calculating a stimulation function c for each grid _(x,y) In the cluster search process, the stimulus function c _(x,y) The following calculation method is adopted for updating:

wherein c is _(x,y) (0) Is Grid _(x,y) Initial stimulation value, alpha epsilon (0, 1) is attenuation coefficient, when Grid _(x,y) The more times of being searched, the smaller the search stimulus value.

Drone i will select the grid with the largest search stimulus value within the adjacent grid as the next search point, i.e. the next search point

Under the action of the environment search stimulation function, the unmanned aerial vehicle always tends to move to the grid with larger stimulation values, namely the grid which is not searched is more likely to be selected and searched by the unmanned aerial vehicle, and the unmanned aerial vehicle can effectively avoid the area which is repeatedly searched by the unmanned aerial vehicle. Therefore, the cluster higher search coverage rate and efficiency can be ensured.

After the unmanned aerial vehicle i searches for the target, the unmanned aerial vehicle records and calculates a target state, and sends the target state to the communication unmanned ship in the same communication group, wherein the target state can be expressed as:

s _target,j (t)＝[λ _j (t),ν _j (t),θ _j,i (t)]

wherein λ is _j (t)＝(x _j (t),y _j (t)) represents the position coordinates of the target j in the environment E at time t, ν _j (t) represents the target j speed, θ _j,i (t) represents the deflection angle of target j relative to drone i.

4. After the unmanned aerial vehicle searches for the target, the unmanned aerial vehicle measures the speed by adopting ultra wide band ranging and visual odometer, and measures the relative distance between the unmanned aerial vehicle and the unmanned ship under each time step in real time

And relative speed->

According to the measurement data, giving a relative positioning estimation between the unmanned aerial vehicle i and the target j under the t time step:

wherein, the first and the second end of the pipe are connected with each other,

represents the rate of change of relative distance, ε ^t 、/>

And &>

Respectively is the relative speed->

Relative distance->

Relative distance change rate>

The measurement error at the t time step; t is the sampling period of the ultra-wideband sensor, and gamma belongs to R ⁺ Is a tunable constant gain.

Further, according to the state information of the pursuit unmanned ship and the unmanned aerial vehicle given in the cluster, the relative distance between the pursuit unmanned ship k and the unmanned aerial vehicle i is obtained

Relative speed>

And the rate of change of relative distance->

And then the relative distance between each pursuit unmanned boat and the target is calculated>

Relative speed->

And relative distance change rate>

Further, calculating the relative positioning estimation of the pursuit unmanned ship k and the target j under the same time step:

5. in target distribution, a total of l pursuit unmanned boats pursue p targets, wherein l is larger than or equal to p. Setting a target allocation matrix a = [ a ] _ij ]When a is _ij =1, indicates that the target j is assigned to the pursuit unmanned boat i, when a _ij If =0, it means that the target j is not assigned to the pursuit unmanned ship i. In object allocation, each object should be allocated at least one pursuit unmanned boat, i.e.

Establishing a target distribution model by using the minimization of the initial relative distance between the unmanned ship and the target as a distribution target as follows:

a _ij ∈{0,1}

wherein the content of the first and second substances,

6. The invention designs a distributed multi-agent reinforcement learning method based on equipment communication, which realizes a maneuvering decision for multi-unmanned boat cooperative pursuit, and comprises the following specific contents:

(1) Policy coordination mechanism

In the target pursuit process of the unmanned ships, each pursuit unmanned ship makes a maneuvering decision according to the situation of the unmanned ship in a high sea condition environment, the pursuit of the multi-unmanned ship system can be regarded as a competition game between the unmanned ships and the targets, and an unmanned ship target pursuit model is established and is represented by tuples:

where S represents the current chased state space, which can be shared by all devices in the cluster, A _i Represents the action space of the pursuit unmanned boat i, T: sxA ^l → S denotes the deterministic transfer function of the environment, R _i :

Representing a reward function for pursuing the unmanned boat.

The global reward value of the pursuit unmanned boat formation is defined as the average value of all the pursuit unmanned boat reward values, and can be expressed as:

wherein r is _t And (s, a) represents the reward value obtained by chasing the unmanned boat formation at the time t under the state s.

The goal of pursuing unmanned boat formation is to learn a strategy that maximizes the expected value of the discount reward, i.e. the

Wherein 0 < lambda < 1 is a discount factor.

In summary, the following great strategies can be obtained:

wherein s' ≡ s ^t+1 The state indicating the time is determined by the state transition function T (s, a).

To reflect the role of each pursuit unmanned boat individual in the cooperative pursuit, a reward value is set for each pursuit unmanned boat, which can be expressed as:

r _i (s,a)＝r _cap +r _help +r _step

wherein r is _cap Represents a capturing reward obtained when the distance between the pursuit unmanned ship and the target is less than the pursuit distance, i.e., the pursuit unmanned ship pursues the target, r _help Indicating that when the same target has a plurality of unmanned boats for pursuing to pursue the task, after the target is caught, assistance rewards can be obtained, r _step Denoted as a step prize, is composed of a number of sub-prize weights:

r _step ＝ω ₁ r ₁ +ω ₂ r ₂

ω ₁ +ω ₂ ＝1

wherein r is ₁ And r ₂ Is defined as follows:

1. pursuit distance reward r ₁

At each time step, the pursuit unmanned ship receives a negative return r ₁ Distance from remaining pursuit

Linear relationship, k ₁ To adjust the coefficients.

2. Collision reward r ₂

Wherein d is _min ＝mind _i,j Denotes the minimum travel distance between unmanned boats, r ₂ ∈(-1,0]，k ₂ To adjust the coefficients.

For m unmanned boat pursuits, there are m Bellman equations, i.e.

In the training process of reinforcement learning, feedback of each defined pursuit unmanned ship in the aspects of target allocation, collision avoidance and the like is given through allocation of reward values. Decision coordination can be realized after training of the pursuit unmanned ship, so that behaviors of the pursuit unmanned ships reach tacit conditions.

(2) Decision learning mechanism

And establishing a multi-unmanned-boat pursuit maneuver decision model to ensure information interaction among unmanned boats and realize the cooperation of cluster maneuvers. The model adopts an Actor-Critic structure, the Actor network and the Critic network of each pursuit unmanned ship are connected through a bidirectional recurrent neural network, specifically, as shown in fig. 3, hidden layers in a strategy network (Actor) and a Q network (Critic) of a single pursuit unmanned ship decision model are used as recurrent units of the bidirectional recurrent neural network, and the strategy network and the Q network are expanded according to the number of pursuit unmanned ships.

The individual objective function of pursuing unmanned boats may be defined as:

wherein

Indicating that action a is taken under the state transition function T _θ The resulting state distribution.

The objective function of pursuing the formation of unmanned boats is expressed as:

according to the multi-subject deterministic policy gradient theorem, the gradient of the policy network parameter θ is:

And training Critic using a sum of squares loss function. Q ^ξ The gradient of (s, a) can be expressed as: />

Where ξ is the Q network parameter.

And optimizing the Actor network and the Critic network by adopting a random gradient descent method, and updating network parameters through data obtained by trial and error in the interactive learning process to complete the optimization of collaborative search pursuit.

(3) Training and learning process of multi-unmanned-boat cooperative target pursuit decision model

a. Initializing the online network parameters of Actor and Critic and allocating the online network parameters to corresponding target network parameters, namely theta ' ← theta and xi, wherein theta ' and xi ' are the target parameters of Actor and Critic, respectively. Initializing an empirical playback space

Storing the data obtained in the exploration;

b. determining an initial state of training, and setting an initial position state and a speed state of a pursuing unmanned ship formation and a target;

c. and repeating the multi-set training according to the initial state, wherein each set of trapping simulation executes the following operations:

And executing;

And will pass on the process variable

Store to empirical playback space >>

During learning, a batch of M pieces of experience data are randomly extracted

To calculate a target Q-value for each of the pursuit drones, i.e.

Calculating a gradient estimate of Critic, i.e.

Calculating an estimate of the gradient of the Actor, i.e.

According to the obtained gradient estimation delta xi and delta theta, updating the online network parameters of Actor and Critic, and then updating the target network parameters, namely updating the target network parameters

Where k ∈ (0, 1).

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiment of the application can be implemented by adopting various computer languages, such as object-oriented programming language Java and transliterated scripting language JavaScript.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A multi-target searching and catching method for an unmanned cluster in a high sea condition environment is characterized by comprising the following steps:

s5, establishing a pursuit unmanned ship pursuit decision model and a decision learning model on the basis of the allocation of a target task to the pursuit unmanned ship; and after the pursuit unmanned boat is completed, the pursuit task of the system is executed.

2. The unmanned clustered multi-objective searching and tracking method under high sea state environment as claimed in claim 1, wherein the whole environment is regarded as a planar rectangular area in step S1, and the area is divided into L _x ×L _y Discrete Grid, grid _(x，y) Using the x-th row and y-th column grids representing rectangles, Δ x and Δ y represent the length and width of the unit grid, respectively, the whole search environment E is expressed by the grid set formula as follows:

E＝{Grid _(x，y) |m＝1，，2，…，L _x ，n＝1，，2，…，L _y }

at time t, grid _(x，y) The state of (c) is represented as:

s _(x，y) (t)＝[μ _(x，y) ，ζ _(x，y) ，η _(x，y) (t)，c _(x，y) ]

in the formula, mu _(x，y) Represents Grid _(x，y) Coordinate of center point, ζ _(x，y) E {0,1} represents Grid _(x，y) Whether or not a target is present, wherein _(x，y) =1 denotes Grid _(x，y) ζ is the search target _(x，y) =0 means Grid _(x，y) Absence of target therein, η _(x，y) (t) ∈ {0,1,2, \8230;, h } denotes Grid _(x，y) Number of searches until time t, c _(x，y) For the search of the stimulus function, grid is represented _(x，y) The attraction degree to the unmanned aerial vehicle.

3. The unmanned cluster multi-target searching and catching method under the high sea condition environment according to claim 1, wherein the specific steps of optimizing the search path for each unmanned aerial vehicle based on the collaborative coverage search algorithm of the environmental stimulus function in step S2 are as follows:

s _i (t)＝[λ _i (t)，o _i (t)]

in the formula of lambda _i (t)＝(x _i (t)，y _i (t)) position coordinates, o, of drone i in environment E at time t _i (t) represents the heading angle of drone i at time t;

s22, calculating a stimulation function c of each grid _(x，y) In the cluster search process, the stimulus function c _(x，y) The following calculation method is adopted for updating:

in the formula, c _(x，y) (0) Is Grid _(x，y) Initial stimulation value, alpha epsilon (0, 1) is attenuation coefficient, when Grid _(x，y) The more times of being searched, the smaller the search stimulus value;

s _target，j (t)＝[λ _j (t)，ν _j (t)，θ _j，i (t)]

in the formula, λ _j (t)＝(x _j (t)，y _j (t)) represents the position coordinates of the target j in the environment E at time t, v _j (t) represents the target j speed, θ _j，i (t) represents the deviation angle of target j with respect to drone i.

4. The unmanned cluster multi-target searching and catching method under the high sea condition environment of claim 1, wherein in step S3, after the unmanned aerial vehicle searches for the target, the unmanned aerial vehicle measures the speed by using ultra-wideband distance measurement and visual odometer, and measures the relative distance between the unmanned aerial vehicle and the unmanned boat at each time step in real time

And relative speed->

in the formula (I), the compound is shown in the specification,

represents the rate of change of relative distance, ε ^t 、/>

And &>

Respectively is the relative speed->

Relative distance->

And the rate of change of relative distance->

The measurement errors at the t-th time step respectively; t is the sampling period of the ultra-wideband sensor, and gamma belongs to R ⁺ Is a tunable constant gain;

Relative speed->

And relative distance change rate>

Relative speed->

And the rate of change of relative distance->

5. the unmanned cluster multi-target searching and pursuing method in the high sea condition environment as claimed in claim 1, wherein in step S4, a total of l pursuit unmanned boats pursue p targets, wherein l is larger than or equal to p, and a target distribution matrix a = [ a ] = is set _ij ]When a is _ij =1, the target j is assigned to the pursuit unmanned boat i when a _ij By =0, it is meant that the target j is not assigned to the pursuit unmanned boat i, and in the target assignment, each target should be assigned at least one pursuit unmanned boat, that is, at least one pursuit unmanned boat is assigned

Furthermore, all pursuit unmanned boats should eventually be subjected to the pursuit task, i.e. <' > in & -5 & ->

a _ij ∈{0，1}

in the formula (I), the compound is shown in the specification,

and the initial relative distance between the pursuit unmanned boat and the target is shown, the matching degree of each pursuit unmanned boat with the target is calculated, and the unmanned boat with the highest matching degree is subjected to pursuit tasks. />

6. The unmanned cluster multi-target searching and pursuit method under the high sea condition environment as claimed in claim 1, wherein a pursuit unmanned ship target pursuit model is established in step S5, and the model is represented by tuples as follows:

where S represents the current chasing state space, which can be shared by all devices in the cluster, and A _i Represents an action space for chasing the unmanned boat i, T: s X A ^l → S denotes the deterministic transfer function of the environment, R _i ：

Representing a reward function for pursuing the unmanned ship;

the maximum strategy formulation is as follows:

in which s' ≡ s ^t+1 Represents the state at time t + 1;

r _i (s，a)＝r _cap +r _help +r _step

in the formula, r _cap Represents a capturing reward obtained when the distance between the pursuit unmanned ship and the target is less than the pursuit distance, i.e., the pursuit unmanned ship pursues the target, r _help Indicating that when a plurality of unmanned boats for pursuing are arranged at the same target for pursuing, the assisting reward is obtained after the target is caught, r _step Expressed as a stride reward, r _step The formula is expressed as follows:

r _step ＝ω ₁ r ₁ +ω ₂ r ₂

ω ₁ +ω ₂ ＝1

pursuit distance reward r ₁ The formula is expressed as follows:

in the formula (I), the compound is shown in the specification,

for remaining catch distance, k ₁ Is awarded as r ₁ Adjusting the coefficient;

collision reward r ₂ The formula is expressed as follows:

in the formula (d) _min ＝mind _i，j Denotes the minimum travel distance between unmanned boats, r ₂ ∈(-l，0]，k ₂ Is awarded as r ₂ And adjusting the coefficient.

7. The unmanned cluster multi-target searching and chasing method under the high sea condition environment according to claim 1, characterized in that a multi-unmanned boat chasing maneuver decision model is established in step S5, an Actor-Critic structure is adopted, an Actor network and a Critic network of each chasing unmanned boat are connected through a bidirectional recurrent neural network, a hidden layer in the Actor network and the Critic network of a single chasing unmanned boat decision model is used as a recurrent unit of the bidirectional recurrent neural network, and the chasing unmanned boats are expanded according to the number of the chasing unmanned boats;

wherein the content of the first and second substances,

in the formula (I), the compound is shown in the specification,

As desired; />

And training Critic using a sum of squares penalty function, Q ^ξ The gradient formula of (s, a) is expressed as follows:

in the formula, xi is a Q network parameter;

8. The unmanned cluster multi-target searching and pursuing method in the high sea condition environment as claimed in claim 7, wherein the training and learning process of the multi-unmanned-boat cooperative target pursuit decision model comprises the following steps:

Storing the data obtained in the exploration;

each unmanned catch boat generates one based on state st and random process epsilonMovement of

And executing;

And will pass on the process variable

Store to empirical playback space >>

During learning, a batch of M pieces of experience data are randomly extracted

a gradient estimate for Critic is calculated, formulated as follows:

updating online network parameters of Actor and Critic according to the obtained gradient estimation delta xi and delta theta, and then updating target network parameters, wherein a formula is expressed as follows:

wherein k ∈ (0, 1).