CN114625167A - Unmanned aerial vehicle collaborative search method and system based on heuristic Q-learning algorithm - Google Patents

Unmanned aerial vehicle collaborative search method and system based on heuristic Q-learning algorithm

Info

Publication number
CN114625167A
CN114625167A (application CN202210281173.XA)
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
search
heuristic
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210281173.XA
Other languages
Chinese (zh)
Inventor
王怀震
王小谦
高明
王建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Original Assignee
Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong New Generation Information Industry Technology Research Institute Co Ltd filed Critical Shandong New Generation Information Industry Technology Research Institute Co Ltd
Priority to CN202210281173.XA priority Critical patent/CN114625167A/en
Publication of CN114625167A publication Critical patent/CN114625167A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10 - Simultaneous control of position or course in three dimensions
    • G05D1/101 - Simultaneous control of position or course in three dimensions specially adapted for aircraft

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses an unmanned aerial vehicle collaborative search method and system based on a heuristic Q-learning algorithm, belonging to the field of unmanned aerial vehicle collaborative search and aiming at solving the technical problems of low efficiency and low reliability of existing unmanned aerial vehicle search methods. The technical scheme comprises the following specific steps: S1, establishing an unmanned aerial vehicle dynamics model and a search environment perception map; S2, establishing an unmanned aerial vehicle search reward and heuristic reward mechanism: designing an effective unmanned aerial vehicle search reward mechanism based on the unmanned aerial vehicle dynamics model, the environment perception map and the unmanned aerial vehicle state established in step S1; S3, the unmanned aerial vehicle adopts a heuristic Q-learning algorithm, explores the search area by using the unmanned aerial vehicle search reward mechanism established in step S2, plans the search path and realizes the optimal search of the unmanned aerial vehicle; S4, the unmanned aerial vehicle updates the search environment perception map according to the search result of step S3; S5, fusing attribute values of the multi-unmanned-aerial-vehicle environment perception maps; and S6, repeating steps S3-S5 until all targets are searched.

Description

Unmanned aerial vehicle collaborative search method and system based on heuristic Q-learning algorithm
Technical Field
The invention relates to the field of unmanned aerial vehicle collaborative search, in particular to an unmanned aerial vehicle collaborative search method and system based on a heuristic Q-learning algorithm.
Background
In recent years, unmanned aerial vehicles have been widely used in fields such as post-disaster rescue, target strike and military confrontation. Target search is an important application of unmanned aerial vehicles: the unmanned aerial vehicle flies above a search sub-region and explores the region with its airborne sensors, and a well-designed search strategy enables the unmanned aerial vehicle to capture as many moving targets in the region as possible. Therefore, in order to improve unmanned aerial vehicle search efficiency, various strategies have been proposed; among them, classical receding horizon (rolling time domain) planning generates an optimal search path online and is widely applied in the unmanned aerial vehicle search process.
Classical receding horizon planning screens only the extreme value of the predicted path and does not consider the value of adjacent grids. When the prediction horizon is too long, the single-step planning time becomes too long and the real-time requirement cannot be met. In addition, it depends on various numerical or heuristic solvers, which often require the objective function to be continuous or differentiable, so it suffers from low reliability in use.
Disclosure of Invention
The invention provides an unmanned aerial vehicle collaborative searching method and system based on a heuristic Q-learning algorithm, and aims to solve the problems of low efficiency and low reliability of the existing unmanned aerial vehicle searching method.
The technical task of the invention is realized in the following way, and the unmanned aerial vehicle collaborative searching method based on the heuristic Q-learning algorithm specifically comprises the following steps:
S1, establishing an unmanned aerial vehicle dynamics model and a search environment perception map;
S2, establishing an unmanned aerial vehicle search reward and heuristic reward mechanism: designing an effective unmanned aerial vehicle search reward mechanism based on the unmanned aerial vehicle dynamics model, the environment perception map and the unmanned aerial vehicle state established in step S1;
S3, the unmanned aerial vehicle adopts the heuristic Q-learning algorithm, explores the search area by using the unmanned aerial vehicle search reward mechanism established in step S2, plans the search path and realizes the optimal search of the unmanned aerial vehicle;
S4, the unmanned aerial vehicle updates the search environment perception map according to the search result of step S3;
S5, fusing attribute values of the multi-unmanned-aerial-vehicle environment perception maps: each unmanned aerial vehicle communicates with the unmanned aerial vehicles within its communication range, exchanging and fusing the perception information of the search environment map;
and S6, repeating steps S3-S5 until all targets are searched.
Preferably, the environment search map establishment in step S1 is specifically as follows:
rasterizing the search area into an N × M discretized grid map;
the length and width of each grid cell are dx and dy respectively; dx and dy are the distance the unmanned aerial vehicle flies at its average level flight speed within one planning period.
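As an illustrative sketch of this rasterization (not part of the original disclosure), the grid dimensions and an initial target existence probability map can be built as follows; the function name, the uniform 0.5 initialization and the parameter names v_avg and t_plan are assumptions:
import numpy as np

def build_grid_map(area_width, area_height, v_avg, t_plan):
    # Grid cell edge length: distance flown at average level flight speed
    # during one planning period (square cells assumed, so dx = dy).
    dx = dy = v_avg * t_plan
    N = int(np.ceil(area_width / dx))
    M = int(np.ceil(area_height / dy))
    # Target existence probability map; 0.5 means "target present" and
    # "target absent" are equally likely at the initial time.
    p_map = np.full((N, M), 0.5)
    return p_map, dx, dy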
Preferably, the environment search map in step S1 is divided into a target existence probability map and an uncertainty map;
the establishment of the target existence probability map model is as follows:
(1) defining p(x, y, k) ∈ (0, 1) to represent the probability that a target is located at (x, y) at time k; p(x, y, 0) = 0.5 indicates that at the initial time the target is equally likely to exist or not exist at (x, y); p(x, y, 0) is given as follows:
[formula given as image BDA0003557862990000021 in the original]
where v_n represents the peak width of the target existence probability; c_n represents the peak value of the target existence probability; (x_n, y_n) represents a location where a target may appear;
(2) the unmanned aerial vehicle has a detection probability p_d and a false alarm probability p_f in the detection process, defined as:
p(Z(x, y, k) = 1 | δ(x, y, k) = 1) = p_d
p(Z(x, y, k) = 1 | δ(x, y, k) = 0) = p_f
where Z(x, y, k) represents the target detection result of the unmanned aerial vehicle at (x, y) at time k; δ(x, y, k) indicates whether a target is present at (x, y) at time k;
(3) as the unmanned aerial vehicle searches the area, the attribute values of the target existence probability map are continuously updated according to Bayes' theorem; the update formula of the target existence probability map is as follows:
[formulas given as images BDA0003557862990000022 and BDA0003557862990000031 in the original]
where Φ_k represents the detection range of the unmanned aerial vehicle at time k;
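Since the update formulas above appear only as images, the following sketch assumes the standard Bayesian update with detection probability p_d and false alarm probability p_f; it is an illustrative assumption, not a reproduction of the patented formula:
import numpy as np

def bayes_update(p_prev, detected, in_footprint, p_d=0.95, p_f=0.015):
    # p_prev: prior target existence probability map (N x M array)
    # detected: boolean array, sensor reading Z(x, y, k)
    # in_footprint: boolean array, True where the cell lies in the detection range Phi_k
    p = p_prev.copy()
    hit = in_footprint & detected
    miss = in_footprint & ~detected
    p[hit] = p_d * p_prev[hit] / (p_d * p_prev[hit] + p_f * (1 - p_prev[hit]))
    p[miss] = (1 - p_d) * p_prev[miss] / (
        (1 - p_d) * p_prev[miss] + (1 - p_f) * (1 - p_prev[miss]))
    return p  # cells outside the footprint keep their previous value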
the uncertainty map model is established as follows:
(1) defining χ(x, y, k) as the uncertainty information at time k, and initializing the uncertainty map with the following formula:
χ(x,y,0)=-p(x,y,0)log2 p(x,y,0);
(2) the uncertainty map attribute value decays exponentially as the number of unmanned aerial vehicle visits increases, with the following formula:
χ(x, y, k) = τ·χ(x, y, k)
where τ ∈ (0, 1) represents the decay factor.
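A minimal sketch of the uncertainty map initialization and per-visit decay described above (the function names and the array-mask interface are illustrative):
import numpy as np

def init_uncertainty(p0):
    # chi(x, y, 0) = -p(x, y, 0) * log2 p(x, y, 0)
    return -p0 * np.log2(p0)

def decay_uncertainty(chi, visited, tau=0.5):
    # Each visit multiplies the cell's uncertainty by tau in (0, 1),
    # so uncertainty decays exponentially with the number of visits.
    chi = chi.copy()
    chi[visited] *= tau
    return chi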
Preferably, the unmanned aerial vehicle search reward and heuristic reward mechanism in step S2 is specifically as follows:
S201, the search reward r consists of four parts r_A, r_B, r_C and r_D, with the following formula:
r = w_1·r_A + w_2·r_B + w_3·r_C + w_4·r_D
where r_A denotes the target existence probability reward; r_B denotes the uncertainty reward; r_C denotes the flight cost reward; r_D denotes the expected search reward;
S202, the target existence probability reward r_A is set as follows:
r_A = (1 − σ(x, y, k))·p(x, y, k);
σ(x, y, k) = 1 if p(x, y, k) > p_max, and σ(x, y, k) = 0 otherwise;
where σ(x, y, k) indicates whether the unmanned aerial vehicle considers a target to be present at (x, y); when p(x, y, k) > p_max, the unmanned aerial vehicle considers that a target exists at (x, y);
S203, the uncertainty reward r_B is set as follows:
r_B = χ(x, y, k+1) − χ(x, y, k);
S204, the flight cost reward r_C is set as follows:
[formula given as image BDA0003557862990000041 in the original]
S205, the expected search reward r_D is set as follows:
[formula given as image BDA0003557862990000042 in the original]
S206, setting a heuristic reward mechanism: the heuristic reward attracts the unmanned aerial vehicle to move toward the location with the maximum global target existence probability and to stay away from the map boundary;
S207, the heuristic reward and the search reward jointly form the reward mechanism of heuristic reinforcement learning, with the following formula:
[formula given as image BDA0003557862990000043 in the original]
where D and F represent heuristic factors: D represents the distance from any map position to the position with the maximum global target existence probability, and F represents the minimum distance from the unmanned aerial vehicle position to the four map edges; c and d represent adjustment coefficients.
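Purely as an illustrative wiring of the weighted reward in S201 (the flight cost term, the expected search term and the heuristic term are given only as images in the original, so they are passed in here as opaque values; the weights w1 to w4 and the function name are assumptions):
def search_reward(p, chi_now, chi_next, sigma, flight_cost_term, expected_term,
                  w1=1.0, w2=1.0, w3=1.0, w4=1.0):
    # r = w1*rA + w2*rB + w3*rC + w4*rD
    r_a = (1.0 - sigma) * p          # target existence probability reward rA
    r_b = chi_next - chi_now         # uncertainty reward rB
    r_c = flight_cost_term           # flight cost reward rC (formula not reproduced)
    r_d = expected_term              # expected search reward rD (formula not reproduced)
    return w1 * r_a + w2 * r_b + w3 * r_c + w4 * r_d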
Preferably, the heuristic Q-learning algorithm in step S3 is specifically as follows:
S301, building a Q table, initializing Q(s, a) to arbitrary values, and setting the eligibility trace e(s, a) to 0;
S302, repeat:
initialize the unmanned aerial vehicle state s;
repeat (for each step within the round):
select action a according to the learned Q table and the heuristic action selection strategy
take action a to obtain the heuristic reward r and the next state s′
δ ← r + γ·max_{a′} Q(s′, a′) − Q(s, a)
e(s,a)←1
s←s′
where e(s, a) is the eligibility trace of the unmanned aerial vehicle taking action a in state s; γ is the discount factor; δ is the increment of Q.
For all states s and actions a, update the value function and eligibility trace of the unmanned aerial vehicle:
Q(s,a)←Q(s,a)+αδe(s,a)
e(s,a)←γλe(s,a)
where α is the learning rate; λ is the trace discount factor.
until the round step count is reached
until all Q(s, a) converge;
S303, outputting the final policy, with the following formula:
π*(s) = argmax_a Q(s, a)
where π* is the optimal policy of the unmanned aerial vehicle; Q(s, a) is the action-value of the unmanned aerial vehicle executing action a in state s;
setting a heuristic action selection mechanism: the action selection strategy is improved by utilizing the heuristic information of a harmonic function E(s, a), so that the unmanned aerial vehicle tends to move toward the region with the maximum target existence probability; the formula is as follows:
a = argmax_a [Q(s, a) + E(s, a)] with probability ε, and a = a_random with probability 1 − ε
where E(s, a) is the harmonic function value of the unmanned aerial vehicle executing action a in state s; ε ∈ (0, 1) is a random probability value; a_random is an action selected at random from the action set;
when the agent selects an action in state s, with probability ε it selects the action with the maximum sum of the value function and the harmonic function in that state, and with probability 1 − ε it selects an action at random; the formula is as follows:
[formula given as image BDA0003557862990000053 in the original]
where p_k represents the position of the unmanned aerial vehicle at time k; p_{p_max} represents the position with the maximum target existence probability; η represents a positive coefficient for adjusting the magnitude of the harmonic function E(s, a).
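The following sketch shows one way steps S301 to S303 could be realized in code; the environment interface, the Watkins-style TD target and the harmonic table E are assumptions (the patent gives the corresponding formulas as images), and the numerical parameters are taken from embodiment 1:
import numpy as np

def heuristic_q_lambda(env, E, n_states, n_actions,
                       alpha=0.01, gamma=0.9, lam=0.9, eps=0.6, episodes=500):
    # env.reset() -> s and env.step(s, a) -> (r, s_next, done) are assumed interfaces.
    # E[s, a] is the harmonic (heuristic) function table.
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)                 # eligibility traces, reset each round
        s = env.reset()
        done = False
        while not done:
            if np.random.rand() < eps:       # heuristic-greedy with probability eps
                a = int(np.argmax(Q[s] + E[s]))
            else:                            # random action with probability 1 - eps
                a = np.random.randint(n_actions)
            r, s_next, done = env.step(s, a)
            delta = r + gamma * np.max(Q[s_next]) - Q[s, a]   # assumed TD target
            e[s, a] = 1.0                    # replacing trace, as in the pseudocode
            Q += alpha * delta * e           # update all state-action pairs
            e *= gamma * lam                 # decay eligibility traces
            s = s_next
    policy = np.argmax(Q, axis=1)            # pi*(s) = argmax_a Q(s, a)
    return Q, policy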
Preferably, the environment perception map update in step S4 is specifically as follows:
S401, the unmanned aerial vehicle explores the environment under the guidance of the unmanned aerial vehicle search method based on the heuristic Q-learning algorithm and enters a new exploration area;
S402, the unmanned aerial vehicle updates the target existence probability map and the uncertainty map according to its exploration results and the update rules of the environment perception map.
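As a usage sketch of the S401-S402 map update, reusing the bayes_update and decay_uncertainty helpers sketched above (the function name and its arguments are illustrative assumptions):
def update_maps_one_step(p_map, chi_map, detected, in_view,
                         p_d=0.95, p_f=0.015, tau=0.5):
    # S401-S402: after exploring a new area, refresh both perception maps.
    p_map = bayes_update(p_map, detected, in_view, p_d=p_d, p_f=p_f)
    chi_map = decay_uncertainty(chi_map, in_view, tau=tau)
    return p_map, chi_map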
Preferably, the attribute value fusion of the multi-drone environment perception map in step S5 is specifically as follows:
during flight, each distributed unmanned aerial vehicle exchanges environment perception map information with the unmanned aerial vehicles within its communication range and fuses the attribute values of the environment perception map, with the following formulas:
[formulas given as images BDA0003557862990000061 and BDA0003557862990000062 in the original]
where n represents the number of unmanned aerial vehicles (including itself) within the communication range; v_{i,k}(x, y) represents the attribute value (target existence probability value, uncertainty value, etc.) of the environment perception map of unmanned aerial vehicle i at (x, y) at time k; V_k(x, y) represents the attribute value of the environment perception map at (x, y) after fusion at time k.
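A minimal fusion sketch; plain averaging over the n maps is an assumption made for illustration, since the fusion formulas appear only as images in the original:
import numpy as np

def fuse_maps(maps):
    # maps: list of attribute-value arrays v_{i,k}(x, y) received from the n UAVs
    # within communication range, including the UAV's own map.
    return np.mean(np.stack(maps, axis=0), axis=0)   # fused V_k(x, y)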
An unmanned aerial vehicle collaborative search system based on heuristic Q-learning algorithm comprises,
the building module I is used for building an unmanned aerial vehicle dynamics model and a search environment perception map;
the establishment module II is used for establishing an unmanned aerial vehicle search reward and heuristic reward mechanism;
the planning module is used for exploring the search area by using an unmanned aerial vehicle search reward mechanism by adopting a heuristic Q-learning algorithm, planning a search path and realizing the optimal search of the unmanned aerial vehicle;
the updating module is used for updating the search environment perception map according to the search result;
and the fusion module is used for fusing the attribute values of the multi-unmanned aerial vehicle environment perception map, namely, the unmanned aerial vehicles and the unmanned aerial vehicles in the communication range perform information intercommunication, exchange and fuse the perception information of the search environment map.
An electronic device, comprising: a memory and at least one processor;
wherein the memory has stored thereon a computer program;
the at least one processor executes the computer program stored by the memory, causing the at least one processor to perform a heuristic Q-learning algorithm based unmanned aerial vehicle collaborative search method as described above.
A computer-readable storage medium having stored thereon a computer program executable by a processor to implement a method for collaborative search of unmanned aerial vehicles based on a heuristic Q-learning algorithm as described above.
The unmanned aerial vehicle collaborative search method and system based on the heuristic Q-learning algorithm have the following advantages:
firstly, the value of adjacent grids is considered; the method is data-driven, independent of derivatives, requires no prediction, and has excellent convergence and robustness; meanwhile, a heuristic reward mechanism is adopted to design reasonable rewards, and prior knowledge is utilized to guide the agent to better complete the task during operation; the optimal search path is decided online, which improves the search efficiency;
secondly, the invention provides a multi-unmanned-aerial-vehicle search scheme with strong robustness, short single-step decision time and high search efficiency, thereby guiding the unmanned aerial vehicles to complete the search task quickly;
thirdly, compared with the prior art, the single-step search time is shorter, and the search efficiency of the unmanned aerial vehicle is improved;
fourthly, the invention realizes rapid collaborative search of a fixed area by unmanned aerial vehicles; each unmanned aerial vehicle scans the environment sub-area information and self-learns the optimal search path in the environment; the method is strongly robust and improves the unmanned aerial vehicle search efficiency;
and fifthly, the preset heuristic information guides the unmanned aerial vehicle away from the boundary and toward areas with higher target existence probability more quickly, which increases the probability of finding targets, improves the unmanned aerial vehicle search efficiency, and reduces the time for the unmanned aerial vehicle to complete the task.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a flow chart of a collaborative search method for an unmanned aerial vehicle based on a heuristic Q-learning algorithm;
FIG. 2 is an environmental grid map and a model diagram of an unmanned aerial vehicle dynamics;
FIG. 3 is a diagram of a reinforcement learning framework;
FIG. 4 is a block diagram of a heuristic Q-learning algorithm;
FIG. 5 is a diagram of an initial position distribution of the unmanned aerial vehicle and the search target;
fig. 6 is an unmanned aerial vehicle initial target existence probability map;
fig. 7 is an initial uncertainty map of the drone;
fig. 8 is a target existence probability map of the unmanned aerial vehicle 1 when a task is completed;
fig. 9 is a target existence probability map of the drone 2 at the completion of the task;
fig. 10 is a diagram of the target existence probability of the drone 3 at the completion of the mission.
Detailed Description
The unmanned aerial vehicle collaborative search method and system based on the heuristic Q-learning algorithm of the invention are described in detail below with reference to the drawings and specific embodiments of the specification.
Example 1:
as shown in fig. 1, this embodiment provides an unmanned aerial vehicle collaborative search method based on a heuristic Q-learning algorithm, and the method specifically includes:
S1, establishing an unmanned aerial vehicle dynamics model and a search environment perception map;
S2, establishing an unmanned aerial vehicle search reward and heuristic reward mechanism: designing an effective unmanned aerial vehicle search reward mechanism based on the unmanned aerial vehicle dynamics model, the environment perception map and the unmanned aerial vehicle state established in step S1;
S3, the unmanned aerial vehicle adopts the heuristic Q-learning algorithm, explores the search area by using the unmanned aerial vehicle search reward mechanism established in step S2, plans the search path and realizes the optimal search of the unmanned aerial vehicle;
S4, the unmanned aerial vehicle updates the search environment perception map according to the search result of step S3;
S5, fusing attribute values of the multi-unmanned-aerial-vehicle environment perception maps: each unmanned aerial vehicle communicates with the unmanned aerial vehicles within its communication range, exchanging and fusing the perception information of the search environment map;
and S6, repeating steps S3-S5 until all targets are searched.
The environment search map establishment in step S1 in this embodiment is specifically as follows:
rasterizing the search area into an N × M discretized grid map;
the length and width of each grid cell are dx and dy respectively; dx and dy are the distance the unmanned aerial vehicle flies at its average level flight speed within one planning period.
The unmanned aerial vehicle is a fixed-wing unmanned aerial vehicle constrained by its dynamics; within each planning period it can turn left 45 degrees, go straight, or turn right 45 degrees, as shown in FIG. 2.
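An illustrative kinematic sketch of this three-action motion model, assuming the heading is quantized to 45-degree increments and the vehicle advances roughly one grid cell per planning period (the action codes and function name are assumptions):
import numpy as np

LEFT_45, STRAIGHT, RIGHT_45 = 0, 1, 2    # illustrative action codes

def step_uav(x, y, heading_deg, action, dx, dy):
    # Turn left 45 deg, keep straight, or turn right 45 deg, then advance.
    if action == LEFT_45:
        heading_deg = (heading_deg + 45) % 360
    elif action == RIGHT_45:
        heading_deg = (heading_deg - 45) % 360
    x += dx * np.cos(np.radians(heading_deg))
    y += dy * np.sin(np.radians(heading_deg))
    return x, y, heading_deg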
The environment search map in step S1 of the present embodiment is divided into a target existence probability map and an uncertainty map;
as shown in fig. 6, the establishment of the target existence probability map model is specifically as follows:
(1) defining p(x, y, k) ∈ (0, 1) to represent the probability that a target is located at (x, y) at time k; p(x, y, 0) = 0.5 indicates that at the initial time the target is equally likely to exist or not exist at (x, y); p(x, y, 0) is given as follows:
[formula given as image BDA0003557862990000091 in the original]
where v_n represents the peak width of the target existence probability; c_n represents the peak value of the target existence probability; (x_n, y_n) represents a location where a target may appear;
(2) the unmanned aerial vehicle has a detection probability p_d and a false alarm probability p_f in the detection process, defined as:
p(Z(x, y, k) = 1 | δ(x, y, k) = 1) = p_d
p(Z(x, y, k) = 1 | δ(x, y, k) = 0) = p_f
where Z(x, y, k) represents the target detection result of the unmanned aerial vehicle at (x, y) at time k; δ(x, y, k) indicates whether a target is present at (x, y) at time k;
(3) as the unmanned aerial vehicle searches the area, the attribute values of the target existence probability map are continuously updated according to Bayes' theorem; the update formula of the target existence probability map is as follows:
[formula given as image BDA0003557862990000092 in the original]
where Φ_k represents the detection range of the unmanned aerial vehicle at time k;
as shown in fig. 7, the uncertainty map model is built as follows:
(1) defining χ(x, y, k) as the uncertainty information at time k, and initializing the uncertainty map with the following formula:
χ(x,y,0)=-p(x,y,0)log2 p(x,y,0);
(2) the uncertainty map attribute value decays exponentially as the number of unmanned aerial vehicle visits increases, with the following formula:
χ(x, y, k) = τ·χ(x, y, k)
where τ ∈ (0, 1) represents the decay factor.
As shown in FIG. 3, the unmanned aerial vehicle search reward and heuristic reward mechanism in step S2 of this embodiment is specifically as follows:
S201, the search reward r consists of four parts r_A, r_B, r_C and r_D, with the following formula:
r = w_1·r_A + w_2·r_B + w_3·r_C + w_4·r_D
where r_A denotes the target existence probability reward; r_B denotes the uncertainty reward; r_C denotes the flight cost reward; r_D denotes the expected search reward;
S202, the target existence probability reward r_A is set as follows:
r_A = (1 − σ(x, y, k))·p(x, y, k);
σ(x, y, k) = 1 if p(x, y, k) > p_max, and σ(x, y, k) = 0 otherwise;
where σ(x, y, k) indicates whether the unmanned aerial vehicle considers a target to be present at (x, y); when p(x, y, k) > p_max, the unmanned aerial vehicle considers that a target exists at (x, y);
S203, the uncertainty reward r_B is set as follows:
r_B = χ(x, y, k+1) − χ(x, y, k);
S204, the flight cost reward r_C is set as follows:
[formula given as image BDA0003557862990000102 in the original]
S205, the expected search reward r_D is set as follows:
[formula given as image BDA0003557862990000103 in the original]
S206, setting a heuristic reward mechanism: the heuristic reward attracts the unmanned aerial vehicle to move toward the location with the maximum global target existence probability and to stay away from the map boundary;
S207, the heuristic reward and the search reward jointly form the reward mechanism of heuristic reinforcement learning, with the following formula:
[formula given as image BDA0003557862990000104 in the original]
where D and F represent heuristic factors: D represents the distance from any map position to the position with the maximum global target existence probability, and F represents the minimum distance from the unmanned aerial vehicle position on the map to the four map edges; c and d represent adjustment coefficients.
As shown in fig. 4, the heuristic Q-learning algorithm in step S3 of the present embodiment is as follows:
S301, building a Q table, initializing Q(s, a) to arbitrary values, and setting the eligibility trace e(s, a) to 0;
S302, repeat:
initialize the unmanned aerial vehicle state s;
repeat (for each step within the round):
select action a according to the learned Q table and the heuristic action selection strategy
take action a to obtain the heuristic reward r and the next state s′
δ ← r + γ·max_{a′} Q(s′, a′) − Q(s, a)
e(s,a)←1
s←s′
where e(s, a) is the eligibility trace of the unmanned aerial vehicle taking action a in state s; γ is the discount factor; δ is the increment of Q.
For all states s and actions a, update the value function and eligibility trace of the unmanned aerial vehicle:
Q(s,a)←Q(s,a)+αδe(s,a)
e(s,a)←γλe(s,a)
where α is the learning rate; λ is the trace discount factor.
until the round step count is reached
until all Q(s, a) converge;
the key codes are as follows:
Repeat:
    Initialize s
    Repeat (for each step in the round):
        Select action a according to the Q table and the heuristic action selection strategy
        Take action a to obtain the heuristic reward r and the next state s′
        δ ← r + γ·max_{a′} Q(s′, a′) − Q(s, a)
        e(s, a) ← 1
        s ← s′
        For all states s and actions a:
            Q(s, a) ← Q(s, a) + α·δ·e(s, a)
            e(s, a) ← γ·λ·e(s, a)
    Until the round steps end
Until all Q(s, a) converge;
S303, outputting the final policy, with the following formula:
π*(s) = argmax_a Q(s, a)
where π* is the optimal policy of the unmanned aerial vehicle; Q(s, a) is the action-value of the unmanned aerial vehicle executing action a in state s;
setting a heuristic action selection mechanism: the action selection strategy is improved by utilizing the heuristic information of a harmonic function E(s, a), so that the unmanned aerial vehicle tends to move toward the region with the maximum target existence probability; the formula is as follows:
a = argmax_a [Q(s, a) + E(s, a)] with probability ε, and a = a_random with probability 1 − ε
where E(s, a) is the harmonic function value of the unmanned aerial vehicle executing action a in state s; ε ∈ (0, 1) is a random probability value; a_random is an action selected at random from the action set;
when the agent selects an action in state s, with probability ε it selects the action with the maximum sum of the value function and the harmonic function in that state, and with probability 1 − ε it selects an action at random; the formula is as follows:
[formula given as image BDA0003557862990000124 in the original]
where p_k represents the position of the unmanned aerial vehicle at time k; p_{p_max} represents the position with the maximum target existence probability; η represents a positive coefficient for adjusting the magnitude of the harmonic function E(s, a).
The environment perception map update in step S4 of this embodiment is specifically as follows:
S401, the unmanned aerial vehicle explores the environment under the guidance of the unmanned aerial vehicle search method based on the heuristic Q-learning algorithm and enters a new exploration area;
S402, the unmanned aerial vehicle updates the target existence probability map and the uncertainty map according to its exploration results and the update rules of the environment perception map.
In this embodiment, the attribute value fusion of the multi-drone environment perception map in step S5 is specifically as follows:
during flight, each distributed unmanned aerial vehicle exchanges environment perception map information with the unmanned aerial vehicles within its communication range and fuses the attribute values of the environment perception map, with the following formulas:
[formulas given as images BDA0003557862990000131 and BDA0003557862990000132 in the original]
where n represents the number of unmanned aerial vehicles (including itself) within the communication range; v_{i,k}(x, y) represents the attribute value (target existence probability value, uncertainty value, etc.) of the environment perception map of unmanned aerial vehicle i at (x, y) at time k; V_k(x, y) represents the attribute value of the environment perception map at (x, y) after fusion at time k.
In order to verify the search effect of the heuristic Q-learning algorithm-based unmanned aerial vehicle search method, Matlab simulation is carried out on the method, and the feasibility and the search efficiency of the method are verified, which are specifically as follows:
(1) taking a rectangular search area with the search area of 2km by 2km, and dividing the search area into 20 by 20 square grids with the grid size of 100m by 100 m;
(2) the initial positions of the 3 unmanned aerial vehicles are [450,250], [1750,1250], [950, 450], and the initial positions of the 3 targets are [1450, 250], [800,1500], [1300,700 ];
(3) the unmanned aerial vehicle and the target initial position are distributed as shown in figure 5, the detection range of the unmanned aerial vehicle is 300m x 300m, the detection period is set to be 4s, the detection probability is 0.95, and the false alarm probability is 0.015;
(4) in the heuristic Q-learning algorithm, the discount factor gamma is 0.9, the learning rate alpha is 0.01, the greedy selection coefficient epsilon is 0.6, and the trace discount factor lambda is 0.9; uncertainty map decay index τ is 0.5.
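For convenience, the simulation settings of this embodiment can be collected in a single configuration structure; this is only an illustrative grouping, and the key names are assumptions:
sim_config = {
    "area_size_m": (2000, 2000),            # 2 km x 2 km rectangular search area
    "grid": (20, 20),                       # 20 x 20 cells of 100 m x 100 m
    "uav_start": [(450, 250), (1750, 1250), (950, 450)],
    "target_positions": [(1450, 250), (800, 1500), (1300, 700)],
    "sensor_footprint_m": (300, 300),
    "detection_period_s": 4,
    "p_detect": 0.95,
    "p_false_alarm": 0.015,
    "gamma": 0.9,                           # discount factor
    "alpha": 0.01,                          # learning rate
    "epsilon": 0.6,                         # greedy selection coefficient
    "lambda": 0.9,                          # trace discount factor
    "tau": 0.5,                             # uncertainty map decay factor
}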
The simulation results show that after 164 s the unmanned aerial vehicles successfully search all targets in the area. FIGS. 8, 9 and 10 show the target existence probability maps of unmanned aerial vehicles 1, 2 and 3 after the task is finished. The unmanned aerial vehicles collaboratively cover most of the area to be searched and complete the coverage search of the area. The unmanned aerial vehicle collaborative search method based on the heuristic Q-learning algorithm can effectively guide the unmanned aerial vehicles to complete the search task, has good convergence and robustness, and improves the unmanned aerial vehicle search efficiency.
Example 2:
the embodiment provides an unmanned aerial vehicle collaborative search system based on heuristic Q-learning algorithm, which comprises,
the building module I is used for building an unmanned aerial vehicle dynamics model and a search environment perception map;
the establishment module II is used for establishing an unmanned aerial vehicle search reward and heuristic reward mechanism;
the planning module is used for exploring the search area by using an unmanned aerial vehicle search reward mechanism by adopting a heuristic Q-learning algorithm, planning a search path and realizing the optimal search of the unmanned aerial vehicle;
the updating module is used for updating the search environment perception map according to the search result;
and the fusion module is used for fusing the attribute values of the multi-unmanned aerial vehicle environment perception map, namely, the unmanned aerial vehicle and the unmanned aerial vehicle in a communication range perform information intercommunication, exchange and fuse the perception information of the search environment map.
Example 3:
the present embodiment also provides an electronic device, including: a memory and a processor;
wherein the memory stores computer execution instructions;
the processor executes the computer execution instructions stored in the memory, so that the processor executes the unmanned aerial vehicle collaborative search method based on the heuristic Q-learning algorithm in any embodiment of the invention.
Example 4:
the embodiment also provides a computer-readable storage medium, wherein a plurality of instructions are stored, and the instructions are loaded by the processor, so that the processor executes the unmanned aerial vehicle collaborative search method based on the heuristic Q-learning algorithm in any embodiment of the invention. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An unmanned aerial vehicle collaborative search method based on heuristic Q-learning algorithm is characterized by comprising the following steps:
S1, establishing an unmanned aerial vehicle dynamics model and a search environment perception map;
S2, establishing an unmanned aerial vehicle search reward and heuristic reward mechanism: designing an effective unmanned aerial vehicle search reward mechanism based on the unmanned aerial vehicle dynamics model, the environment perception map and the unmanned aerial vehicle state established in step S1;
S3, the unmanned aerial vehicle adopts the heuristic Q-learning algorithm, explores the search area by using the unmanned aerial vehicle search reward mechanism established in step S2, plans the search path and realizes the optimal search of the unmanned aerial vehicle;
S4, the unmanned aerial vehicle updates the search environment perception map according to the search result of step S3;
S5, fusing attribute values of the multi-unmanned-aerial-vehicle environment perception maps: each unmanned aerial vehicle communicates with the unmanned aerial vehicles within its communication range, exchanging and fusing the perception information of the search environment map;
S6, repeating steps S3-S5 until all targets are searched.
2. The collaborative unmanned aerial vehicle searching method based on the heuristic Q-learning algorithm of claim 1, wherein the environment search map establishment in step S1 is specifically as follows:
rasterizing the search area into an N × M discretized grid map;
the length and width of each grid cell are dx and dy respectively; dx and dy are the distance the unmanned aerial vehicle flies at its average level flight speed within one planning period.
3. The collaborative unmanned aerial vehicle searching method based on the heuristic Q-learning algorithm of claim 1, wherein the environment search map in step S1 is divided into a target existence probability map and an uncertainty map;
the establishment of the target existence probability map model is as follows:
(1) defining p(x, y, k) ∈ (0, 1) to represent the probability that a target is located at (x, y) at time k; p(x, y, 0) = 0.5 indicates that at the initial time the target is equally likely to exist or not exist at (x, y); p(x, y, 0) is given as follows:
[formula given as image FDA0003557862980000021 in the original]
where v_n represents the peak width of the target existence probability; c_n represents the peak value of the target existence probability; (x_n, y_n) represents a location where a target may appear;
(2) the unmanned aerial vehicle has a detection probability p_d and a false alarm probability p_f in the detection process, defined as:
p(Z(x, y, k) = 1 | δ(x, y, k) = 1) = p_d
p(Z(x, y, k) = 1 | δ(x, y, k) = 0) = p_f
where Z(x, y, k) represents the target detection result of the unmanned aerial vehicle at (x, y) at time k; δ(x, y, k) indicates whether a target is present at (x, y) at time k;
(3) as the unmanned aerial vehicle searches the area, the attribute values of the target existence probability map are continuously updated according to Bayes' theorem; the update formula of the target existence probability map is as follows:
[formula given as image FDA0003557862980000022 in the original]
where Φ_k represents the detection range of the unmanned aerial vehicle at time k;
the uncertainty map model is established as follows:
(1) defining χ(x, y, k) as the uncertainty information at time k, and initializing the uncertainty map with the following formula:
χ(x,y,0)=-p(x,y,0)log2p(x,y,0);
(2) the uncertainty map attribute value decays exponentially as the number of unmanned aerial vehicle visits increases, with the following formula:
χ(x, y, k) = τ·χ(x, y, k)
where τ ∈ (0, 1) represents the decay factor.
4. The collaborative unmanned aerial vehicle searching method based on the heuristic Q-learning algorithm of claim 1, wherein the unmanned aerial vehicle search reward and heuristic reward mechanism in step S2 is specifically as follows:
S201, the search reward r consists of four parts r_A, r_B, r_C and r_D, with the following formula:
r = w_1·r_A + w_2·r_B + w_3·r_C + w_4·r_D
where r_A denotes the target existence probability reward; r_B denotes the uncertainty reward; r_C denotes the flight cost reward; r_D denotes the expected search reward;
S202, the target existence probability reward r_A is set as follows:
r_A = (1 − σ(x, y, k))·p(x, y, k);
σ(x, y, k) = 1 if p(x, y, k) > p_max, and σ(x, y, k) = 0 otherwise;
where σ(x, y, k) indicates whether the unmanned aerial vehicle considers a target to be present at (x, y); when p(x, y, k) > p_max, the unmanned aerial vehicle considers that a target exists at (x, y);
S203, the uncertainty reward r_B is set as follows:
r_B = χ(x, y, k+1) − χ(x, y, k);
S204, the flight cost reward r_C is set as follows:
[formula given as image FDA0003557862980000032 in the original]
S205, the expected search reward r_D is set as follows:
[formula given as image FDA0003557862980000033 in the original]
S206, setting a heuristic reward mechanism: the heuristic reward attracts the unmanned aerial vehicle to move toward the location with the maximum global target existence probability and to stay away from the map boundary;
S207, the heuristic reward and the search reward jointly form the reward mechanism of heuristic reinforcement learning, with the following formula:
[formula given as image FDA0003557862980000041 in the original]
where D and F represent heuristic factors: D represents the distance from any map position to the position with the maximum global target existence probability, and F represents the minimum distance from the unmanned aerial vehicle position to the four map edges; c and d represent adjustment coefficients.
5. The collaborative search method for unmanned aerial vehicles based on heuristic Q-learning algorithm as claimed in claim 1, wherein the heuristic Q-learning algorithm in step S3 is as follows:
S301, building a Q table, initializing Q(s, a) to arbitrary values, and setting the eligibility trace e(s, a) to 0;
S302, repeat:
initialize the unmanned aerial vehicle state s;
repeat (for each step within the round):
select action a according to the learned Q table and the heuristic action selection strategy
take action a to obtain the heuristic reward r and the next state s′
δ ← r + γ·max_{a′} Q(s′, a′) − Q(s, a)
e(s,a)←1
s←s′
where e(s, a) is the eligibility trace of the unmanned aerial vehicle taking action a in state s; γ is the discount factor; δ is the increment of Q.
For all states s and actions a, update the value function and eligibility trace of the unmanned aerial vehicle:
Q(s,a)←Q(s,a)+αδe(s,a)
e(s,a)←γλe(s,a)
where α is the learning rate; λ is the trace discount factor.
until the round step count is reached
until all Q(s, a) converge;
S303, outputting the final policy, with the following formula:
π*(s) = argmax_a Q(s, a)
where π* is the optimal policy of the unmanned aerial vehicle; Q(s, a) is the action-value of the unmanned aerial vehicle executing action a in state s;
setting a heuristic action selection mechanism: the action selection strategy is improved by utilizing the heuristic information of a harmonic function E(s, a), so that the unmanned aerial vehicle tends to move toward the region with the maximum target existence probability; the formula is as follows:
a = argmax_a [Q(s, a) + E(s, a)] with probability ε, and a = a_random with probability 1 − ε
where E(s, a) is the harmonic function value of the unmanned aerial vehicle executing action a in state s; ε ∈ (0, 1) is a random probability value; a_random is an action selected at random from the action set;
when the agent selects an action in state s, with probability ε it selects the action with the maximum sum of the value function and the harmonic function in that state, and with probability 1 − ε it selects an action at random; the formula is as follows:
[formula given as image FDA0003557862980000053 in the original]
where p_k represents the position of the unmanned aerial vehicle at time k; p_{p_max} represents the position with the maximum target existence probability; η represents a positive coefficient for adjusting the magnitude of the harmonic function E(s, a).
6. The collaborative unmanned aerial vehicle searching method based on the heuristic Q-learning algorithm of claim 1, wherein the environment-aware map update in step S4 is specifically as follows:
S401, the unmanned aerial vehicle explores the environment under the guidance of the unmanned aerial vehicle search method based on the heuristic Q-learning algorithm and enters a new exploration area;
S402, the unmanned aerial vehicle updates the target existence probability map and the uncertainty map according to its exploration results and the update rules of the environment perception map.
7. The collaborative search method for unmanned aerial vehicles based on heuristic Q-learning algorithm of any of claims 1-6, wherein the attribute value fusion of the multi-unmanned aerial vehicle environment perception map in step S5 is specifically as follows:
during flight, each distributed unmanned aerial vehicle exchanges environment perception map information with the unmanned aerial vehicles within its communication range and fuses the attribute values of the environment perception map, with the following formulas:
[formulas given as images FDA0003557862980000061 and FDA0003557862980000062 in the original]
where n represents the number of unmanned aerial vehicles within the communication range; v_{i,k}(x, y) represents the attribute value of the environment perception map of unmanned aerial vehicle i at (x, y) at time k; V_k(x, y) represents the attribute value of the environment perception map at (x, y) after fusion at time k.
8. An unmanned aerial vehicle collaborative search system based on heuristic Q-learning algorithm is characterized by comprising,
the building module I is used for building an unmanned aerial vehicle dynamics model and a search environment perception map;
the establishment module II is used for establishing an unmanned aerial vehicle search reward and heuristic reward mechanism;
the planning module is used for exploring the search area by using a heuristic Q-learning algorithm and an unmanned aerial vehicle search reward mechanism, planning a search path and realizing optimal search of the unmanned aerial vehicle;
the updating module is used for updating the search environment perception map according to the search result;
and the fusion module is used for fusing the attribute values of the multi-unmanned aerial vehicle environment perception map, namely, the unmanned aerial vehicle and the unmanned aerial vehicle in a communication range perform information intercommunication, exchange and fuse the perception information of the search environment map.
9. An electronic device, comprising: a memory and at least one processor;
wherein the memory has stored thereon a computer program;
the at least one processor executing the memory-stored computer program causes the at least one processor to perform the unmanned aerial vehicle collaborative search method based on the heuristic Q-learning algorithm of any of claims 1 to 7.
10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program is executable by a processor to implement the unmanned aerial vehicle collaborative search method based on the heuristic Q-learning algorithm according to any one of claims 1 to 7.
CN202210281173.XA 2022-03-22 2022-03-22 Unmanned aerial vehicle collaborative search method and system based on heuristic Q-learning algorithm Pending CN114625167A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210281173.XA CN114625167A (en) 2022-03-22 2022-03-22 Unmanned aerial vehicle collaborative search method and system based on heuristic Q-learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210281173.XA CN114625167A (en) 2022-03-22 2022-03-22 Unmanned aerial vehicle collaborative search method and system based on heuristic Q-learning algorithm

Publications (1)

Publication Number Publication Date
CN114625167A true CN114625167A (en) 2022-06-14

Family

ID=81904453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210281173.XA Pending CN114625167A (en) 2022-03-22 2022-03-22 Unmanned aerial vehicle collaborative search method and system based on heuristic Q-learning algorithm

Country Status (1)

Country Link
CN (1) CN114625167A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117472083A (en) * 2023-12-27 2024-01-30 南京邮电大学 Multi-unmanned aerial vehicle collaborative marine search path planning method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171315A (en) * 2017-12-27 2018-06-15 南京邮电大学 Multiple no-manned plane method for allocating tasks based on SMC particle cluster algorithms
CN110196605A (en) * 2019-04-26 2019-09-03 大连海事大学 A kind of more dynamic object methods of the unmanned aerial vehicle group of intensified learning collaboratively searching in unknown sea area
CN111880565A (en) * 2020-07-22 2020-11-03 电子科技大学 Q-Learning-based cluster cooperative countermeasure method
CN111896006A (en) * 2020-08-11 2020-11-06 燕山大学 Path planning method and system based on reinforcement learning and heuristic search
CN112817327A (en) * 2020-12-30 2021-05-18 北京航空航天大学 Multi-unmanned aerial vehicle collaborative search method under communication constraint
CN113110463A (en) * 2021-04-22 2021-07-13 山东新一代信息产业技术研究院有限公司 Delivery service robot system based on intelligent disinfection cabin

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171315A (en) * 2017-12-27 2018-06-15 南京邮电大学 Multiple no-manned plane method for allocating tasks based on SMC particle cluster algorithms
CN110196605A (en) * 2019-04-26 2019-09-03 大连海事大学 A kind of more dynamic object methods of the unmanned aerial vehicle group of intensified learning collaboratively searching in unknown sea area
CN111880565A (en) * 2020-07-22 2020-11-03 电子科技大学 Q-Learning-based cluster cooperative countermeasure method
CN111896006A (en) * 2020-08-11 2020-11-06 燕山大学 Path planning method and system based on reinforcement learning and heuristic search
CN112817327A (en) * 2020-12-30 2021-05-18 北京航空航天大学 Multi-unmanned aerial vehicle collaborative search method under communication constraint
CN113110463A (en) * 2021-04-22 2021-07-13 山东新一代信息产业技术研究院有限公司 Delivery service robot system based on intelligent disinfection cabin

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
NA XIA, et al.: "Optimization algorithms in wireless monitoring networks: A survey", NEUROCOMPUTING, 31 December 2022 (2022-12-31), pages 584 *
FANG Lijin, et al.: "Multi-scenario motion planning of a manipulator based on an improved RRT*FN algorithm", China Mechanical Engineering, vol. 32, no. 21, 30 November 2021 (2021-11-30), pages 2590-2597 *
FANG Min, LI Hao: "Heuristic Q-learning based on state backtracking cost analysis", Pattern Recognition and Artificial Intelligence, vol. 26, no. 9, 30 September 2013 (2013-09-30), pages 838-844 *
CHENG Chuanbin, et al.: "Improved dynamic A*-Q-Learning algorithm and its application in UAV trajectory planning", Modern Information Technology, vol. 5, no. 9, 10 May 2021 (2021-05-10), pages 1-6 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117472083A (en) * 2023-12-27 2024-01-30 南京邮电大学 Multi-unmanned aerial vehicle collaborative marine search path planning method
CN117472083B (en) * 2023-12-27 2024-02-23 南京邮电大学 Multi-unmanned aerial vehicle collaborative marine search path planning method

Similar Documents

Publication Publication Date Title
Faust et al. Prm-rl: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning
CN107169608B (en) Distribution method and device for multiple unmanned aerial vehicles to execute multiple tasks
WO2022007179A1 (en) Multi-agv motion planning method, apparatus, and system
CN109597425B (en) Unmanned aerial vehicle navigation and obstacle avoidance method based on reinforcement learning
CN107103164B (en) Distribution method and device for unmanned aerial vehicle to execute multiple tasks
CN110926477A (en) Unmanned aerial vehicle route planning and obstacle avoidance method
CN112859912A (en) Adaptive optimization method and system for unmanned aerial vehicle path planning in relay charging mode
CN114740846A (en) Hierarchical path planning method for topology-grid-metric hybrid map
Sun et al. A cooperative target search method based on intelligent water drops algorithm
CN114261400A (en) Automatic driving decision-making method, device, equipment and storage medium
US20240239484A1 (en) Fast path planning for dynamic avoidance in partially known environments
CN114879716B (en) Law enforcement unmanned aerial vehicle path planning method for countering low-altitude airspace aircraft
CN114625167A (en) Unmanned aerial vehicle collaborative search method and system based on heuristic Q-learning algorithm
CN110647162B (en) Route planning method for tour guide unmanned aerial vehicle, terminal equipment and storage medium
CN116203990A (en) Unmanned plane path planning method and system based on gradient descent method
CN116772846A (en) Unmanned aerial vehicle track planning method, unmanned aerial vehicle track planning device, unmanned aerial vehicle track planning equipment and unmanned aerial vehicle track planning medium
CN114840016B (en) Multi-ant colony search submarine target cooperative path optimization method based on rule heuristic method
Qiu et al. Obstacle avoidance planning combining reinforcement learning and RRT* applied to underwater operations
CN116578080A (en) Local path planning method based on deep reinforcement learning
CN116048126A (en) ABC rapid convergence-based unmanned aerial vehicle real-time path planning method
CN115202359A (en) Unmanned ship path planning method based on reinforcement learning and rapid expansion of random tree
CN114637331A (en) Unmanned aerial vehicle multi-task path planning method and system based on ant colony algorithm
Song et al. UAV Path Planning Based on an Improved Ant Colony Algorithm
Yao et al. Path Planning of Unmanned Helicopter in Complex Environment Based on Heuristic Deep Q‐Network
CN117032247B (en) Marine rescue search path planning method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination