CN117032247A - Marine rescue search path planning method, device and equipment


Info

Publication number
CN117032247A
Authority
CN
China
Prior art keywords
search
action
area
state
search unit
Prior art date
Legal status
Granted
Application number
CN202311060494.8A
Other languages
Chinese (zh)
Other versions
CN117032247B (en)
Inventor
高盈盈
杨清清
杨志伟
杨鹏程
郭玙
许婧
杨克巍
姜江
王翔汉
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202311060494.8A priority Critical patent/CN117032247B/en
Publication of CN117032247A publication Critical patent/CN117032247A/en
Application granted granted Critical
Publication of CN117032247B publication Critical patent/CN117032247B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method, a device and equipment for planning a marine rescue search path. The method comprises the following steps: determining a task target based at least on the characteristics of the search unit once the marine search area and the search unit are determined; constructing a state space and an action space based on the search area and/or the search unit, wherein the state space comprises the state characteristics of each area in the search area, and the action space comprises the set of actions the search unit can execute in each area of the search area; and constructing a reward function, and performing iterative computation based on the reward function, the task target, the state space, the action space, a first or second action strategy, a learning rate strategy and a target algorithm until convergence to obtain a search path. The method for planning a marine rescue search path can quickly and effectively determine the search path when a marine rescue task is executed.

Description

Marine rescue search path planning method, device and equipment
Technical Field
The invention belongs to the technical field of marine rescue, and particularly relates to a method, a device and equipment for planning a marine rescue search path.
Background
Offshore moving-target search path planning is a complex task that must account for the complexity and uncertainty of the marine environment, such as currents, wind direction, target location, and the search capability of the search unit. Scientifically planning the search path so that the algorithm is efficient, the search time is short, and the search result is accurate is therefore vital to maritime safety and disaster emergency response. With the continuous development of information technology and artificial intelligence, offshore moving-target search path planning technology continues to be innovated and improved. It can be widely applied to offshore rescue, marine environment monitoring, maritime traffic control, sea-battlefield target search and other fields, bringing more reliable and efficient safeguards for the safety and development of the offshore industry.
Search theory was first proposed to study how, with limited resources, to find a target whose location is uncertain, i.e. to maximize the probability of successful search (POS); the magnitude of the POS value depends on the probability of containment (POC) and the probability of detection (POD). Search planning theory distills a set of search criteria from search theory through extensive simplification and summarization. Its main concern is to use search theory to determine the optimal search area, allocate search resources and formulate a search plan while consuming the fewest search resources, which are also hot and difficult topics in offshore search planning research.
Currently, search planning theory mainly covers optimal search area determination, search resource allocation and search path planning. Search path planning determines a search path through a suitable algorithm so as to balance planning efficiency and search time. However, when determining the optimal search path, existing algorithms have difficulty jointly considering conditions such as preferentially searching high-probability areas, reducing turning time and limiting endurance mileage, and they remain inflexible in learning-rule modeling, low-probability-area exploration strategies, action selection strategies and learning rate strategies, so the algorithm efficiency still needs to be improved.
Disclosure of Invention
The invention aims to provide a method, a device and equipment for planning an offshore rescue search path, which can quickly and effectively determine the search path when an offshore rescue task is executed.
The invention relates to a method for planning a search path of marine rescue, which comprises the following steps:
determining a task target based at least on the characteristics of a search unit once the marine search area and the search unit are determined;
respectively constructing a state space and an action space based on the search area and/or the search unit, wherein the state space comprises state characteristics of each area in the search area, and the action space comprises an action set executable by the search unit in each area of the search area;
Constructing a reward function, and carrying out iterative computation based on the reward function, a task target, a state space, an action space, a first action strategy or a second action strategy, a learning rate strategy and a target algorithm to converge and obtain a search path;
the target algorithm is used for calculating estimated values of action values of the search unit at different moments based on a time sequence difference algorithm, wherein the action values represent expected values of return obtained by the search unit executing a certain action.
In some embodiments, the determining a task goal based at least on the characteristics of the search unit includes:
determining the task target based on the type and the volume of the search unit, or additionally combining the boundary shape and environment information of the search area, wherein the type of the search unit comprises ships and aircraft, the task target comprises minimizing the time required to cover the search area and maximizing the cumulative search success rate under an endurance constraint, and different task targets have different constraint conditions related to the endurance mileage and/or the coverage rate of the search area and/or the turning time.
In some embodiments, the constructing a state space, an action space based on the search area and/or the search unit, respectively, includes:
Gridding the search area;
marking each grid in the search area with a corresponding state based on the environment information of the search area, the type of the search unit and the search states of different areas in the search area, wherein the type of the search unit comprises a ship and an aircraft, and the search states are in a real-time update state based on actual search conditions;
calculating the search success probability corresponding to each grid in the search area;
and fusing the search areas after the gridding processing based on the search success probability and the marking information to obtain the state space.
In some embodiments, the building a state space, an action space based on the search area and/or the search unit, respectively, further includes:
determining a steering capability of the search unit based on a type and a volume of the search unit;
determining an action space of the search unit based on the steering capability;
when the state space meets the target requirement, the action space is constructed as a matrix corresponding to the state space, the parameters corresponding to each grid in the matrix being the actions the search unit can take in the state of the search area corresponding to that grid, wherein the actions that can be taken are those that keep the search unit within the search area and always in a non-forbidden state.
In some embodiments, the constructing the reward function includes:
constructing an instant reward function based on a mechanism whereby the search unit obtains an exploration reward when searching a grid area in the unsearched state, a turn-back penalty when searching a grid area in the searched state, and a futile-search penalty when searching a grid area in the futile state, wherein a grid area in the futile state contains no search targets or only a very small number of search targets;
based on different task targets, designing different round reward functions R_target by combining the weighted cumulative search success rate POS_cum, the historical maximum weighted cumulative search success rate POS_cum_max and the historical minimum number of turns T_min;
when the task target is minimizing the time required to cover the search area, granting the round reward if the weighted cumulative search success rate exceeds the historical peak and the number of turns of the search unit is below the historical valley: R_target if POS_cum > POS_cum_max and T < T_min, else 0;
when the task target is maximizing the cumulative search success rate under the endurance constraint, granting the round reward if the search unit reaches the maximum endurance mileage and the weighted cumulative search success rate exceeds the historical peak: R_target if POS_cum > POS_cum_max and Step = N_max, else 0;
wherein the process of the search unit reaching a staged termination state along a finite trajectory is one round, T is the number of turns of the search unit in the current round, Step is the number of search steps of the search unit in the current round, N_max is the maximum number of search steps in one round of the search unit, POS_n is the search success rate accumulated when the search unit enters the next grid area, and Step_n is the number of steps taken by the search unit to enter the current state in the current round; as the number of steps increases, the corresponding round reward decays continuously.
In some embodiments, under the first action strategy the search unit randomly selects a search action with probability ε, ε being the greedy coefficient; the smaller ε is, the more the first action strategy tends to make the search unit concentrate on selecting high-value search actions;
the second action strategy is to enable the search unit to select different search actions by configuring different weights and temperature parameters tau;
the method further comprises the steps of:
as the number of iterations increases, the greedy coefficient and temperature parameter are gradually reduced based at least on a cosine decay strategy.
In some embodiments, the learning rate policy includes updating the learning rate α based on historical round data at the end of each round, increasing the learning rate α if the average of rewards in the historical rounds increases, and decreasing the learning rate α otherwise.
In some embodiments, the performing iterative computation based on the reward function, task goal, state space, action space, first action strategy or second action strategy, learning rate strategy, and goal algorithm to converge to obtain a search path includes:
Constructing a model for selecting a search action whose action value meets a requirement based on a target algorithm, the model comprising:
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - α_t(s_t, a_t) [ q_t(s_t, a_t) - ( r_{t+1} + γ max_a q_t(s_{t+1}, a) ) ]
wherein (s_t, a_t) is the state-action pair at time t, q_t(s_t, a_t) is the estimated action value of (s_t, a_t), α_t(s_t, a_t) is the learning rate of (s_t, a_t) at time t, γ is the discount factor, and r_{t+1} is the reward value at time t+1;
updating a policy based on the selected search actions, the policy characterizing a probability that the search unit takes each search action in a different state;
comparing the historical optimal strategies, and if the current updated strategy is better than the historical optimal strategy, adopting a searching action corresponding to the current strategy;
under the search area, the search actions selected based on policy comparison form the search path.
The other embodiment of the invention also provides a marine rescue search path planning device, which comprises:
a determination module for determining a task target based at least on characteristics of a search unit in case of determination of the search area and the search unit at sea;
the construction module is used for respectively constructing a state space and an action space according to the search area and/or the search unit, wherein the state space comprises state characteristics of each area in the search area, and the action space comprises an action set executable by the search unit in each area of the search area;
The calculation module is used for constructing a reward function, and carrying out iterative calculation based on the reward function, a task target, a state space, an action space, a first action strategy or a second action strategy, a learning rate strategy and a target algorithm so as to converge to obtain a search path;
the target algorithm is used for calculating estimated values of action values of the search unit at different moments based on a time sequence difference algorithm, wherein the action values represent expected values of return obtained by the search unit executing a certain action.
Another embodiment of the present invention also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the marine rescue search path planning method according to any one of the above embodiments when executing the program.
The method has the beneficial effects that a matching task target is determined; a state space, an action strategy and a learning rate strategy that match the actual search environment and the characteristics of the search unit are constructed; a reward function that supports the execution of a variety of different search actions is constructed; and iterative computation is performed with the target algorithm based on the determined and constructed strategies, functions and targets to plan a search path.
Drawings
Fig. 1 is a flowchart of a method for planning a search path for rescue at sea in an embodiment of the invention.
Fig. 2 is an application flowchart of a marine rescue search path planning method in an embodiment of the present invention.
FIG. 3 is a diagram illustrating interaction between a search unit and an environment in an embodiment of the present invention.
Fig. 4 is a block diagram of the state space of the present invention.
Fig. 5 is a spatial distribution diagram of the operation of the search unit according to the present invention.
Fig. 6 is a block diagram of a marine rescue search path planning apparatus according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1 and fig. 2, an embodiment of the present invention provides a method for planning a search path for rescue at sea, including:
s1: determining a task target based at least on the characteristics of the search unit in case the search area at sea and the search unit are determined;
s2: respectively constructing a state space and an action space based on the search area and/or the search unit, wherein the state space comprises state characteristics of each area in the search area, and the action space comprises an action set executable by the search unit in each area in the search area;
s3: constructing a reward function, and carrying out iterative computation based on the reward function, a task target, a state space, an action space, a first action strategy or a second action strategy, a learning rate strategy and a target algorithm to converge to obtain a search path;
The target algorithm is used for calculating estimated values of action values of the search unit at different moments based on the time sequence difference algorithm, wherein the action values represent expected values of the search unit for executing a certain action to obtain return.
The method has the beneficial effects that a matching task target is determined; a state space, an action strategy and a learning rate strategy that match the actual search environment and the characteristics of the search unit are constructed; a reward function that supports the execution of a variety of different search actions is constructed; and iterative computation is performed with the target algorithm based on the determined and constructed strategies, functions and targets to plan a search path. On this basis, the search path can be planned quickly and effectively, the whole process is not complex, the generated result is highly accurate and better matches the actual search environment, and the search and rescue efficiency is thus remarkably improved.
Further, for better understanding of the method according to the present embodiment, basic parameters involved in the method are described below:
State (State) and state space (State space): a state is the information describing the environment the agent is in, including various attributes of the environment, denoted s_t. The state space is the set of all possible states, denoted S.
Action (Action) and action space (Action space): an action is an action the agent can take in each state, denoted a_t. The action space is the set of all possible actions that can be taken in each state, denoted A.
State transition (State transition) and state transition probability (State transition probability): a state transition is the process in which the agent takes an action and thereby reaches another state. The state transition probability is the probability that an agent in state s_t, taking action a_t, reaches the next state s_{t+1}, denoted p(s_{t+1} | s_t, a_t).
Policy (Policy), which refers to the probability of the agent taking each action in a particular state, denoted π(a_t | s_t).
Reward (Reward) and reward transition probability (Reward transition probability): a reward is the environmental feedback the agent receives after being in a certain state and taking a certain action; the feedback may be positive, negative or zero and is denoted r. The reward transition probability is the probability distribution of obtaining a reward when the agent takes a certain action in a certain state, denoted p(r | s_t, a_t).
For example, in reinforcement learning, the interaction process between the agent (i.e. the search unit) and the environment is as shown in fig. 3, the agent acquires its own state according to the environment, and selects a certain action to enter a new state in the state, and the reinforcement learning model gives a reward according to the correctness of the action, so that the agent selects the action with the largest reward.
Trajectory (trace), total reward (Return) and round (episode or trial): a trajectory is a state-action-reward chain, which can be written as s_0, a_0, r_1, s_1, a_1, r_2, s_2, ...
The total reward is the sum of the rewards obtained along the whole trajectory, i.e. return = r_1 + r_2 + r_3 + .... A round is the finite trajectory obtained when the agent reaches a termination state, i.e. one experiment or round. That is, the process by which the search unit reaches a staged termination state along a finite trajectory is one round.
Instant reward (Immediate reward), delayed reward (Delay reward) and discount factor (Discount factor): an instant reward is the reward obtained immediately after the search unit takes an action from a state; it can guide the search unit to take the optimal action in the current state so as to maximize the cumulative reward. The delayed reward is the sum of the rewards obtained from the initial state to the current state; it avoids overfitting to the current state so as to better adapt to the future environment. However, when the trajectory is very long, the delayed reward may tend to infinity and the instant reward may be ignored. To balance the near-sightedness and far-sightedness of the search unit, a discount factor γ ∈ [0, 1] is introduced. γ represents the far-sightedness of the search unit: a smaller γ focuses more on the current reward, and a larger γ focuses more on future rewards.
Let the reward sequence from time t to the final time T be r_{t+1}, r_{t+2}, r_{t+3}, ..., r_T. The discounted return (Discounted return) G_t after introducing the discount factor can be expressed as:
G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^{T-t-1} r_T
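To make the discounted return concrete, the following minimal Python sketch (not part of the original specification; the function and variable names are illustrative) computes G_t from a reward sequence and a discount factor γ by accumulating from the last reward backwards:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = r_{t+1} + gamma*r_{t+2} + ... + gamma^(T-t-1)*r_T
    for a list of rewards ordered from time t+1 to T."""
    g = 0.0
    for r in reversed(rewards):      # accumulate from the final reward backwards
        g = r + gamma * g
    return g

# Example: rewards [1, 0, 0, 5] with gamma = 0.9 give 1 + 0.729*5 = 4.645
print(discounted_return([1, 0, 0, 5], 0.9))
```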
State-value function (State value function): although the return can evaluate the quality of a policy, it cannot be applied directly to a stochastic process, because different trajectories and returns may be obtained starting from the same state. The state-value function is therefore introduced:
v_π(s) = E[G_t | S_t = s]
where v_π(s) is called the state-value function and represents the value of state s, i.e. the expected return over all possible trajectories starting from state s_t = s.
Markov property (Markov property): the Markov property can be described simply as being independent of history, i.e. the state S_{t+1} at the next moment depends only on the state S_t at the current moment and is independent of all previous states; in other words, the state transition process has no after-effect. It can be expressed as:
p(S_{t+1} | S_t, S_{t-1}, ..., S_0) = p(S_{t+1} | S_t)
markov Process (MP), which is a set of discrete random variable sequences with markov properties, can be represented by a binary set < S, P >, where S represents the state space and P represents the state transition probability.
The Markov decision process (MDP) can be represented by the five-tuple < S, A, R, P, γ >: S is the set of all states; A is the action set, corresponding to the actions each state can take; R is the reward corresponding to each state-action pair (s, a); P is the state transition probability; γ is the aforementioned discount factor.
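As a compact illustration of the five-tuple above, the sketch below bundles the MDP components into a Python container; this is only an assumed representation for exposition, not a structure defined by the patent:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    """Container for the five-tuple <S, A, R, P, gamma> described above."""
    states: List[int]                              # S: all states (e.g. grid-cell indices)
    actions: Dict[int, List[int]]                  # A: actions available in each state
    reward: Callable[[int, int], float]            # R(s, a): reward for a state-action pair
    transition: Callable[[int, int, int], float]   # P(s' | s, a): state-transition probability
    gamma: float                                   # discount factor, 0 <= gamma <= 1
```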
Bellman Equation (BE): the BE is a set of linear equations describing all state values. The relationship between G_t and G_{t+1} is: G_t = r_{t+1} + γ G_{t+1}
Accordingly, the state value can be written as:
v_π(s) = E[R_{t+1} | S_t = s] + γ E[G_{t+1} | S_t = s]
where E[R_{t+1} | S_t = s] is the expected value of the instant reward and γ E[G_{t+1} | S_t = s] is the expected value of the delayed reward. Expanding both terms with the law of total expectation yields the Bellman equation:
v_π(s) = Σ_a π(a|s) [ Σ_r p(r|s, a) r + γ Σ_{s'} p(s'|s, a) v_π(s') ]
The problem of solving the state values from the Bellman equation is usually referred to as policy evaluation.
Action value (Action value): the action value represents the expected return obtained by executing a certain action and can be used to solve for the optimal policy. Because state values are difficult to compute, the action value is often more important than the state value during model training. The action value is defined as:
q_π(s, a) = E[G_t | S_t = s, A_t = a]
Note that the action value is defined for the state-action pair (s, a) rather than for an action alone. The relationship between the action value and the state value can be expressed as:
v_π(s) = Σ_a π(a|s) q_π(s, a)
the bellman best equation BOE can find the best strategy that satisfies a particular condition without enumerating all possible strategies. The method greatly reduces the calculated amount and shortens the time for solving the optimal strategy, is a powerful tool for solving the optimal strategy, and can be expressed by the following formula:
Further, in the process of searching for an offshore moving target, the main goal is to search the high-probability area first, while also considering autonomous obstacle avoidance and the reduction of repeated paths. Task targets can be classified according to the characteristics and constraints of different search units. When determining the corresponding task target for different search conditions, for example, determining the task target based at least on the characteristics of the search unit includes:
s4: the method comprises the steps of determining task targets based on the type and the volume of a search unit or by combining the boundary shape and the environment information of a search area, wherein the type of the search unit comprises ships and aircrafts, the task targets comprise minimum time required for covering the search area and maximum accumulated search success rate under the duration constraint condition, and different task targets have different constraint conditions, and the constraint conditions are related to the duration mileage and/or the coverage rate of the search area and/or the turning time.
Specifically, the target of minimizing the time required to fully cover the search area is generally applicable to search units such as commercial ships or large ships, which have sufficient endurance to support full coverage of the search area. However, most such hulls are long and have limited maneuverability, resulting in long turning times. The sub-targets should therefore also include fully covering the search area and reducing turning time.
The target of maximizing the cumulative search success rate under the endurance constraint is better suited to search units such as small boats, unmanned aerial vehicles, unmanned surface vessels and helicopters, which have more flexible steering performance but mostly limited endurance mileage and can hardly cover the search area fully. The sub-targets should therefore also include an endurance constraint. It should be noted that whichever task target is adopted, finding the optimal path requires balancing the different constraints and sub-targets so that the search unit can search conveniently, shortening the search time as much as possible and improving the search efficiency.
In some embodiments, building a state space, an action space, based on the search region and/or the search unit, respectively, includes:
s5: gridding the search area;
s6: marking each grid in the search area with a corresponding state based on the environment information of the search area, the type of the search unit and the search states of different areas in the search area, wherein the type of the search unit comprises a ship and an aircraft, and the search states are in real-time update states based on actual search conditions;
s7: calculating the search success probability corresponding to each grid in the search area;
S8: and fusing the search areas after the gridding processing based on the search success probability and the marking information to obtain a state space.
For example, after the search area is gridded, a one-dimensional array or a two-dimensional matrix can be used for storage according to the number of grids, and each grid has the following four optional states:
Forbidden state: the search unit includes ships and aircraft, and the search area may contain submerged reefs, islands, restricted navigation or no-entry areas, areas with a severe working environment, passing ships, etc.; such obstacle areas are therefore marked as forbidden. Represented by code -2;
Futile state: since the convex hull and the minimum covering rectangle (i.e. the current search area) are obtained from the scattered points (the objects to be searched) when determining the optimal search area, the search area is inevitably enlarged, so some grid cells may contain no scatter points or only a very small number of them; their search value is low and searching them would be futile. Represented by code -1;
unsearched state: the search unit has not completed the search of the grid unit, indicated by code 0;
searched state: after the search unit completes the grid cell search, its state is marked as searched, denoted by code 1.
Based on the above states, the corresponding coding information is added to each grid. In addition, the POS probability (search success probability) corresponding to each grid (cell) can be calculated and embodied in the grid (different gray levels of the grids indicate different POS probabilities). Each grid can thus be understood as having attributes in two dimensions, namely the POS probability and the corresponding state. Fusing the two attributes yields the state space described in this embodiment, whose specific structure is shown in fig. 4.
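The following Python sketch illustrates one possible way to fuse the two per-grid attributes (state code and POS probability) into the state space described above; the numeric codes follow the text, while the function names, array layout and masks are assumptions made for illustration:

```python
import numpy as np

# State codes from the description above.
FORBIDDEN, FUTILE, UNSEARCHED, SEARCHED = -2, -1, 0, 1

def build_state_space(pos_prob, obstacle_mask, futile_mask):
    """Fuse the per-grid POS probability with the per-grid state code.

    pos_prob:      (rows, cols) array of search-success probabilities per grid cell
    obstacle_mask: boolean array, True for reefs, islands, restricted areas, etc.
    futile_mask:   boolean array, True for cells containing (almost) no scatter points
    """
    state = np.full(pos_prob.shape, UNSEARCHED, dtype=int)
    state[futile_mask] = FUTILE
    state[obstacle_mask] = FORBIDDEN          # forbidden overrides futile
    return {"pos": np.asarray(pos_prob, dtype=float), "state": state}

def mark_searched(state_space, cell):
    """Update the search state in real time once a grid cell has been covered."""
    r, c = cell
    if state_space["state"][r, c] == UNSEARCHED:
        state_space["state"][r, c] = SEARCHED
```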
In other embodiments, the state space, the action space are respectively constructed based on the search area and/or the search unit, further comprising:
determining a steering capability of the search unit based on the type and the volume of the search unit;
determining an action space of the search unit based on the steering capability;
when the state space meets the target requirement, the action space is constructed as a matrix corresponding to the state space, the parameters corresponding to each grid in the matrix being the actions the search unit can take in the state of the search area corresponding to that grid, wherein the actions that can be taken are those that keep the search unit within the search area and always in a non-forbidden state.
For example, in order to improve the solving efficiency of the model and reduce the adverse effect of irregular motion on search safety, this embodiment decomposes the motion of the search unit according to its steering capability and abstracts the action space into two forms: four directions (east, west, south, north) and eight directions (east, west, south, north, southeast, northeast, southwest, northwest), as shown in fig. 5. In addition, when the state space is small (i.e. the number of grid cells is small), this embodiment also establishes an action space matrix whose entries correspond one-to-one to the grid cells of the state space and enumerates the actions the search unit can take in each state. When the search unit, being in a certain state, takes an action that would make it enter a grid area in the forbidden state or cross the boundary of the given search area, the action is regarded as an invalid search and is removed by the model from the action space corresponding to that state. Constraining the action space in this way significantly improves the convergence speed of the model and prevents the search unit from wasting computing resources on excessive unnecessary exploration while autonomously avoiding obstacles.
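A minimal sketch of the action-space constraint just described is given below: it enumerates the eight-direction action set and filters out actions that would leave the search area or enter a forbidden cell. The grid representation and function names are assumptions for illustration only:

```python
import numpy as np

FORBIDDEN = -2   # state code for forbidden cells, as in the state-space sketch above

# Eight-direction action set; a four-direction search unit keeps only the first four entries.
ACTIONS_8 = {
    "E": (0, 1), "W": (0, -1), "S": (1, 0), "N": (-1, 0),
    "SE": (1, 1), "NE": (-1, 1), "SW": (1, -1), "NW": (-1, -1),
}

def valid_actions(state_grid, cell):
    """Return the actions the search unit may take from `cell`: the successor cell must
    stay inside the search area and must not be in the forbidden state."""
    rows, cols = state_grid.shape
    r, c = cell
    allowed = []
    for name, (dr, dc) in ACTIONS_8.items():
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and state_grid[nr, nc] != FORBIDDEN:
            allowed.append(name)
    return allowed
```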
Further, during training, if a reward is given only after the task goal is achieved, a sparse-reward problem arises. In the initial stage of training it is difficult for the search unit to obtain rewards, which causes it to wander aimlessly in the search area and makes the learning process very slow (a learning "plateau"). To solve this problem, this embodiment constructs a reward function, which specifically includes:
s9: constructing an instant rewarding function based on a mechanism of obtaining exploration rewards when a searching unit searches a grid area in an unsearched state, obtaining turn-back penalties when searches a grid area in a searched state and obtaining futile penalties when searches a grid area in a futile state, wherein the grid area in the futile state represents no search targets or a very small number of search targets in the corresponding search area;
s10: based on different task targets, the search success rate POS is accumulated by combining weights cum Historical maximum weighted cumulative search success rate POS cum_max And historical minimum turn times T min Respectively design different round rewarding functions R target
S11: when the task objective is to minimize the time required for covering the search area, if the weighted cumulative search success exceeds the historical peak value, and the search unit is turned The number of turns is below the historical valley, then: r is R target ,if POS cum >POS cum_max and T<T min else 0;
S12: when the task target is that the accumulated searching success rate under the endurance constraint condition is maximized, if the searching unit reaches the maximum endurance mileage and the weighted accumulated searching success rate exceeds the historical peak value, the method comprises the following steps:
R target ,if POS cum >POS cum_max and Step=N max else 0
wherein the search unit reaches a stepwise end state based on a limited track in one turn, T represents the number of turns of the current turn search unit, step represents the number of search steps of the current turn search unit,N max for the maximum number of search steps in a round of the search unit, POS n Step for cumulative search success rate of search unit into next grid area n For the number of steps of the search unit entering the current state in the current round, the corresponding round rewards are continuously attenuated as the number of steps increases.
For example, to avoid the search unit wandering aimlessly because it fails to obtain rewards in time, this embodiment constructs an instant reward: when the search unit searches a grid cell in the unsearched state, it obtains an exploration reward R_explore; when it searches a grid cell in the searched state, it receives a turn-back penalty R_repeat; when it searches a grid cell in the futile state, it receives a futile-search penalty R_vain. The instant reward is therefore R_explore, R_repeat or R_vain depending on the state of the grid cell entered.
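A sketch of this instant-reward mechanism is shown below. The triggering conditions follow the text; the numeric reward magnitudes are illustrative assumptions, since the specification does not fix them:

```python
# Illustrative magnitudes; the text only fixes the mechanism, not the values.
R_EXPLORE, R_REPEAT, R_VAIN = 1.0, -0.5, -0.5

UNSEARCHED, SEARCHED, FUTILE = 0, 1, -1   # state codes of the gridded state space

def instant_reward(cell_state):
    """Immediate reward obtained when the search unit enters a grid cell."""
    if cell_state == UNSEARCHED:
        return R_EXPLORE      # exploration reward
    if cell_state == SEARCHED:
        return R_REPEAT       # turn-back (revisit) penalty
    if cell_state == FUTILE:
        return R_VAIN         # futile-search penalty
    raise ValueError("forbidden cells are already excluded by the action space")
```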
Meanwhile, when designing the round reward R_target, the sub-target of preferentially searching the high-probability area (the area containing a large number of targets to be searched) conflicts with the sub-target of reducing turning time: emphasizing the high-probability area too much increases the time spent turning, while simply minimizing turning time degenerates into the traditional parallel-line search pattern, and the goal of preferentially searching high-probability regions cannot be achieved. A weighted cumulative search success rate POS_cum is therefore introduced to resolve the sub-target conflict: POS_cum accumulates the search success rate POS_n of each newly entered grid cell with a weight that decays as the step count increases, where N_max is the maximum number of search steps in one round, POS_n is the search success rate accumulated when the search unit enters the next grid cell, and Step_n is the number of steps taken by the search unit to enter the current state in that round. As the number of steps increases, the corresponding contribution decays.
In addition, the historical maximum weighted cumulative search success rate POS_cum_max and the historical minimum number of turns T_min are introduced as two variables, and two round reward functions are designed according to the different task targets. These reward functions grant a reward only when the corresponding constraints are satisfied and the target threshold can be further improved. The round reward functions are as follows:
Under the MTC target (minimizing the time required to fully cover the search area), the reward is granted when POS_cum exceeds the historical peak and the number of turns is below the historical valley:
R_target if POS_cum > POS_cum_max and T < T_min, else 0
where T is the number of turns of the current round.
Under the MASP target (maximizing the cumulative search success rate under the endurance constraint), the reward is granted when the maximum endurance mileage is reached and POS_cum exceeds the historical peak:
R_target if POS_cum > POS_cum_max and Step = N_max, else 0
where Step is the number of search steps of the current round.
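The two round-reward conditions can be sketched as follows; only the triggering logic is taken from the text, while the magnitude of R_target and the function signature are illustrative assumptions:

```python
def round_reward(task, r_target, pos_cum, pos_cum_max, turns, turns_min, step, n_max):
    """Round (episode) reward under the two task targets described above.

    task: "MTC"  -> minimise the time required to fully cover the search area
          "MASP" -> maximise cumulative search success under the endurance constraint
    """
    if task == "MTC" and pos_cum > pos_cum_max and turns < turns_min:
        return r_target
    if task == "MASP" and pos_cum > pos_cum_max and step == n_max:
        return r_target
    return 0.0
```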
Preferably, in order for the search unit to find a balance between exploration and exploitation, i.e. to select search actions that already work well while still exploring potentially better actions, and to prevent the model from failing to converge stably due to excessive exploration or falling into a local optimum due to excessive exploitation, two action strategies are provided for the model to choose from, and a learning rate strategy is also provided so that the model can learn effectively. Setting two action strategies helps the search unit avoid getting stuck in a locally optimal solution by reasonably weighing exploration against exploitation, and it can also accelerate the convergence of the model. Specifically, the strategies need to be compared when selecting an action strategy:
the first action strategy in this embodiment is epsilon-greedy strategy, which is an approximate greedy strategy, the core idea is based on probability selection actions, the search unit randomly selects the exploration action with a certain probability epsilon, but in most cases, the utilization action with the highest value currently is selected with a probability of 1-epsilon, so that the situation that the local optimal solution is trapped as the complete greedy strategy can be avoided. Epsilon is also known as a greedy coefficient, the smaller its value, the more the policy tends to utilize. The strategy is simple and easy to use, but when epsilon is too small, local optimum can be trapped, and when epsilon is too large, the optimum solution can be easily skipped, and the greedy coefficient needs to be adjusted in combination with the actual situation in training so as to achieve the best effect.
The second action strategy is the SoftMax strategy, which replaces direct action selection with weighted probabilities: the higher the weight assigned to a high-value action, the higher the probability that it is taken, and vice versa. The flatness of the probability distribution is controlled by the temperature parameter τ: the smaller τ is, the closer the probability of taking the optimal action is to 1; the larger τ is, the more evenly the probability is spread over the actions. It can be expressed as:
π(a_i | s) = exp(q(s, a_i) / τ) / Σ_{j=1}^{N_act} exp(q(s, a_j) / τ)
where N_act is the number of actions that can be taken. This strategy is more complex, but it is more balanced and flexible; the degree of exploration can be controlled through τ, which makes it suitable for different environments and problems.
As training proceeds and knowledge of the environment increases, the model should be made to explore less and exploit more. For both of the above action selection strategies this means gradually decreasing the ε and τ values over time so that the strategy leans toward exploitation. Common decay strategies include linear, natural-exponential, cosine and sine decay. Taking ε decay as an example, each strategy decays ε from its initial value ε_0 as a function of the current round number i and the total number of rounds N: linear decay proceeds at a constant rate, exponential decay is fast at first and then slow, cosine decay is slow at first and then fast, and sine decay is also fast at first and then slow but with a gentler trend. In this embodiment it is preferred to gradually decrease the ε and τ values with a sinusoidal decay strategy.
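Since the decay formulas themselves are given as a figure in the original, the sketch below uses common functional forms that merely match the qualitative behaviour stated above (linear: constant rate; exponential and sine: fast then slow; cosine: slow then fast); the exact formulas of the embodiment may differ:

```python
import math

def decayed_value(eps0, i, n_rounds, scheme="sine"):
    """Decay the greedy coefficient epsilon (or the temperature tau) from eps0 over n_rounds
    rounds, where i is the current round number. Functional forms are illustrative."""
    frac = i / n_rounds
    if scheme == "linear":
        return eps0 * (1.0 - frac)
    if scheme == "exponential":
        return eps0 * math.exp(-5.0 * frac)                    # 5.0 is an arbitrary decay constant
    if scheme == "cosine":
        return eps0 * math.cos(0.5 * math.pi * frac)           # slow at first, then fast
    if scheme == "sine":
        return eps0 * (1.0 - math.sin(0.5 * math.pi * frac))   # fast at first, then slow
    raise ValueError(f"unknown decay scheme: {scheme}")
```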
For the learning rate strategy: the learning rate α is the scale factor used when updating the action value, i.e. the step size with which the optimal strategy is sought. Optimizing α not only improves training efficiency but also determines whether the model converges and stably outputs the optimal strategy. The environments and tasks in reinforcement learning are often complex and dynamic, and using the same learning rate across different rounds may make it difficult for the model to converge stably. Adaptively adjusting the learning rate can jump out of the current local optimum and improve convergence speed and performance. In this embodiment, the convergence speed and stability of the model are optimized based on an adaptive learning rate (ALR) strategy: the learning rate is updated from the historical round data at the end of each round; if the average reward over the historical rounds increases, the learning rate is increased, otherwise it is decreased. This makes the model focus more on exploration in the early stage of training and gradually lean toward exploiting the accumulated experience as training progresses.
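A possible sketch of the ALR update described above is given below; the comparison window and scaling factors are illustrative assumptions, since the text only states the direction of the adjustment:

```python
def update_learning_rate(alpha, round_rewards, window=10, up=1.05, down=0.9):
    """End-of-round ALR update: raise alpha if the recent average reward is improving,
    otherwise lower it. `round_rewards` is the list of total rewards of past rounds."""
    if len(round_rewards) < 2 * window:
        return alpha                              # not enough history yet
    recent = sum(round_rewards[-window:]) / window
    earlier = sum(round_rewards[-2 * window:-window]) / window
    return alpha * up if recent > earlier else alpha * down
```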
Further, as described above, the target algorithm in this embodiment is implemented based on temporal-difference (TD) learning, specifically on the Q-Learning algorithm of the temporal-difference family. The basic form of the TD algorithm is first given:
v_{t+1}(s_t) = v_t(s_t) - α_t(s_t) [ v_t(s_t) - ( r_{t+1} + γ v_t(s_{t+1}) ) ]
where v_t(s_t) denotes the estimate of the state value v_π(s_t), and α_t(s_t) is the learning rate of s_t at time t.
Performing iterative computation based on the reward function, the task target, the state space, the action space, the first action strategy or the second action strategy, the learning rate strategy and the target algorithm to converge to obtain a search path, wherein the method comprises the following steps:
s3: constructing a model for selecting a search action whose action value meets the requirement based on a target algorithm, the model comprising:
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - α_t(s_t, a_t) [ q_t(s_t, a_t) - ( r_{t+1} + γ max_a q_t(s_{t+1}, a) ) ]
wherein (s_t, a_t) is the state-action pair at time t, q_t(s_t, a_t) is the estimated action value of (s_t, a_t), α_t(s_t, a_t) is the learning rate of (s_t, a_t) at time t, γ is the discount factor, and r_{t+1} is the reward value at time t+1;
s4: updating a policy based on the selected search action, the policy characterizing the probability that the search unit takes each search action in a different state;
s5: comparing the historical optimal strategy, and if the current updated strategy is better than the historical optimal strategy, adopting a searching action corresponding to the current strategy;
s6: under the search area, the search actions selected based on the policy comparison form a search path.
For example, action value is updated based on the form of the Q-Learning algorithm:
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - α_t(s_t, a_t) [ q_t(s_t, a_t) - ( r_{t+1} + γ max_a q_t(s_{t+1}, a) ) ]
where q_t(s_t, a_t) denotes the estimate of the action value of (s_t, a_t), and α_t(s_t, a_t) is the learning rate of (s_t, a_t) at time t.
After continuous iterative computation the action values gradually converge, and the optimal strategy is obtained by taking, in each state, the action with the highest q_t(s_t, a_t), i.e. π*(a | s) = 1 if a = argmax_{a'} q(s, a'), else 0.
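The tabular form of this update and the greedy policy extraction can be sketched as follows; `Q` is assumed to be a NumPy table indexed by state and action, which is an implementation choice rather than something fixed by the specification:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One Q-Learning step in the form used above:
    q <- q - alpha * (q - (r + gamma * max_a' q(s', a')))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] = Q[s, a] - alpha * (Q[s, a] - td_target)
    return Q

def greedy_policy(Q):
    """Extract the policy that takes the highest-value action in each state."""
    return np.argmax(Q, axis=1)
```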
for example, update policy:
epsilon or tau is attenuated according to a set attenuation strategy, such as a sinusoidal attenuation strategy.
Under the MTC goal (minimizing the time required to fully cover the search area), the inner-loop termination logic ends the loop when no cell in the search area remains in the unsearched state. Under the MASP goal (maximizing the cumulative search success rate under the endurance constraint), the inner-loop termination logic ends the loop when the number of steps exceeds the endurance limit. The reward logic is that a reward is generated only if the current strategy is better than the historical optimal strategy. After the loop ends, the current optimal strategy is obtained and compared with the historical optimal strategy; if the current strategy is better than the historical one, the search actions are selected based on the current strategy, and the search path in the search area is planned from those actions.
The pseudo code of the method in this embodiment is shown below; on this basis, corresponding adjustments can be made for different task targets to adapt to actual situations. Note that the initial value of ε or τ cannot be set too large, otherwise it is difficult to find the optimal strategy.
[Pseudo code of the marine rescue search path planning method of this embodiment]
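The pseudo code above is provided as a figure; the following Python sketch only outlines a training loop consistent with the description (MTC objective, ε-greedy selection with linear decay). The environment interface `env.reset()` / `env.step()` and all parameter values are assumptions, not the embodiment's actual pseudo code:

```python
import numpy as np

def plan_search_path(env, n_rounds, n_states, n_actions,
                     alpha=0.1, gamma=0.9, eps0=0.2):
    """Outer loop over rounds, inner loop until the search area is covered (MTC objective).
    `env.step(s, a)` is assumed to return (next_state, reward, done), with done=True once
    no unsearched cells remain."""
    Q = np.zeros((n_states, n_actions))
    best_return, best_path = float("-inf"), []
    for i in range(n_rounds):
        eps = eps0 * (1.0 - i / n_rounds)                # decaying greedy coefficient
        s, path, ep_return, done = env.reset(), [], 0.0, False
        while not done:                                  # inner-loop termination logic
            a = (int(np.random.randint(n_actions)) if np.random.rand() < eps
                 else int(np.argmax(Q[s])))
            s_next, r, done = env.step(s, a)
            Q[s, a] -= alpha * (Q[s, a] - (r + gamma * np.max(Q[s_next])))
            path.append(a)
            ep_return += r
            s = s_next
        if ep_return > best_return:                      # keep the best strategy's actions
            best_return, best_path = ep_return, path
    return best_path
```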
As shown in fig. 6, another embodiment of the present invention also provides an offshore rescue search path planning apparatus 100, including:
a determination module for determining a task target based at least on characteristics of a search unit in case of determination of the search area and the search unit at sea;
the construction module is used for respectively constructing a state space and an action space according to the search area and/or the search unit, wherein the state space comprises state characteristics of each area in the search area, and the action space comprises an action set executable by the search unit in each area of the search area;
the calculation module is used for constructing a reward function, and carrying out iterative calculation based on the reward function, a task target, a state space, an action space, a first action strategy or a second action strategy, a learning rate strategy and a target algorithm so as to converge to obtain a search path;
the target algorithm is used for calculating estimated values of action values of the search unit at different moments based on a time sequence difference algorithm, wherein the action values represent expected values of return obtained by the search unit executing a certain action.
As an alternative embodiment, the determining a task target based at least on the characteristics of the search unit includes:
determining the task target based on the type and the volume of the search unit, or additionally combining the boundary shape and environment information of the search area, wherein the type of the search unit comprises ships and aircraft, the task target comprises minimizing the time required to cover the search area and maximizing the cumulative search success rate under an endurance constraint, and different task targets have different constraint conditions related to the endurance mileage and/or the coverage rate of the search area and/or the turning time.
As an optional embodiment, the constructing a state space and an action space based on the search area and/or the search unit respectively includes:
gridding the search area;
marking each grid in the search area with a corresponding state based on the environment information of the search area, the type of the search unit and the search states of different areas in the search area, wherein the type of the search unit comprises a ship and an aircraft, and the search states are in a real-time update state based on actual search conditions;
calculating the search success probability corresponding to each grid in the search area;
and fusing the search areas after the gridding processing based on the search success probability and the marking information to obtain the state space.
As an optional embodiment, the constructing a state space and an action space based on the search area and/or the search unit respectively further includes:
determining a steering capability of the search unit based on a type and a volume of the search unit;
determining an action space of the search unit based on the steering capability;
when the state space meets the target requirement, the action space is constructed as a matrix corresponding to the state space, the parameters corresponding to each grid in the matrix being the actions the search unit can take in the state of the search area corresponding to that grid, wherein the actions that can be taken are those that keep the search unit within the search area and always in a non-forbidden state.
As an alternative embodiment, the constructing the bonus function includes:
constructing an instant reward function based on a mechanism whereby the search unit obtains an exploration reward when searching a grid area in the unsearched state, a turn-back penalty when searching a grid area in the searched state, and a futile-search penalty when searching a grid area in the futile state, wherein a grid area in the futile state contains no search targets or only a very small number of search targets;
based on different task targets, designing different round reward functions R_target by combining the weighted cumulative search success rate POS_cum, the historical maximum weighted cumulative search success rate POS_cum_max and the historical minimum number of turns T_min;
when the task target is minimizing the time required to cover the search area, granting the round reward if the weighted cumulative search success rate exceeds the historical peak and the number of turns of the search unit is below the historical valley: R_target if POS_cum > POS_cum_max and T < T_min, else 0;
when the task target is maximizing the cumulative search success rate under the endurance constraint, granting the round reward if the search unit reaches the maximum endurance mileage and the weighted cumulative search success rate exceeds the historical peak: R_target if POS_cum > POS_cum_max and Step = N_max, else 0;
wherein the process of the search unit reaching a staged termination state along a finite trajectory is one round, T is the number of turns of the search unit in the current round, Step is the number of search steps of the search unit in the current round, N_max is the maximum number of search steps in one round of the search unit, POS_n is the search success rate accumulated when the search unit enters the next grid area, and Step_n is the number of steps taken by the search unit to enter the current state in the current round; as the number of steps increases, the corresponding round reward decays continuously.
As an optional embodiment, under the first action strategy the search unit randomly selects a search action with probability ε, ε being the greedy coefficient; the smaller ε is, the more the first action strategy tends to make the search unit concentrate on selecting high-value search actions;
the second action strategy is to enable the search unit to select different search actions by configuring different weights and temperature parameters tau;
the method further comprises the steps of:
as the number of iterations increases, the greedy coefficient and temperature parameter are gradually reduced based at least on a cosine decay strategy.
As an optional embodiment, the learning rate strategy comprises updating the learning rate α at the end of each round based on historical round data, increasing the learning rate α if the average reward over the historical rounds increases, and decreasing it otherwise.
As an optional embodiment, the performing iterative computation based on the reward function, the task target, the state space, the action space, the first action policy or the second action policy, the learning rate policy, and the target algorithm to converge to obtain a search path includes:
constructing a model for selecting a search action whose action value meets a requirement based on a target algorithm, the model comprising:
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - α_t(s_t, a_t) [ q_t(s_t, a_t) - ( r_{t+1} + γ max_a q_t(s_{t+1}, a) ) ]
wherein (s_t, a_t) is the state-action pair at time t, q_t(s_t, a_t) is the estimated action value of (s_t, a_t), α_t(s_t, a_t) is the learning rate of (s_t, a_t) at time t, γ is the discount factor, and r_{t+1} is the reward value at time t+1;
updating a policy based on the selected search actions, the policy characterizing a probability that the search unit takes each search action in a different state;
comparing the historical optimal strategies, and if the current updated strategy is better than the historical optimal strategy, adopting a searching action corresponding to the current strategy;
under the search area, the search actions selected based on policy comparison form the search path.
Another embodiment of the present invention also provides an electronic device, including:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to implement an offshore rescue search path planning method as described in any one of the embodiments above.
Further, an embodiment of the present invention also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements an offshore rescue search path planning method as described above. It should be understood that each solution in this embodiment has a corresponding technical effect in the foregoing method embodiment, which is not described herein.
Further, embodiments of the present invention also provide a computer program product tangibly stored on a computer-readable medium and comprising computer-readable instructions that, when executed, cause at least one processor to perform an offshore rescue search path planning method, such as in the embodiments described above.
The computer storage medium of the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable medium can be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage media element, a magnetic storage media element, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, antenna, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Additionally, it should be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a system for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
The above embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, the scope of which is defined by the claims. Various modifications and equivalent arrangements of this invention will occur to those skilled in the art, and are intended to be within the spirit and scope of the invention.

Claims (10)

1. A marine rescue search path planning method, characterized by comprising the following steps:
determining a task target based at least on features of a search unit, in the case where a search area at sea and the search unit are determined;
respectively constructing a state space and an action space based on the search area and/or the search unit, wherein the state space comprises state characteristics of each area in the search area, and the action space comprises an action set executable by the search unit in each area of the search area;
constructing a reward function, and carrying out iterative computation based on the reward function, a task target, a state space, an action space, a first action strategy or a second action strategy, a learning rate strategy and a target algorithm to converge and obtain a search path;
the target algorithm is used for calculating estimated values of the action values of the search unit at different moments based on a temporal difference algorithm, wherein an action value represents the expected return obtained by the search unit from executing a certain action.
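By way of illustration of the action value referenced in claim 1: the action value q(s, a) is the expected discounted return obtained after the search unit executes action a in state s. The following Python sketch (the discount factor and reward sequence are assumed example values, not values fixed by the claims) shows how one sampled return underlying that expectation is accumulated:

```python
# Minimal sketch (hypothetical values): the action value q(s, a) is the
# expectation of the discounted return G = r1 + gamma*r2 + gamma^2*r3 + ...
# observed after executing action a in state s.
def discounted_return(rewards, gamma=0.9):
    """Accumulate a sampled discounted return from a reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards collected along one sampled search trajectory.
sample_rewards = [0.0, 1.0, 0.0, 0.5]
print(discounted_return(sample_rewards))  # one sample of the return whose mean is q(s, a)
```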
2. The marine rescue search path planning method according to claim 1, characterized in that the determining a task target based at least on the features of the search unit comprises:
determining the task target based on the type and volume of the search unit, or in simultaneous combination with the boundary shape and environment information of the search area, wherein the type of the search unit comprises ships and aircraft; the task target comprises minimizing the time required to cover the search area and maximizing the cumulative search success rate under an endurance constraint condition; and different task targets have different constraint conditions, the constraint conditions relating to the endurance mileage and/or the coverage rate of the search area and/or the time consumed by turning.
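As a non-authoritative sketch of how the task targets and constraint conditions of claim 2 could be represented, assuming hypothetical field names (objective, endurance_mileage, min_coverage_ratio, turn_time_cost) and example numeric values:

```python
from dataclasses import dataclass

@dataclass
class TaskTarget:
    """Illustrative container for a task target and its constraint conditions."""
    objective: str                 # "min_cover_time" or "max_cum_success"
    endurance_mileage: float       # maximum range of the search unit
    min_coverage_ratio: float      # required coverage of the search area
    turn_time_cost: float          # time consumed per turning maneuver

def build_task_target(unit_type: str, unit_volume: float) -> TaskTarget:
    # Hypothetical rule: aircraft favour fast coverage, ships favour cumulative
    # search success under an endurance constraint; unit_volume could further
    # refine the constraints but is ignored in this toy rule.
    if unit_type == "aircraft":
        return TaskTarget("min_cover_time", endurance_mileage=800.0,
                          min_coverage_ratio=0.95, turn_time_cost=0.5)
    return TaskTarget("max_cum_success", endurance_mileage=300.0,
                      min_coverage_ratio=0.8, turn_time_cost=2.0)
```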
3. The marine rescue search path planning method according to claim 2, wherein the constructing a state space and an action space based on the search area and/or the search unit respectively includes:
gridding the search area;
marking each grid in the search area with a corresponding state based on the environment information of the search area, the type of the search unit and the search states of different areas in the search area, wherein the type of the search unit comprises a ship and an aircraft, and the search states are updated in real time based on actual search conditions;
calculating the search success probability corresponding to each grid in the search area; and
fusing the gridded search area based on the search success probability and the marking information to obtain the state space.
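A minimal sketch of the state-space construction in claim 3, under the assumption that the gridded area is a rectangular array, that the per-cell search success probability is modelled as the product of a containment probability and a detection probability, and that "fusion" means stacking the marking layer with the probability layer; all function and variable names are illustrative:

```python
import numpy as np

UNSEARCHED, SEARCHED, FORBIDDEN = 0, 1, 2  # illustrative state codes

def build_state_space(rows, cols, containment_prob, detection_prob, forbidden_mask):
    """Fuse per-cell markings and search success probabilities into one state array.

    containment_prob: probability the target is in each cell (rows x cols)
    detection_prob:   probability of detecting the target if present (rows x cols)
    forbidden_mask:   boolean array marking cells the unit may not enter
    """
    marks = np.full((rows, cols), UNSEARCHED, dtype=np.int8)
    marks[forbidden_mask] = FORBIDDEN
    # Assumed model: per-cell search success probability = containment x detection.
    success_prob = containment_prob * detection_prob
    # Fusion: stack the discrete marking layer with the continuous probability layer.
    return np.stack([marks.astype(float), success_prob], axis=-1)

# Example on a 4 x 5 grid with uniform probabilities and no forbidden cells.
rows, cols = 4, 5
state_space = build_state_space(
    rows, cols,
    containment_prob=np.full((rows, cols), 1.0 / (rows * cols)),
    detection_prob=np.full((rows, cols), 0.8),
    forbidden_mask=np.zeros((rows, cols), dtype=bool),
)
print(state_space.shape)  # (4, 5, 2)
```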
4. The marine rescue search path planning method according to claim 3, wherein the constructing a state space and an action space based on the search area and/or the search unit respectively further comprises:
determining a steering capability of the search unit based on a type and a volume of the search unit;
determining an action space of the search unit based on the steering capability;
when the state space meets the target requirement, constructing the action space as a matrix corresponding to the state space, wherein the parameter corresponding to each grid in the matrix is the set of actions that the search unit can take in the search area corresponding to that grid and in its current state, and the actions that can be taken comprise actions by which the search unit remains within the search area and is always in a non-forbidden state.
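A minimal sketch of the action-space matrix of claim 4, assuming for illustration that a ship's steering capability maps to 4-connected moves and an aircraft's to 8-connected moves, and that admissible actions are those keeping the unit inside the area and out of forbidden cells:

```python
import numpy as np

# Illustrative action sets: 4-connected moves for a ship, 8-connected for an
# aircraft, reflecting an assumed difference in steering capability.
MOVES_SHIP = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
MOVES_AIR = dict(MOVES_SHIP, NE=(-1, 1), NW=(-1, -1), SE=(1, 1), SW=(1, -1))

def build_action_space(rows, cols, forbidden_mask, unit_type="ship"):
    """Return a (rows x cols) matrix whose entries list the admissible actions per cell."""
    moves = MOVES_AIR if unit_type == "aircraft" else MOVES_SHIP
    actions = np.empty((rows, cols), dtype=object)
    for r in range(rows):
        for c in range(cols):
            admissible = []
            for name, (dr, dc) in moves.items():
                nr, nc = r + dr, c + dc
                # Keep only actions that stay inside the area and avoid forbidden cells.
                if 0 <= nr < rows and 0 <= nc < cols and not forbidden_mask[nr, nc]:
                    admissible.append(name)
            actions[r, c] = admissible
    return actions

grid_actions = build_action_space(4, 5, np.zeros((4, 5), dtype=bool), unit_type="aircraft")
print(grid_actions[0, 0])  # corner cell: only inward moves remain
```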
5. The marine rescue search path planning method according to claim 3, wherein the constructing a reward function comprises:
constructing an instant reward function based on a mechanism whereby the search unit obtains an exploration reward when searching a grid area in the unsearched state, obtains a turn-back penalty when searching a grid area in the searched state, and obtains an empty-area penalty when searching a grid area in the empty state, wherein a grid area in the empty state represents a corresponding search area containing no search targets or only a very small number of search targets;
based on different task targets, designing different round reward functions R_target by combining the weighted cumulative search success rate POS_cum, the historical maximum weighted cumulative search success rate POS_cum_max and the historical minimum number of turns T_min:
when the task target is minimizing the time required to cover the search area, the round reward is R_target if the weighted cumulative search success rate exceeds its historical peak and the number of turns of the search unit is lower than its historical valley, namely:
R_target, if POS_cum > POS_cum_max and T < T_min, else 0;
when the task target is maximizing the cumulative search success rate under the endurance constraint condition, the round reward is R_target if the search unit reaches the maximum endurance mileage and the weighted cumulative search success rate exceeds its historical peak, namely:
R_target, if POS_cum > POS_cum_max and Step = N_max, else 0;
wherein the search unit reaches a stepwise end state based on a limited trajectory; T represents the number of turns of the search unit in the current round; Step represents the number of search steps of the search unit in the current round; N_max is the maximum number of search steps of the search unit in one round; POS_n is the cumulative search success rate when the search unit enters the next grid area; Step_n is the number of steps taken by the search unit to reach the current state in the current round; and the corresponding round reward decays continuously as the number of steps increases.
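A minimal sketch of the two-layer reward design of claim 5: an instant reward keyed to the state of the visited grid cell, plus a round reward R_target granted only when the round improves on the historical records for the chosen task target. The numeric reward and penalty values are assumptions for illustration:

```python
UNSEARCHED, SEARCHED, EMPTY = 0, 1, 2  # illustrative cell states

def instant_reward(cell_state):
    """Exploration reward, turn-back penalty, or empty-area penalty."""
    return {UNSEARCHED: 1.0, SEARCHED: -0.5, EMPTY: -0.2}[cell_state]

def round_reward(objective, pos_cum, pos_cum_max, turns, t_min, step, n_max,
                 r_target=10.0):
    """Grant R_target only when the round beats the historical records."""
    if objective == "min_cover_time":
        return r_target if pos_cum > pos_cum_max and turns < t_min else 0.0
    if objective == "max_cum_success":
        return r_target if pos_cum > pos_cum_max and step == n_max else 0.0
    return 0.0

# Example: a round that beats the historical cumulative success rate with fewer turns.
print(round_reward("min_cover_time", pos_cum=0.62, pos_cum_max=0.55,
                   turns=7, t_min=9, step=40, n_max=50))  # -> 10.0
```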
6. The marine rescue search path planning method according to claim 1, wherein the first action strategy randomly selects a search action for the search unit with probability ε, ε being a greedy coefficient; the smaller ε is, the more the first action strategy tends to make the search unit concentrate on selecting high-value search actions;
the second action strategy enables the search unit to select different search actions by configuring different weights and a temperature parameter τ;
the method further comprises the steps of:
as the number of iterations increases, the greedy coefficient and temperature parameter are gradually reduced based at least on a cosine decay strategy.
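A minimal sketch of the two action strategies of claim 6 together with the cosine decay of the greedy coefficient ε and the temperature τ; the decay end-points and episode counts are assumed values:

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    """First action strategy: explore with probability epsilon, else pick the best action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_choice(q_values, tau):
    """Second action strategy: Boltzmann weighting controlled by temperature tau."""
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights, k=1)[0]

def cosine_decay(start, end, episode, total_episodes):
    """Decay a coefficient from start to end along a half cosine."""
    frac = min(episode / total_episodes, 1.0)
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * frac))

# Example: epsilon shrinks from 0.9 to 0.05 over 500 episodes.
for ep in (0, 250, 500):
    print(ep, round(cosine_decay(0.9, 0.05, ep, 500), 3))
```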
7. The marine rescue search path planning method according to claim 5, wherein the learning rate strategy comprises updating a learning rate α according to historical round data at the end of each round: if the mean reward of the historical rounds increases, the learning rate α is increased; otherwise, the learning rate α is decreased.
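A minimal sketch of the learning-rate strategy of claim 7, comparing the mean reward of recent rounds with that of the preceding rounds at the end of each round; the window size, adjustment factors and clamping bounds are assumptions not specified in the claim:

```python
def update_learning_rate(alpha, reward_history, window=10,
                         up=1.05, down=0.9, lo=1e-3, hi=0.5):
    """Raise alpha when the recent mean reward improves, lower it otherwise."""
    if len(reward_history) < 2 * window:
        return alpha  # not enough history yet
    recent = sum(reward_history[-window:]) / window
    previous = sum(reward_history[-2 * window:-window]) / window
    alpha = alpha * up if recent > previous else alpha * down
    return min(max(alpha, lo), hi)

# Example: an improving reward trend nudges the learning rate upward.
history = [0.1] * 10 + [0.3] * 10
print(update_learning_rate(0.1, history))  # -> 0.105...
```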
8. The marine rescue search path planning method according to claim 5, wherein the performing iterative computation based on the reward function, the task target, the state space, the action space, the first action strategy or the second action strategy, the learning rate strategy and the target algorithm to converge and obtain a search path comprises:
constructing, based on the target algorithm, a model for selecting a search action whose action value meets a requirement, the model comprising:
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - α_t(s_t, a_t)[q_t(s_t, a_t) - (r_{t+1} + γ max_a q_t(s_{t+1}, a))]
wherein (s_t, a_t) is the state-action pair at time t, q_t(s_t, a_t) denotes the estimated action value of (s_t, a_t), α_t(s_t, a_t) is the learning rate of (s_t, a_t) at time t, γ is the discount factor, and r_{t+1} is the reward value at time t+1;
updating a policy based on the selected search action, the policy characterizing the probability that the search unit takes each search action in different states;
comparing the updated policy with the historical optimal policy, and if the currently updated policy is better than the historical optimal policy, adopting the search action corresponding to the current policy; and
within the search area, the search actions selected through the policy comparison form the search path.
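A minimal sketch of the update model of claim 8 in code form, matching the formula above (the bracketed term bootstraps on the maximum action value of the next state); the toy environment sizes and hyperparameters are assumed for illustration:

```python
import numpy as np

def td_update(q, s, a, r, s_next, alpha, gamma=0.95):
    """q_{t+1}(s,a) = q_t(s,a) - alpha * [q_t(s,a) - (r + gamma * max_a' q_t(s',a'))]."""
    target = r + gamma * np.max(q[s_next])
    q[s, a] -= alpha * (q[s, a] - target)
    return q

def greedy_policy(q):
    """Extract the greedy policy (best action per state) for comparison with the historical best."""
    return np.argmax(q, axis=1)

# Example on a toy problem with 3 states and 2 actions.
q_table = np.zeros((3, 2))
q_table = td_update(q_table, s=0, a=1, r=1.0, s_next=2, alpha=0.1)
print(q_table[0, 1])        # 0.1, moving toward the bootstrapped target
print(greedy_policy(q_table))  # candidate policy to compare against the historical optimum
```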
9. A marine rescue search path planning apparatus, characterized by comprising:
a determination module for determining a task target based at least on features of a search unit in the case where a search area at sea and the search unit are determined;
a construction module for respectively constructing a state space and an action space based on the search area and/or the search unit, wherein the state space comprises state characteristics of each area in the search area, and the action space comprises an action set executable by the search unit in each area of the search area; and
a calculation module for constructing a reward function, and performing iterative computation based on the reward function, a task target, a state space, an action space, a first action strategy or a second action strategy, a learning rate strategy and a target algorithm so as to converge and obtain a search path;
wherein the target algorithm is used for calculating estimated values of the action values of the search unit at different moments based on a temporal difference algorithm, and an action value represents the expected return obtained by the search unit from executing a certain action.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the marine rescue search path planning method according to any one of claims 1 to 8.
CN202311060494.8A 2023-08-22 2023-08-22 Marine rescue search path planning method, device and equipment Active CN117032247B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311060494.8A CN117032247B (en) 2023-08-22 2023-08-22 Marine rescue search path planning method, device and equipment

Publications (2)

Publication Number Publication Date
CN117032247A true CN117032247A (en) 2023-11-10
CN117032247B CN117032247B (en) 2024-05-28

Family

ID=88642919

Country Status (1)

Country Link
CN (1) CN117032247B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180014344A (en) * 2016-07-29 2018-02-08 현대엠엔소프트 주식회사 Apparatus, method and system for searching route based on familiarity of local area
CN111896006A (en) * 2020-08-11 2020-11-06 燕山大学 Path planning method and system based on reinforcement learning and heuristic search
CN113505431A (en) * 2021-06-07 2021-10-15 中国人民解放军国防科技大学 ST-DQN-based target searching method, device, equipment and medium for marine unmanned aerial vehicle
CN113592162A (en) * 2021-07-22 2021-11-02 西北工业大学 Multi-agent reinforcement learning-based multi-underwater unmanned aircraft collaborative search method
CN114355915A (en) * 2021-12-27 2022-04-15 杭州电子科技大学 AGV path planning based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YINGYING GAO ET AL: "Weighted area coverage of maritime joint search and rescue based on multi-agent reinforcement learning", 2019 IEEE 3rd Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), 6 February 2020 (2020-02-06), pages 593-597 *
YANG QINGQING ET AL: "Target search path planning for naval battlefield based on deep reinforcement learning", Systems Engineering and Electronics, vol. 44, no. 11, 30 November 2022 (2022-11-30), pages 3486-3495 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant