CN115206157A - Unmanned underwater vehicle path finding training method and device and unmanned underwater vehicle


Info

Publication number
CN115206157A
CN115206157A
Authority
CN
China
Prior art keywords
execution decision
execution
underwater vehicle
unmanned underwater
path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210939126.XA
Other languages
Chinese (zh)
Inventor
Huang Anfu (黄安付)
Cao Yiding (曹一丁)
Yin Hui (尹辉)
Guo Wei (郭伟)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baiyang Times Beijing Technology Co ltd
Original Assignee
Baiyang Times Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baiyang Times Beijing Technology Co ltd filed Critical Baiyang Times Beijing Technology Co ltd
Priority to CN202210939126.XA
Publication of CN115206157A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G09: EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B: EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B9/00: Simulators for teaching or training purposes
    • G09B9/02: Simulators for teaching or training purposes for teaching control of vehicles or other craft
    • G09B9/06: Simulators for teaching or training purposes for teaching control of ships, boats, or other waterborne vehicles
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B63: SHIPS OR OTHER WATERBORNE VESSELS; RELATED EQUIPMENT
    • B63G: OFFENSIVE OR DEFENSIVE ARRANGEMENTS ON VESSELS; MINE-LAYING; MINE-SWEEPING; SUBMARINES; AIRCRAFT CARRIERS
    • B63G8/00: Underwater vessels, e.g. submarines; Equipment specially adapted therefor
    • B63G8/001: Underwater vessels adapted for special purposes, e.g. unmanned underwater vessels; Equipment specially adapted therefor, e.g. docking stations
    • B63G2008/002: Underwater vessels adapted for special purposes, e.g. unmanned underwater vessels, unmanned

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Theoretical Computer Science (AREA)
  • Mechanical Engineering (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The application provides a path finding training method and device for an unmanned underwater vehicle, and an unmanned underwater vehicle, belonging to the technical field of unmanned underwater vehicles. With this scheme, the unmanned underwater vehicle can be trained through continuous evaluation and correction of its path-finding execution decisions, giving it a degree of autonomous judgment. An unmanned underwater vehicle trained by the method provided by the application can draw on the experience accumulated during its own training to pass smoothly through unfamiliar waters, even in complex underwater environments, by avoiding or detouring around obstacles. Moreover, with the training method provided by the application, the closer the training water area is to the actual operating environment, the stronger the vehicle's capability of automatic path finding under real conditions, so that it can reach the target site smoothly and complete its assigned task.

Description

Unmanned underwater vehicle path finding training method and device and unmanned underwater vehicle
Technical Field
The application relates to the technical field of unmanned underwater vehicles, in particular to a method and a device for path finding training of an unmanned underwater vehicle and the unmanned underwater vehicle.
Background
An unmanned underwater vehicle is an uncrewed craft that navigates underwater under remote control or autonomous program control. It chiefly refers to intelligent systems that replace divers or crewed small submarines in high-risk underwater operations such as deep-sea exploration, rescue, and mine clearance, and it can play an important role in many fields.
In the military field, an unmanned underwater vehicle can serve as a decoy to mislead enemy surveillance, as a scout penetrating enemy waters to reconnoitre targets, or to perform advance reconnaissance and counter-reconnaissance of a designated target area as well as early exploration of unknown areas. Meanwhile, the American "Snakehead" unmanned underwater vehicle can even deploy torpedoes covertly underwater, further highlighting the important role such vehicles will play in future conflicts.
All of the above functions require the unmanned underwater vehicle to position itself accurately and find paths in complex underwater environments. In the prior art, the usual approach to quickly finding an optimal path, formed by connecting a sequence of line segments or path points within a planned space, is a traditional classical algorithm: the vehicle's environment is modelled accurately in advance, and the vehicle's motion is simulated within the completed model using dynamic programming, derivative-based methods, and optimal control, the results then being used to control the vehicle in the actual water area. However, as the application scenarios of unmanned underwater vehicles grow more complex, environmental parameters may be hard to obtain in deep-sea operations or simply unavailable in military scenarios, and a pre-built model cannot simulate emergencies arising in the real scene. Accurate advance modelling by the traditional classical algorithm therefore struggles to meet the needs of practical application scenarios, and the vehicle cannot accomplish path finding in such environments.
To address these problems, a novel training method for automatic path finding by an unmanned underwater vehicle is provided.
Disclosure of Invention
Based on the above problems, the application provides an unmanned underwater vehicle path-finding training method and device, and an unmanned underwater vehicle, which can realize path-finding training of the unmanned underwater vehicle so that, after training, the vehicle has the capability of automatic path finding in unfamiliar environments. The embodiments of the application specifically disclose the following technical solutions:
An unmanned underwater vehicle path-finding training method, characterized by comprising the following steps:
reading an execution decision of the unmanned underwater vehicle;
controlling the unmanned underwater vehicle to execute a path finding action according to the execution decision;
evaluating an execution decision taken in the path searching action according to the path searching action result;
modifying the score of the execution decision in the way finding action according to the evaluation of the execution decision;
selecting the execution decision according to the score of the execution decision;
repeating the step of reading the execution decision of the unmanned underwater vehicle and the subsequent steps until the unmanned underwater vehicle travels to a training end point, obtaining a travel path;
and repeating the step of reading the execution decision of the unmanned underwater vehicle and the subsequent steps and, each time a travel path is obtained, modifying the score of the execution decision according to the evaluation of the execution decisions of the latest travel path, until a preset condition is reached, completing the unmanned underwater vehicle path-finding training.
Alternatively,
the reading of the execution decision of the unmanned underwater vehicle specifically comprises the following steps:
and when the execution decision is empty and/or the score of the execution decision is lower than a threshold value, and the execution decision of the unmanned underwater vehicle cannot be read, randomly reading one other action instruction as the execution decision of the unmanned underwater vehicle.
Alternatively,
the evaluating the execution decision taken in the path searching action according to the path searching action result specifically comprises the following steps:
dividing the path-searching action result into an excitation action result, a punishment action result and a steady action result;
when the path-searching action result is an excitation action result, performing excitation evaluation on the execution decision taken in the path-searching action;
when the path-searching action result is a punishment action result, performing punishment evaluation on the execution decision taken in the path-searching action;
and when the path-searching action result is a steady action result, performing steady evaluation on the execution decision taken in the path-searching action.
Alternatively,
the modifying the score of the execution decision in the path-searching action according to the evaluation of the execution decision specifically includes:
when the evaluation of the execution decision is the excitation evaluation, increasing the score of the execution decision in the path-searching action;
when the evaluation of the execution decision is the punishment evaluation, reducing the score of the execution decision in the path-searching action;
and when the evaluation of the execution decision is the steady evaluation, keeping the score of the execution decision in the path-searching action unchanged.
Alternatively,
the preset conditions specifically include:
and modifying the score of the execution decision according to the execution decision executed by the latest driving path, wherein the variation value of the score is smaller than a threshold value, and the current execution decision enables the consumption time value of the unmanned underwater vehicle reaching the training end point to fluctuate in a fixed interval.
Alternatively,
selecting the execution decision according to the score of the execution decision specifically comprises:
and selecting an existing execution strategy item, or selecting to regenerate a new execution strategy item, or selecting an execution strategy for adjusting the content of the historical execution strategy according to the score of the execution decision.
Alternatively,
after the obtaining the travel path, the method further comprises:
and modifying the evaluation of the execution strategy and/or the execution strategy item in the current driving path according to all the execution actions in the driving path.
Alternatively,
the method further comprises the following steps:
saving any of the execution decisions and the evaluations of the execution decisions, the saved execution decisions and evaluations being applied to the adjustment of the scores of the execution decisions.
An unmanned underwater vehicle path finding training device, characterized in that the device comprises:
the execution decision reading module is used for reading an execution decision of the unmanned underwater vehicle;
the route searching action control module is used for controlling the unmanned underwater vehicle to execute a route searching action according to the execution decision;
the execution decision evaluation module is used for evaluating the execution decision taken in the path searching action according to the path searching action result;
the execution decision scoring module is used for modifying the score of the execution decision in the path searching action according to the evaluation of the execution decision;
an execution decision selection module for selecting the execution decision according to the score of the execution decision;
the single-path execution decision updating module is used for repeating the step of reading the execution decision of the unmanned underwater vehicle and the subsequent steps until the unmanned underwater vehicle travels to a training end point, obtaining a travel path;
and the multi-path execution decision updating module is used for repeating the step of reading the execution decision of the unmanned underwater vehicle and the subsequent steps and, each time a travel path is obtained, modifying the score of the execution decision according to the evaluation of the execution decisions of the latest travel path, until a preset condition is reached, completing the unmanned underwater vehicle path-finding training.
An unmanned underwater vehicle, characterized by being configured to implement the above unmanned underwater vehicle path-finding training method.
Compared with the prior art, the technical solution provided by the application gives the trained unmanned underwater vehicle the capability of automatic path finding. When working in an unfamiliar environment, it can execute tasks without the guidance of a pre-built environment model or commands over a remote communication link, freeing its use from the constraints of the traditional classical algorithms. After multi-scenario, high-intensity training by the training method provided by the application, the unmanned underwater vehicle possesses a degree of intelligent automatic path finding: even in a complex new environment, it can find an optimal path to the target site by adjusting its own motion state in response to environmental changes, drawing on the experience of repeated training.
The application also provides an unmanned underwater vehicle path-finding training device and an unmanned underwater vehicle, which have the above beneficial effects and are not described again here.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present application; for a person of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a schematic flow chart of an unmanned underwater vehicle path-finding training method;
FIG. 2 is a logic diagram illustrating selection of an execution policy at a decision node;
FIG. 3 is a logic diagram illustrating the modification of the execution decision score;
FIG. 4 is a diagram illustrating an exemplary correspondence between a location and an execution decision;
FIG. 5 is a matrix of correspondences between a location and an execution decision derived from FIG. 4;
FIG. 6 is an initial matrix corresponding to FIG. 5;
fig. 7 is a schematic diagram of a path-finding training device of an unmanned underwater vehicle.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments which can be derived from the embodiments given herein by the person skilled in the art without making any creative effort shall fall within the protection scope of the present application.
In addition, the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
In unmanned underwater vehicles in wide use today, controlling the vehicle to execute an underwater task requires either modelling the underwater environment accurately in advance and loading the completed model into the vehicle's system, so that the vehicle uses the on-board model for accurate positioning and path finding, or maintaining a real-time satellite communication link with the vehicle so that it can be controlled remotely. However, this approach requires either advance knowledge of the working environment or guaranteed normal communication in the working water area. In some applications with high timeliness requirements, such as military missions, the vehicle executes tasks in unfamiliar waters: the specific underwater terrain cannot be known in advance, and normal remote communication in the water area cannot be guaranteed. The unmanned underwater vehicle therefore needs a degree of autonomous path-finding capability when acting underwater, so that it can reach the target site safely and smoothly and then execute the corresponding underwater task.
The solution of the present application will be described in detail below with reference to the accompanying drawings and embodiments.
Fig. 1 is a schematic flow chart of an unmanned underwater vehicle path-finding training method according to an embodiment of the present application. The method is used for training the autonomous path-finding capability of an unmanned underwater vehicle; a vehicle trained by the method provided by the present application can find paths automatically in an unfamiliar environment. The unmanned underwater vehicle path-finding training method comprises the following steps:
step S101, reading an execution decision of the unmanned underwater vehicle;
When the unmanned underwater vehicle navigates in the training environment and reaches a judgment node, it must autonomously judge whether its current motion state meets the next motion requirement and whether that state needs to be switched; this process is the autonomous decision.
It should be noted that the autonomous decision is a selection process whose main purpose is to determine an available execution decision suitable for the current node. An execution decision item is the concrete carrier of an execution decision. For example, for the decision "go straight", going straight is the execution decision, while the instruction that encodes "go straight" is the execution decision item.
If the current motion state of the unmanned underwater vehicle can meet the next motion requirement, the vehicle is instructed to keep its current motion state, i.e. the vehicle makes an execution decision to keep the motion state unchanged; if the current motion state is judged unable to meet the motion requirement, the vehicle is instructed to switch its motion state, i.e. the vehicle makes an execution decision to switch the motion state. The basis for selecting an execution decision is to compare the scores of the execution decision items, preferring the execution decision with the higher score as the one selected at this judgment node.
The triggering of the autonomous decision may be determined by a preset time interval or distance interval, or by change parameters of the water-area imagery and acoustic information acquired by the unmanned underwater vehicle.
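As a concrete illustration, this trigger condition can be sketched in Python; the function name, parameter names and threshold values below are illustrative assumptions, not taken from the patent:

```python
def should_start_decision(elapsed_s, travelled_m, image_change, sound_change,
                          time_gap=5.0, dist_gap=10.0, change_thresh=0.3):
    """Trigger an autonomous decision when a preset time interval or
    distance interval has elapsed, or when the water-area image or the
    acoustic information has changed noticeably.  All thresholds are
    illustrative assumptions."""
    return (elapsed_s >= time_gap
            or travelled_m >= dist_gap
            or image_change >= change_thresh
            or sound_change >= change_thresh)
```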
For example, when the unmanned underwater vehicle is travelling straight in a water area and reaches a judgment node, it may judge that continuing straight could strike an obstacle, failing the motion requirement of safe travel; it then makes an execution decision to turn left 90 degrees, and "turn left 90 degrees" is the execution decision of the unmanned underwater vehicle.
After the execution decision of the unmanned underwater vehicle is determined, the specific meaning of the executed decision is read.
As a specific example, as shown in FIG. 2, judgment node S₁ has two decision items a₁ and a₂, where a₁ has a Q-value score of −2 and a₂ has a Q-value score of 1. Then a₂ is preferred as the execution decision selected at this judgment node, and the execution decision represented by a₂ is executed. Likewise, on arriving at judgment node S₂, the execution decision with the higher Q-value score is selected and executed. This selection process is repeated, selecting and executing a decision at each judgment node in turn.
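The greedy selection among decision items can be sketched as follows; the dictionary-based Q-table layout is an implementation assumption, with values chosen to mirror the FIG. 2 example (at node S1, decision item a1 scores −2 and a2 scores 1):

```python
# Illustrative Q-table: (node, decision item) -> Q-value score.
Q = {
    ("S1", "a1"): -2.0,
    ("S1", "a2"): 1.0,
    ("S2", "a1"): 0.5,   # assumed values for the next node
    ("S2", "a2"): 2.0,
}

def select_decision(state, actions, q_table):
    """Prefer the execution decision item with the highest Q-value score."""
    return max(actions, key=lambda a: q_table[(state, a)])

print(select_decision("S1", ["a1", "a2"], Q))  # -> a2
```

At node S1 the comparison −2 vs 1 selects a2, matching the example above; the same call at S2 again selects a2.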
In Embodiment 1 of the present application, reading the execution decision of the unmanned underwater vehicle specifically includes: when the execution decision is empty and/or the score of the execution decision is lower than a threshold value, so that the execution decision of the unmanned underwater vehicle cannot be read, randomly selecting another action instruction as the execution decision of the unmanned underwater vehicle.
The path-finding training process proposed by the application takes into account that, in the early stage of training, the unmanned underwater vehicle has little or no path-finding experience data. In that case, existing experience data cannot guide the vehicle to an execution decision suitable for the current point. An empty execution decision means that no experience data at all has been recorded for the current point; a score below the threshold means that experience data for the current point exists, but the decision it suggests scores so low that continuing to execute the historical decision is not recommended. In either case, this embodiment proposes randomly selecting one of the other action instructions to read as the execution decision of the unmanned underwater vehicle, where the other action instructions are execution decisions other than the historical ones.
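A minimal sketch of this fallback reading logic, assuming a dictionary-based Q-table and a configurable score threshold (both implementation choices not specified by the patent):

```python
import random

def read_execution_decision(state, q_table, actions, threshold=0.0, rng=None):
    """Read the best stored decision for this node; fall back to a random
    other action instruction when the decision is empty or its score is
    below the threshold (little or no experience data for this point)."""
    rng = rng or random.Random()
    scored = {a: q_table[(state, a)] for a in actions if (state, a) in q_table}
    if scored:
        best = max(scored, key=scored.get)
        if scored[best] >= threshold:
            return best
        # Historical decision scores too low: try another instruction.
        others = [a for a in actions if a != best]
        return rng.choice(others or actions)
    # Execution decision is empty: no experience data for this node at all.
    return rng.choice(actions)
```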
And S102, controlling the unmanned underwater vehicle to execute a path searching action according to the execution decision.
And after the execution decision is read, controlling the unmanned underwater vehicle to execute a path searching action according to the motion instruction represented by the execution decision and the content of the motion instruction.
And step S103, evaluating the execution decision taken in the path searching action according to the path searching action result.
After the path searching action is executed, a corresponding action result is generated according to the executed action, and the executed decision is evaluated.
It should be noted that one path-searching action result may be associated with one or more executed path-searching actions; therefore, on the basis of each path-searching action result, all execution decisions associated with that action result are evaluated.
In Embodiment 2 of the present application, the following is disclosed:
dividing the path-searching action result into an excitation action result, a punishment action result and a steady action result;
when the path-searching action result is an excitation action result, performing excitation evaluation on the execution decision taken in the path-searching action;
when the path-searching action result is a punishment action result, performing punishment evaluation on the execution decision taken in the path-searching action;
and when the path-searching action result is a steady action result, performing steady evaluation on the execution decision taken in the path-searching action.
In this embodiment, action results are divided into three categories. An excitation action result is one that benefits the automatic path finding of the unmanned underwater vehicle, for example passing a certain obstacle; a punishment action result is one that should not occur again during automatic path finding, for example the vehicle colliding with a wall; and a steady action result is one that, for the time being, cannot be judged to affect the automatic path-finding training of the unmanned underwater vehicle either way.
When an action result is triggered, it is classified as an excitation, punishment or steady action result according to the preset correspondence, and evaluated according to its category.
Dividing and evaluating action results in this way lets the unmanned underwater vehicle "know" which action results must not appear again, which are encouraged to keep appearing, and which may continue for the time being. Once the vehicle knows a specific action result, it can correspondingly adjust the process that produced it, namely the selected execution decision.
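One possible way to encode the three evaluation categories as numeric rewards is sketched below; the result labels and reward magnitudes are illustrative assumptions, not values given in the patent:

```python
def evaluate_action_result(result):
    """Map a path-searching action result to a numeric evaluation:
    positive for excitation results, negative for punishment results,
    zero for steady results.  Labels and magnitudes are assumptions."""
    rewards = {
        "passed_obstacle": 1.0,   # excitation action result
        "reached_goal": 10.0,     # excitation action result
        "hit_obstacle": -1.0,     # punishment action result
        "kept_course": 0.0,       # steady action result
    }
    return rewards.get(result, 0.0)  # unknown results treated as steady
```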
And step S104, modifying the score of the execution decision in the way searching action according to the evaluation of the execution decision.
In each autonomous decision, a selection is made among several available execution decisions. Which one is selected as the current execution decision depends on comparing the scores of the execution decisions at the current point in time: the execution decision with the highest score is selected, and the unmanned underwater vehicle is controlled to execute the path-searching action it contains.
In Embodiment 3 of the present application, a specific implementation of modifying the score according to the evaluation of the execution decision is disclosed:
when the evaluation of the execution decision is the excitation evaluation, increasing the score of the execution decision in the path-searching action;
when the evaluation of the execution decision is the punishment evaluation, reducing the score of the execution decision in the path-searching action;
when the evaluation of the execution decision is the steady evaluation, keeping the score of the execution decision in the path-searching action unchanged.
That is, when the evaluation of the execution decision is determined to be an excitation evaluation, the score of the decision at this node is increased; when it is determined to be a punishment evaluation, the score of the decision at this node is reduced; and when it is determined to be a steady evaluation, the score of the decision at this node is left unchanged for the time being.
For the above embodiment, as shown in FIG. 3, the difference target Q(s,a)_target = r + γ·max_a′ Q(s′,a′) may be used: based on the decision a₂ selected at node S₁, and on the maximum action-result value of executing that decision (a greedy choice), the Q-value score of the a₂ execution decision is modified, where r is the reward value obtained after executing a₂ and γ is the discount factor.
The specific modification is: Q(s,a)_new ← Q(s,a)_old + α[r + γ·max_a′ Q(s′,a′) − Q(s,a)_old], where the new Q value is obtained by modifying the old Q-value score and α is the learning rate.
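Assuming a tabular Q-learning implementation (an assumption consistent with, but not explicitly named by, the update formula above), the score modification can be sketched as:

```python
def q_update(q_table, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)),
    where r is the reward, gamma the discount factor, alpha the learning
    rate.  Missing table entries are treated as 0.0 (an assumption)."""
    best_next = max(q_table.get((s_next, a2), 0.0) for a2 in actions)
    old = q_table.get((s, a), 0.0)
    q_table[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q_table[(s, a)]
```

For example, with an empty table, alpha = 0.5 and gamma = 0.9, a reward of 1.0 moves the score from 0.0 to 0.5 on the first update, and a repeat of the same transition moves it to 0.75.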
Step S105, selecting an execution decision according to the score of the execution decision.
After the scores of the execution decisions have been modified over time, the modified execution decision is selected according to its score whenever a judgment node of the autonomous decision is faced; the execution decision item with the highest score is preferred.
In Embodiment 4 of the present application, selecting the execution decision according to its score mainly includes: selecting an existing execution decision item, or generating a new execution decision item, or adjusting the content of a historical execution decision, according to the score of the execution decision.
The selection logic is to select the execution decision with the highest score. The specific selection process mainly includes the following:
1. Selecting an existing execution policy item according to the read scores of the execution decisions
When a plurality of execution decisions exist, the execution decision item with the highest score is selected as the chosen execution decision.
2. Regenerating a new execution policy item
When the execution decisions are empty or too few, the unmanned underwater vehicle is randomly commanded to execute a navigation action in a certain direction, which serves as the selected execution decision.
3. Selecting an execution policy that adjusts the content of a historical execution policy
For a historical execution decision, its content is adjusted and the result is used as the selected execution decision. For example: an existing execution decision is to turn 90 degrees to the left; its content is modified to turning 45 degrees to the left.
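The three selection branches above can be sketched as follows. The function name, its arguments, and the turn-angle adjustment are illustrative assumptions, not names from the patent.

```python
import random

def select_decision(scores, actions, adjust=None):
    """Pick an execution decision following the three branches described above."""
    if not scores:                      # branch 2: no existing items, pick a random new action
        return random.choice(actions)
    best = max(scores, key=scores.get)  # branch 1: highest-scored existing item
    if adjust is not None:              # branch 3: adjust the content of a historical decision
        return adjust(best)
    return best

# branch 1: the highest-scored item wins
print(select_decision({"left90": 1.5, "right90": 0.2}, []))
# branch 3: e.g. soften a 90-degree left turn into a 45-degree left turn
print(select_decision({"left90": 1.5}, [], adjust=lambda d: d.replace("90", "45")))
```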
Step S106: repeatedly execute the step of reading the execution decision of the unmanned underwater vehicle and the subsequent steps until the unmanned underwater vehicle travels to the training end point, and acquire the travel path.
During training, steps S101 to S105 are executed in a loop; in the course of executing these steps, the direction and actions of the unmanned underwater vehicle are switched by continually adjusting the execution decision, so that the unmanned underwater vehicle finally reaches the training end point.
After the training end point is reached, the path-finding training for this short cycle is finished, and the travel path of the unmanned underwater vehicle in this short cycle is acquired.
In embodiment 5 of the present application, it is proposed to modify the scores of the execution policy and/or the execution policy items in the current travel path according to the execution policy of the entire travel path acquired this time.
In this embodiment, an overall revision of the execution policy of the entire travel path is performed once per cycle. The revised content mainly includes the scores of the execution policy and/or the execution policy items.
The scheme of this embodiment mainly solves the problem that the unmanned underwater vehicle detours during one path-finding training cycle because the score of the selected execution decision, or an execution decision item itself, is inappropriate. With this scheme, the travel path of each short cycle is analyzed from an overall perspective, and problems on the travel path are corrected promptly according to the analysis results, so that the same problem does not reappear in subsequent training, which effectively reduces the number of training runs and saves training resources.
Step S107: repeatedly execute the step of reading the execution decision of the unmanned underwater vehicle and the subsequent steps; after each travel path is acquired, modify the scores of the execution decisions according to the evaluation of the execution decisions on the latest travel path, until a preset condition is reached, completing the unmanned underwater vehicle path-finding training.
Steps S101 to S106 are executed in a loop; the travel paths of the unmanned underwater vehicle acquired in each short cycle are compared and analyzed, and the scores of the existing execution decisions are revised according to the overall evaluation of the execution decisions on the latest travel path, so that the scores better reflect the practically usable value of each execution decision.
In embodiment 6, the preset condition specifically includes: the variation of the scores of the execution decisions, modified according to the execution decisions executed on the latest travel path, is smaller than a threshold, and the current execution decisions keep the time consumed by the unmanned underwater vehicle to reach the training end point fluctuating within a fixed interval.
In the scheme provided by this embodiment, the basis for judging that the current execution decision selection logic of the unmanned underwater vehicle is adapted to the current training site is as follows:
1. The variation of the scores of the execution decisions, modified according to the execution decisions executed on the latest travel path, is smaller than the threshold.
In the latest path-finding training of the unmanned underwater vehicle, the corrections made to the scores of the execution decisions during path finding are very small, or the scores remain unchanged.
2. The time consumed by the unmanned underwater vehicle to reach the training end point fluctuates within a fixed interval.
According to the current execution decision values, the time for the unmanned underwater vehicle to reach the training end point from any starting point no longer changes greatly, but only fluctuates within a fixed interval.
Based on these two points, it can be confirmed that the unmanned underwater vehicle has completed training in the current scene and can realize its autonomous path-finding function according to its execution decisions.
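The two stop conditions above can be sketched as a simple check. The function name, the threshold, and the interval width are illustrative assumptions, not values from the patent.

```python
def training_converged(score_deltas, travel_times, delta_threshold=1e-3, band=5.0):
    """Check the two stop conditions described above:
    (1) recent score changes are all below a threshold;
    (2) time-to-goal only fluctuates inside a fixed interval of width `band`."""
    scores_stable = max(abs(d) for d in score_deltas) < delta_threshold
    times_stable = (max(travel_times) - min(travel_times)) <= band
    return scores_stable and times_stable

# converged: tiny score corrections, travel times within a 5-unit band
print(training_converged([0.0, 1e-4], [100.0, 102.5, 101.0]))
# not converged: scores still changing noticeably
print(training_converged([0.5, 0.0], [100.0, 101.0]))
```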
In addition, embodiment 7 of the present application is also provided, in which every execution decision and its evaluation are saved, and the saved execution decisions and evaluations are applied to adjusting the scores of the execution decisions.
During path-finding training, each execution decision and its evaluation are stored in a decision pool. The saved execution decisions and their evaluations are all kept in an experience pool as learning samples for the training process.
The learning samples store, for each node, the execution decision items together with their evaluations; at the same time, the execution decision items and evaluations of all related nodes can be recalled from a single action result.
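The decision pool described above can be sketched as a small container of (node, decision, evaluation) records. The class and method names are illustrative assumptions; the capacity and sampling behavior are not specified by the patent.

```python
import random
from collections import deque

class ExperiencePool:
    """Stores (node, decision, evaluation) tuples as learning samples."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest samples drop out when full

    def save(self, node, decision, evaluation):
        self.buffer.append((node, decision, evaluation))

    def sample(self, k):
        """Draw up to k random samples for learning."""
        return random.sample(self.buffer, min(k, len(self.buffer)))

    def lookup(self, node):
        """Recall every stored decision item and evaluation for one node."""
        return [(d, e) for n, d, e in self.buffer if n == node]

pool = ExperiencePool()
pool.save("s1", "left90", +10)
pool.save("s1", "right90", -10)
print(pool.lookup("s1"))  # both decisions recorded for node s1
```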
Taking a planar map as an example of the unmanned underwater vehicle path-finding training method, as shown in fig. 4, the map represents the passable tracks between places 1 to 6; the arrow directions and the positive or negative values between any two directly connected places represent the score for executing a decision along that arrow direction. From fig. 4, fig. 5 can be derived: fig. 5 represents the scores for selecting an execution decision between places 1 to 6, where states 1 to 6 in the rows represent the current place, actions 1 to 6 in the columns represent the execution decision, i.e. the target place to be reached from the current place, and the numbers in the matrix represent the score of the execution decision that moves from the starting place to the target place (if no passable path exists between two places, the score is -100). With this matrix, the optimal path can be found.
In this example, assume a specific path-finding process: place 4 is randomly selected as the start, and the final goal is to reach place 6; from place 4, the selectable execution decisions are 1, 5 and 6. Decision 6 is chosen at random, at which point the final place in the simulation environment has been reached, so this test ends with a score of 90 (the place is reached, and a reward of 30 is awarded for reaching it). The record of this run is then saved. Starting again from randomly selected place 1 and randomly choosing action 4, a reward of 110 is obtained.
The scores for execution decisions shown in fig. 5 do not exist at the beginning of training and must be obtained through continuous simulation tests. As shown in fig. 6, the score matrix of fig. 5 is an all-zero matrix at the start of training; through continuous training, the all-zero matrix of fig. 6 gradually grows into the complete matrix of fig. 5 according to the scores obtained by continually exploring execution policies.
The matrix data shown in fig. 4 is modified according to the formula Q(s,a) = r + γ·max Q(s′,a′), with the learning rate α = 1 and the discount factor γ = 0.8. Here place 4 is taken as the starting point and place 6 as the final point, and Q(6,6) represents the score of starting from place 6 and reaching place 6.
Q(4,6)_new = R(4,6) + 0.8·max{Q(6,4), Q(6,5), Q(6,6)} = 90 + 0.8 × 0 = 90;
In this way, the updates are iterated continuously in the simulation environment until the change in the score values of the final matrix is smaller than the threshold and the time consumed by the unmanned underwater vehicle to reach the training end point fluctuates within a fixed interval; the optimal path is then found according to the final matrix.
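The worked update above can be checked numerically. Only R(4,6) = 90, α = 1, γ = 0.8, and the all-zero starting table (fig. 6) come from the text; the variable names are illustrative.

```python
# Reproduce Q(4,6)_new = R(4,6) + 0.8 * max{Q(6,4), Q(6,5), Q(6,6)}
gamma = 0.8
R46 = 90.0                                   # reward for the 4 -> 6 transition
Q = {(6, a): 0.0 for a in (4, 5, 6)}         # Q(6,*) all zero at the start of training
Q46_new = R46 + gamma * max(Q[(6, 4)], Q[(6, 5)], Q[(6, 6)])
print(Q46_new)  # 90.0, matching 90 + 0.8 * 0 = 90 in the text
```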
Based on the above embodiments, the present application also discloses an unmanned underwater vehicle path-finding training device. Referring to fig. 7, the device includes:
an execution decision reading module 701, configured to read an execution decision of the unmanned underwater vehicle;
a path-finding action control module 702, configured to control the unmanned underwater vehicle to perform a path-finding action according to the execution decision;
an execution decision evaluation module 703, configured to evaluate an execution decision taken in the way finding action according to a way finding action result;
an execution decision scoring module 704, configured to modify a score of the execution decision in the way finding action according to the evaluation of the execution decision;
an execution decision selection module 705 for selecting the execution decision according to the score of the execution decision;
a single-path execution decision updating module 706, configured to repeatedly execute the step of reading the execution decision of the unmanned underwater vehicle and the subsequent steps until the unmanned underwater vehicle travels to a training end point, and acquire a travel path;
and a multi-path execution policy updating module 707, configured to repeatedly execute the step of reading the execution decision of the unmanned underwater vehicle and the subsequent steps, modify the score of the execution decision according to the evaluation of the execution decisions on the latest travel path after each travel path is acquired, until a preset condition is reached, completing the unmanned underwater vehicle path-finding training.
Alternatively,
the execution decision reading module 701 is further configured to randomly read one other action instruction as the execution decision of the unmanned underwater vehicle when the execution decision is null and/or the score of the execution decision is lower than a threshold and the execution decision of the unmanned underwater vehicle cannot be read.
Alternatively,
the path-finding action result is divided into an excitation action result, a punishment action result and a stable action result;
the decision evaluation execution module 703 is specifically configured to:
when the action result is an excitation action result, carrying out excitation evaluation on an execution decision taken in the path searching action;
when the action result is a punishment action result, punishment evaluation is carried out on an execution decision taken in the path searching action;
and when the action result is a stable action result, performing stable evaluation on the execution decision taken in the path searching action.
Alternatively,
the decision scoring module 704 is specifically configured to:
when the evaluation of the execution decision is the excitation evaluation, increasing the score of the execution decision in the way-finding action;
when the evaluation of the execution decision is the penalty evaluation, reducing the score of the execution decision in the way searching action;
when the evaluation of the execution decision is the stable evaluation, keeping the score of the execution decision in the path searching action unchanged.
Optionally, the preset condition specifically includes:
the variation of the score of the execution decision, modified according to the execution decisions executed on the latest travel path, is smaller than a threshold, and the current execution decision keeps the time consumed by the unmanned underwater vehicle to reach the training end point fluctuating within a fixed interval.
Alternatively,
the execution decision selection module 705 is specifically configured to read the score according to the execution decision, select an existing execution policy item, or select to regenerate a new execution policy item, or select an execution policy that adjusts the content of the historical execution policy.
Alternatively,
the device also comprises
a path execution policy checking module, configured to modify the scores of the execution policy and/or the execution policy items in the current travel path according to the execution policy of the entire travel path acquired this time.
Alternatively,
the device also comprises
a data saving module, configured to save any execution decision and its evaluation, the saved execution decisions and evaluations being applied to the adjustment of the scores of the execution decisions.
An unmanned underwater vehicle is used for realizing all steps of the unmanned underwater vehicle path-finding training method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The unmanned underwater vehicle path finding training method and device and the unmanned underwater vehicle provided by the application are introduced in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (10)

1. A method for training an unmanned underwater vehicle to find a path, characterized by comprising the following steps:
reading an execution decision of the unmanned underwater vehicle;
controlling the unmanned underwater vehicle to execute a path finding action according to the execution decision;
evaluating an execution decision adopted in the path searching action according to the path searching action result;
modifying the score of the execution decision in the way-finding action according to the evaluation of the execution decision;
selecting the execution decision according to the score of the execution decision;
repeatedly executing the execution decision and the subsequent steps of reading the unmanned underwater vehicle until the unmanned underwater vehicle runs to a training terminal to obtain a running path;
and repeatedly executing the execution decision and the subsequent steps of reading the unmanned underwater vehicle, and after the driving path is obtained every time, modifying the score of the execution decision according to the evaluation of the execution decision of the latest driving path until a preset condition is reached, and finishing the path finding training of the unmanned underwater vehicle.
2. The method according to claim 1, wherein the decision to execute the unmanned underwater vehicle reading specifically comprises:
and when the execution decision is empty and/or the score of the execution decision is lower than a threshold value, and the execution decision of the unmanned underwater vehicle cannot be read, randomly reading one other action instruction as the execution decision of the unmanned underwater vehicle.
3. The method according to claim 1, wherein the evaluating the execution decision taken in the way-finding action according to the result of the way-finding action specifically comprises:
dividing the path searching action result into an excitation action result, a punishment action result and a steady action result;
when the result of the path searching action is the result of an excitation action, carrying out excitation evaluation on an execution decision taken in the path searching action;
when the result of the path searching action is a punishment action result, punishment evaluation is carried out on an execution decision taken in the path searching action;
and when the result of the path searching action is a stable action result, performing stable evaluation on an execution decision taken in the path searching action.
4. The method according to claim 3, wherein modifying the score of the execution decision in the routing action based on the evaluation of the execution decision comprises:
when the evaluation of the execution decision is the excitation evaluation, increasing the score of the execution decision in the way-finding action;
when the evaluation of the execution decision is the penalty evaluation, reducing the score of the execution decision in the way searching action;
when the evaluation of the execution decision is the stable evaluation, keeping the score of the execution decision in the path searching action unchanged.
5. The method according to claim 1, wherein the preset condition specifically includes:
and modifying the score of the execution decision according to the execution decision executed by the latest driving path, wherein the variation value of the score is smaller than a threshold value, and the current execution decision enables the consumption time value of the unmanned underwater vehicle reaching the training end point to fluctuate in a fixed interval.
6. The method of claim 1, wherein selecting the execution decision based on the score of the execution decision comprises:
and selecting an existing execution strategy item, or selecting to regenerate a new execution strategy item, or selecting an execution strategy for adjusting the content of the historical execution strategy according to the score of the execution decision.
7. The method of claim 1, wherein after the obtaining the travel path, the method further comprises:
and modifying the evaluation of the execution strategy and/or the execution strategy item in the current driving path according to all the execution actions in the driving path.
8. The method according to any one of claims 1 to 7, further comprising:
saving any of the execution decisions and the evaluations of the execution decisions, the saved specific execution decisions and evaluations of the execution decisions applying to the adjustment of the scores of the execution decisions.
9. An unmanned underwater vehicle path finding training device, characterized in that the device comprises:
the execution decision reading module is used for reading an execution decision of the unmanned underwater vehicle;
the route searching action control module is used for controlling the unmanned underwater vehicle to execute a route searching action according to the execution decision;
the execution decision evaluation module is used for evaluating the execution decision taken in the path searching action according to the path searching action result;
the execution decision scoring module is used for modifying the score of the execution decision in the way searching action according to the evaluation of the execution decision;
an execution decision selection module for selecting the execution decision according to the score of the execution decision;
the single-path execution decision updating module is used for repeatedly executing the execution decision and the subsequent steps of reading the unmanned underwater vehicle until the unmanned underwater vehicle runs to a training terminal to obtain a running path;
and the multi-path execution strategy updating module is used for repeatedly executing the execution decision and the subsequent steps of reading the unmanned underwater vehicle, and after the driving path is obtained every time, modifying the score of the execution decision according to the evaluation of the execution decision of the latest driving path until a preset condition is reached, and finishing the unmanned underwater vehicle path-finding training.
10. An unmanned underwater vehicle, characterized in that the unmanned underwater vehicle is configured to implement the unmanned underwater vehicle path-finding training method according to any one of claims 1 to 8.
CN202210939126.XA 2022-08-05 2022-08-05 Unmanned underwater vehicle path finding training method and device and unmanned underwater vehicle Pending CN115206157A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210939126.XA CN115206157A (en) 2022-08-05 2022-08-05 Unmanned underwater vehicle path finding training method and device and unmanned underwater vehicle

Publications (1)

Publication Number Publication Date
CN115206157A true CN115206157A (en) 2022-10-18

Family

ID=83585348

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210939126.XA Pending CN115206157A (en) 2022-08-05 2022-08-05 Unmanned underwater vehicle path finding training method and device and unmanned underwater vehicle

Country Status (1)

Country Link
CN (1) CN115206157A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110035087A1 (en) * 2009-08-10 2011-02-10 Samsung Electronics Co., Ltd. Method and apparatus to plan motion path of robot
CN106873585A (en) * 2017-01-18 2017-06-20 无锡辰星机器人科技有限公司 One kind navigation method for searching, robot and system
CN106919181A (en) * 2016-10-20 2017-07-04 湖南大学 A kind of unmanned plane barrier-avoiding method
US20190147751A1 (en) * 2016-06-13 2019-05-16 SZ DJI Technology Co., Ltd. Unmanned aerial vehicle, delivery system, control method for unmanned aerial vehicle, and program for controlling unmanned aerial vehicle
CN110882542A (en) * 2019-11-13 2020-03-17 广州多益网络股份有限公司 Training method, device, equipment and storage medium for game agent
US20200249674A1 (en) * 2019-02-05 2020-08-06 Nvidia Corporation Combined prediction and path planning for autonomous objects using neural networks
CN112241176A (en) * 2020-10-16 2021-01-19 哈尔滨工程大学 Path planning and obstacle avoidance control method of underwater autonomous vehicle in large-scale continuous obstacle environment
US20210109537A1 (en) * 2019-10-09 2021-04-15 Wuhan University Autonomous exploration framework for indoor mobile robotics using reduced approximated generalized voronoi graph
CN113867178A (en) * 2021-10-26 2021-12-31 哈尔滨工业大学 Virtual and real migration training system for multi-robot confrontation
CN114089762A (en) * 2021-11-22 2022-02-25 江苏科技大学 Water-air amphibious unmanned aircraft path planning method based on reinforcement learning
CN114489059A (en) * 2022-01-13 2022-05-13 沈阳建筑大学 Mobile robot path planning method based on D3QN-PER
CN114756017A (en) * 2021-12-16 2022-07-15 无锡中盾科技有限公司 Navigation obstacle avoidance method combining unmanned aerial vehicle and unmanned ship

Similar Documents

Publication Publication Date Title
Nguyen et al. Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning
CN109726866A (en) Unmanned boat paths planning method based on Q learning neural network
CN112937564B (en) Lane change decision model generation method and unmanned vehicle lane change decision method and device
CN108820157B (en) Intelligent ship collision avoidance method based on reinforcement learning
US20200372822A1 (en) Training system for autonomous driving control policy
CN113805572A (en) Method and device for planning movement
CN112099496A (en) Automatic driving training method, device, equipment and medium
CN112034887A (en) Optimal path training method for unmanned aerial vehicle to avoid cylindrical barrier to reach target point
Gómez et al. Optimal motion planning by reinforcement learning in autonomous mobile vehicles
CN112698646B (en) Aircraft path planning method based on reinforcement learning
CN112033410A (en) Mobile robot environment map construction method, system and storage medium
CN111783944A (en) Rule embedded multi-agent reinforcement learning method and device based on combination training
González et al. Human-like decision-making for automated driving in highways
CN112180950A (en) Intelligent ship autonomous collision avoidance and path planning method based on reinforcement learning
CN109556609B (en) Artificial intelligence-based collision avoidance method and device
CN114859910A (en) Unmanned ship path following system and method based on deep reinforcement learning
CN113554680A (en) Target tracking method and device, unmanned aerial vehicle and storage medium
Rottmann et al. Adaptive autonomous control using online value iteration with gaussian processes
CN115206157A (en) Unmanned underwater vehicle path finding training method and device and unmanned underwater vehicle
CN114964268B (en) Unmanned aerial vehicle navigation method and device
Feng et al. Mobile robot obstacle avoidance based on deep reinforcement learning
CN114740868A (en) Mobile robot path planning method based on deep reinforcement learning
Pecka et al. Safe exploration for reinforcement learning in real unstructured environments
Frost et al. Evaluation of Q-learning for search and inspect missions using underwater vehicles
CN113627249B (en) Navigation system training method and device based on contrast learning and navigation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20240507