CN114021773A - Path planning method and device, electronic equipment and storage medium - Google Patents

Path planning method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN114021773A
CN114021773A (application CN202111132464.4A)
Authority
CN
China
Prior art keywords
value
grid
corresponding relation
reward
state data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111132464.4A
Other languages
Chinese (zh)
Inventor
周英敏 (Zhou Yingmin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111132464.4A
Publication of CN114021773A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06Q — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 — Administration; Management
    • G06Q10/04 — Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/047 — Optimisation of routes or paths, e.g. travelling salesman problem
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Development Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Computational Linguistics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)

Abstract

The disclosure provides a path planning method, a path planning device, electronic equipment and a storage medium, and relates to the technical field of deep learning and space-time big data, in particular to the field of path planning. The specific implementation scheme is as follows: constructing a grid map; obtaining a Q table; determining a first corresponding relation to which the current grid state data of the agent belongs from the Q table; executing corresponding action according to the action data in the first corresponding relation and returning next grid state data and reward value; updating the Q value of the first corresponding relation according to the reward value and the parameters of the deep reinforcement learning model to obtain an updated Q table; judging whether a termination condition is met, and if the termination condition is met, obtaining a path according to a grid passed by the intelligent agent; and if the termination condition is not met, returning to execute the operation of determining the first corresponding relation to which the current grid state data of the agent belongs from the Q table. The method has the effects of high timeliness and strong robustness, and the planned route is more scientific.

Description

Path planning method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of deep learning and space-time big data, and in particular, to a path planning method and apparatus, an electronic device, and a storage medium in the field of path planning.
Background
Ocean transport has a history of several hundred years, yet its development remains limited by how scientifically ship routes are planned. Traditional ocean path planning relies mainly on crews drawing routes by hand, which consumes a great deal of manpower while the planned routes are not accurate enough.
Disclosure of Invention
The disclosure provides a path planning method, a path planning device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a path planning method, including:
constructing a grid map of an agent, wherein each grid in the grid map corresponds to one grid state data;
acquiring a Q table, wherein the Q table is used for recording the corresponding relation between the grid state data and the action data and representing the Q value of the corresponding relation;
determining a first corresponding relation to which the current grid state data of the agent belongs from the Q table;
executing corresponding action according to the action data in the first corresponding relation so that the intelligent agent moves to the next grid, and returning the next grid state data of the intelligent agent;
returning a reward value corresponding to the next grid state data according to the reward function;
updating the Q value of the first corresponding relation according to the reward value and the parameters of the deep reinforcement learning model to obtain an updated Q table;
judging whether a termination condition is met, and if the termination condition is met, obtaining a path according to a grid passed by the intelligent agent; and if the termination condition is not met, returning to execute the operation of determining the first corresponding relation to which the current grid state data of the agent belongs from the Q table.
According to another aspect of the present disclosure, there is provided a path planning apparatus including:
a construction module, which is used for constructing a grid map of an agent, wherein each grid in the grid map corresponds to one grid state data;
an acquisition module, which is used for acquiring a Q table, and the Q table is used for recording the corresponding relation between the grid state data and the action data and representing the Q value of the corresponding relation;
the determining module is used for determining a first corresponding relation to which the current grid state data of the intelligent agent belongs from the Q table;
the execution module is used for executing corresponding actions according to the action data in the first corresponding relation so as to enable the intelligent agent to move to the next grid and return the state data of the next grid of the intelligent agent;
the execution module is further used for returning a reward value corresponding to the next grid state data according to a reward function;
the updating module is used for updating the Q value of the first corresponding relation according to the reward value and the parameters of the deep reinforcement learning model to obtain an updated Q table;
the judging module is used for judging whether the termination condition is met or not, and if the termination condition is met, obtaining a path according to the grids passed by the intelligent agent; and if the termination condition is not met, returning to execute the operation of determining the first corresponding relation to which the current grid state data of the agent belongs from the Q table.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the path planning method.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to execute the path planning method.
According to yet another aspect of the disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the path planning method.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of a path planning method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a grid map provided in accordance with an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a path planning apparatus provided in accordance with an embodiment of the present disclosure;
fig. 4 is a block diagram of an electronic device for implementing a path planning method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Path planning is one of the main research topics of motion planning. Motion planning consists of path planning and trajectory planning: the sequence of points or the curve connecting the start position and the end position is called a path, and the strategy for forming that path is called path planning.
Path planning is widely applied in many fields, such as autonomous collision-free motion of robots, obstacle avoidance and penetration flight of unmanned aerial vehicles, and route planning of marine vessels; robots, unmanned aerial vehicles, ships and the like are all agents. Traditional ocean path planning relies mainly on crews drawing routes by hand, which consumes a great deal of manpower while the planned routes are not accurate enough; in addition, approaches based on genetic algorithms, simulated annealing, particle swarm optimization and the like suffer from low timeliness.
In order to solve the above problem, an embodiment of the present disclosure provides a path planning method, where an agent to which the path planning method is applied is not limited to only ships, as shown in fig. 1, and the method includes:
step S101, constructing a grid map of the intelligent agent, wherein each grid in the grid map corresponds to one grid state data.
The marine environment is modeled with a grid (raster) method, and a grid map is constructed from the original marine chart according to input data, wherein the input data comprise meteorological data, ocean data, geographic environment data and ship data.
The meteorological data includes: wind speed, direction, temperature, humidity, etc.;
the ocean data includes: water flow speed, water flow direction;
the geographic environmental data includes: islands, submerged reefs, restricted zones, etc.;
the ship data includes: ship size, load, ship angle constraint, time constraint and the like.
Fig. 2 shows a grid map obtained by partitioning the marine environment with the grid method. The map is 10 × 10, giving 100 grids; each grid represents one state, so the grid map in Fig. 2 has 100 states (a 20 × 20 map with 400 grids would have 400 states). To guarantee sailing safety, areas such as islands, reefs and restricted zones are replaced by their minimum bounding rectangles. In Fig. 2, 0 denotes non-travelable areas such as islands and reefs; 1 and 2 denote travelable areas constructed from ocean data and wind-direction data, where an area marked 2 is more favorable for travel than an area marked 1 because of its meteorological and marine conditions; and 3 denotes a ship's starting position. For convenience of description, the position of each grid in the grid map is written as coordinates (x, y), where x denotes the row (increasing from top to bottom) and y denotes the column (increasing from left to right); for example, (1,1) is the grid in the first row and first column, and (2,5) is the grid in the second row and fifth column.
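As an illustration only (not code from the patent), the grid map just described can be held as a two-dimensional array using the 0/1/2/3 cell encoding of Fig. 2; the obstacle and favorable-region coordinates below are hypothetical, and the helper simply numbers the grids row by row into states S1 to S100 as described above.

```python
import numpy as np

GRID_SIZE = 10                                          # 10 x 10 map as in Fig. 2
grid_map = np.ones((GRID_SIZE, GRID_SIZE), dtype=int)   # default 1: travelable area

# Hypothetical regions for illustration only.
grid_map[2:4, 5:8] = 0    # island/reef replaced by its minimum bounding rectangle (non-travelable)
grid_map[6:9, 1:4] = 2    # area where wind and current make travel more favorable
grid_map[0, 0] = 3        # a ship's starting grid

def coord_to_state(row, col):
    """Number grids row by row: (1,1) -> S1, (2,5) -> S15, (10,10) -> S100."""
    return (row - 1) * GRID_SIZE + col
```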
Since there are two vessels in the grid map of Fig. 2, the sailing route of each vessel from its starting point to its destination, i.e., the optimal path of each vessel, now needs to be obtained. The optimal path of the present disclosure must meet the following requirements: (1) ships on different routes do not collide while traveling, and when multiple ships sail at the same time, collisions between them must be avoided; (2) a single ship does not suffer accidents such as capsizing caused by striking a reef or by weather factors; (3) the ship travels from the starting point to the destination within the prescribed time with minimum energy consumption.
In one example, the reward function is set according to circumstances, and is set as follows:
reward value = -b, if the agent touches an obstacle;
reward value = -c, if the agent travels normally;
reward value = d, if the agent reaches the destination;
wherein b, c and d are positive integers, and b > d > c.
the reward feedback mechanism is perfected by optimizing the reward function, so that the learning efficiency is improved. The disclosed agent takes a ship as an example for explanation:
when the ship touches an obstacle, the ship is punished, so that the obtained reward is a negative reward, and the reward value is-b; when the obstacles are areas where the ship cannot travel, such as islands, reefs, restricted areas and the like, namely, the ship enters the area range of 0 in the grid map, the reward value-b is returned.
When the ship travels normally, i.e., in a feasible region of the grid map such as the areas marked 1 and 2 in Fig. 2, the reward is also negative because the ship consumes energy while traveling, and the reward value is set to -c. The value of c is computed from the combined marine environment conditions and ship conditions: c is the energy consumed by the ship per unit time. The marine environment conditions comprise meteorological data and ocean data. If the wind direction or water-flow direction is consistent with the ship's heading, the marine environment favors travel and the ship consumes less energy per unit time; assume the reward value is -c1. If the wind direction or water-flow direction is inconsistent with the ship's heading and the wind speed or water-flow speed is high, the environment does not favor travel and the ship consumes more energy per unit time to overcome the resistance of wind and current; assume the reward value is -c2. Then -c1 is greater than -c2. The ship conditions comprise ship size, load, ship angle constraints and the like; under the same marine environment conditions, the larger the ship and the heavier the load, the more energy the ship consumes per unit time.
When the ship arrives at the destination, the reward obtained is positive and its value is set to d; the shorter the time taken to travel from the starting place to the destination, the larger the reward value d.
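A minimal sketch of this reward scheme follows (an assumption for illustration, not the patent's code); the constants b = 10, c = 5 and d = 6 mirror the example values used later in this disclosure and satisfy b > d > c, while in practice c would be recomputed each step from the marine environment and ship conditions.

```python
def reward(next_cell_value, reached_destination, b=10.0, c=5.0, d=6.0):
    """Reward for entering a grid cell, using the 0/1/2 encoding of the grid map."""
    if next_cell_value == 0:      # island, reef or restricted zone: obstacle penalty
        return -b
    if reached_destination:       # positive reward for arriving at the destination
        return d
    return -c                     # normal travel: negative reward equal to the energy cost
```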
Step S102, obtaining a Q table, wherein the Q table is used for recording the corresponding relation between the grid state data and the action data and representing the Q value of the corresponding relation.
In one example, obtaining the Q table includes: obtaining an initialized Q table or obtaining a Q table in an intermediate state; the present disclosure takes the initialized Q table as an example.
A Q table is initialized according to prior knowledge. The Q table is used for recording the corresponding relation between the grid state data and the action data and representing the Q value of the corresponding relation; the action that can obtain the maximum reward is then selected according to the Q values, and the Q value is larger the closer the ship is to the destination.
With reference to Fig. 2, the grid map includes 100 grid state data. Assume the first grid state data at the upper left corner is denoted S1, the rest of the first row is denoted S2 to S10 from left to right, the second row is denoted S11 to S20 from left to right, and so on, so that the last grid state data at the lower right corner is S100. The action data characterize specific actions of the agent, such as moving forward, moving backward and turning; for example, a1 indicates that the agent advances 1 grid, a2 indicates that the agent retreats 2 grids, a3 indicates that the agent turns 20 degrees, and so on, and all actions of the agent are numbered in this way.
For example, if the agent has k actions and there are n grid state data, the Q table is as shown in Table 1 below:
State | a1 | a2 | a3 | ... | ak
S1 | Q(S1, a1) | Q(S1, a2) | Q(S1, a3) | ... | Q(S1, ak)
S2 | Q(S2, a1) | Q(S2, a2) | Q(S2, a3) | ... | Q(S2, ak)
S3 | Q(S3, a1) | Q(S3, a2) | Q(S3, a3) | ... | Q(S3, ak)
... | ... | ... | ... | ... | ...
Sn | Q(Sn, a1) | Q(Sn, a2) | Q(Sn, a3) | ... | Q(Sn, ak)
TABLE 1
wherein the action space is A = {a1, a2, ..., ak} and the state space is S = {S1, S2, ..., Sn}.
Initializing the Q-table based on a priori knowledge, for example, the initialized Q-table is shown in table 2 below:
State | a1 | a2 | a3 | ... | ak
S1 | 0 | 1 | 0 | ... | 0
S2 | 0 | 0 | 0 | ... | 0
S3 | 0 | 0 | 2 | ... | 0
... | ... | ... | ... | ... | ...
Sn | 0 | 0 | 0 | ... | 0
TABLE 2
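The sketch below (assumed, not the patent's implementation) shows how a Q table of this shape might be initialized from prior knowledge for a 10 × 10 map (n = 100 states) and a hypothetical k = 8 actions; the two non-zero entries reproduce Table 2.

```python
import numpy as np

n_states, n_actions = 100, 8          # assumed sizes: 10 x 10 grid, k = 8 actions
Q = np.zeros((n_states, n_actions))   # default: no prior preference

# Hypothetical prior knowledge: state-action pairs known to be promising start above zero.
Q[0, 1] = 1.0                         # Q(S1, a2) = 1, as in Table 2
Q[2, 2] = 2.0                         # Q(S3, a3) = 2, as in Table 2
```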
Step S103, determining, from the Q table, the first corresponding relation to which the current grid state data of the agent belongs.
The state information is continuously propagated forward, and the ship selects a specific action from the action space according to a greedy algorithm: the ship selects the action corresponding to the maximum Q value for its current state. With reference to Table 2, if the ship is currently in state S1, the maximum Q value in the S1 row is 1 and the corresponding action is a2, so the specific action selected by the ship is a2. If the ship is currently in state S2, every Q value in the S2 row is 0, and the ship then randomly selects any one of the actions a1 to ak.
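A minimal sketch of this selection rule (an assumption: ties, including an all-zero row, are broken uniformly at random):

```python
import numpy as np

def select_action(Q, state_index, rng=None):
    """Pick the action with the largest Q value for the current state, breaking ties randomly."""
    if rng is None:
        rng = np.random.default_rng()
    row = Q[state_index]
    best = np.flatnonzero(row == row.max())   # all actions sharing the maximum Q value
    return int(rng.choice(best))              # random choice among ties, greedy otherwise
```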
Step S104, executing corresponding action according to the action data in the first corresponding relation to enable the intelligent agent to move to the next grid, and returning the state data of the next grid of the intelligent agent;
and returning the reward value corresponding to the next grid state data according to the reward function.
After the ship executes the corresponding action, the agent moves to the next grid. For example, if the ship's current position is (8,8) and the selected action is to advance 1 grid, the ship's next position is (7,8). If the ship strikes a reef after advancing 1 grid, then according to the reward function the reward for touching an obstacle is negative and a negative reward is returned, say with value -10. If the ship is in a normal traveling state after advancing 1 grid, a negative reward is returned; if the energy the ship consumes per unit time is 5, the reward value -5 is returned. If the ship reaches the destination after advancing 1 grid, a positive reward is returned, say with value 6.
Step S105, updating the Q value of the first corresponding relation according to the reward value and the parameters of the deep reinforcement learning model to obtain an updated Q table.
In one example, the deep reinforcement learning model updates the Q value as follows:
Q(St, at) ← Q(St, at) + α·[rt + γ·maxπ Q(St+1, at+1) − Q(St, at)]    Formula (1);
wherein St is the state at time t and at is the action of the ship at time t; Q(St, at) is the value function of state St at time t; rt is the instant reward obtained after the ship performs action at at time t and reaches state St+1 at time t+1; Q(St+1, at+1) is the value function of taking action at+1 in state St+1 at time t+1; and π is the sequence of action decisions, i.e., the set of all actions performed by the ship from the starting point to the destination.
The parameters of the deep reinforcement learning model comprise the maximum iteration period Tmax at which learning terminates, the learning rate α, the discount factor γ and the exploration degree A. Tmax represents the maximum acceptable learning time of the deep reinforcement learning model; the maximum iteration period Tmax may also be expressed as a maximum number of iterations, which is not limited in the present disclosure. For example, if the maximum iteration period Tmax is set to 12 h, model learning is terminated when the learning time exceeds 12 h; or, if the maximum number of iterations is set to 10000, model learning is terminated when the number of iterations exceeds 10000. The learning rate α is a real number between 0 and 1 and is set according to empirical values. The discount factor γ is a real number between 0 and 1; in the embodiments of the present disclosure γ is 0.8 or 0.9, and the later a reward is obtained, the more heavily it is discounted. The exploration degree A is the action space, i.e., the set of actions a; the action space includes actions such as moving forward, moving backward and turning, where moving forward means continuing in the current direction, moving backward means moving opposite to the current direction, and turning means changing the direction of movement, the turning angle being an integer between -180° and 180°.
In one example, the parameters of the deep reinforcement learning model further comprise the capacity N of the experience pool and the target Q network weight update period C, both of which are preset according to empirical values.
The learning rate α and the discount factor γ are substituted into Formula (1): the largest Q(St+1, a) of the next state St+1 is multiplied by the discount factor γ and added to the true reward value rt to obtain the true Q value, i.e., rt + γ·maxπ Q(St+1, a); the Q(S, a) stored in the Q table is an approximate Q value, and the difference between the true value and the approximate value is rt + γ·maxπ Q(St+1, a) − Q(St, at). The updated target Q value equals the Q value before the update plus α times this difference.
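A minimal sketch of this update (assumed, reusing the tabular Q array from the earlier sketches); the commented call reproduces the worked example that follows.

```python
def update_q(Q, s, a, r, s_next, alpha=1.0, gamma=0.9):
    """Formula (1): move Q(s, a) toward the target r + gamma * max_a' Q(s_next, a')."""
    target = r + gamma * Q[s_next].max()      # target (true) Q value
    Q[s, a] += alpha * (target - Q[s, a])     # Q value before update + alpha x difference
    return Q[s, a]

# With alpha = 1 and gamma = 0.9, starting from Table 2 (Q(S1, a2) = 1, max Q(S3, .) = 2),
# a reward of 2 gives update_q(Q, 0, 1, 2, 2) == 2 + 0.9 * 2 == 3.8, matching the example below.
```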
With reference to Table 2, suppose the ship is currently in state S1 and the specific action it selects is a2; the reward value r obtained is then 2 and the ship enters the next state S3. The target Q value is updated according to Formula (1); assuming the learning rate α is 1 and the discount factor γ is 0.9, then:
Q(S1, a2) = 2 + 0.9 × maxπ Q(S3, a); the maximum Q value in the S3 row is 2, therefore Q(S1, a2) = 2 + 0.9 × 2 = 3.8.
The updated Q table is therefore shown in table 3 below:
State | a1 | a2 | a3 | ... | ak
S1 | 0 | 3.8 | 0 | ... | 0
S2 | 0 | 0 | 0 | ... | 0
S3 | 0 | 0 | 2 | ... | 0
... | ... | ... | ... | ... | ...
Sn | 0 | 0 | 0 | ... | 0
TABLE 3
The ship is now in state S3 and the specific action it selects is a3; the reward value r obtained is then -1 and the ship enters the next state S4. Following the method above,
Q(S3, a3) = -1 + 0.9 × maxπ Q(S4, a); assuming the maximum Q value in the S4 row is 3, the updated Q(S3, a3) = -1 + 0.9 × 3 = 1.7.
The updated Q table is thus shown in table 4 below:
State | a1 | a2 | a3 | ... | ak
S1 | 0 | 3.8 | 0 | ... | 0
S2 | 0 | 0 | 0 | ... | 0
S3 | 0 | 0 | 1.7 | ... | 0
... | ... | ... | ... | ... | ...
Sn | 0 | 0 | 0 | ... | 0
TABLE 4
By repeating this method, the Q value is continuously and iteratively updated, and the Q table is updated accordingly. In one example, after obtaining the updated Q table, the method further comprises:
updating the weights of the target value Q network and the weights of the state value Q network.
In one example, updating the weights of the target value Q network and the weights of the state value Q network includes:
determining a target Q value according to the learning rate, the discount factor, the reward value and the maximum Q value of the next state;
updating the weight of a state value Q network according to the target Q value;
updating the weights of the target value Q network once every target Q network weight update period, so that the weights of the target value Q network are equal to the weights of the state value Q network.
By updating the weight of the target value Q network and the weight of the state value Q network, the timeliness of model learning is improved, and the planned path is more scientific.
The updated Q value is taken as the target value, and the state value Q network is then trained; the training loss function of the state value Q network is as follows:
L(ω) = E[(rt + γ·maxπ Q(St+1, at+1; ω′) − Q(St, at; ω))²]
wherein St+1 is the next state, at+1 is the next action, ω is the weight of the state value Q network, and ω′ is the weight of the target value Q network.
After step S105 is executed, (St, St+1, at, rt) is stored as one sample in the experience pool, and if the number of samples exceeds the experience pool capacity N, the earliest sample is deleted. For example, if the experience pool capacity is 10000 samples, then when the 10001st sample is stored, the earliest (first) sample in the pool is deleted, and so on, so that the total number of samples never exceeds the pool capacity.
The reason the present disclosure first stores samples in the experience pool is as follows: if the Q value were updated every time a single sample is obtained, consecutive samples and the influence of the sample distribution would make the Q network train poorly, so samples are first stored in the experience pool. After a certain number of samples have been stored, samples are drawn from the experience pool at random, and step-by-step iterative solution by gradient descent yields the minimum of the loss function and the weight ω of the state value Q network. Once every preset update period C, the weight ω′ of the target value Q network is updated so that ω′ equals ω; updating the weight of the target value Q network makes the Q value approach the target Q value, i.e., Q(St, at; ω) approaches rt + γ·maxπ Q(St+1, at+1; ω′). As the Q value approaches the target Q value, L approaches 0.
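The sketch below (an assumption, simplified to the tabular setting of the earlier sketches rather than a real neural Q network trained by gradient descent) illustrates the experience pool, random mini-batch sampling, and the periodic copy of the state-value weights into the target-value weights described above; names such as `replay_pool` and `sync_target` are illustrative.

```python
import random
from collections import deque

replay_pool = deque(maxlen=10000)       # experience pool with capacity N = 10000

def store_sample(s, a, r, s_next):
    """Store one sample (St, at, rt, St+1); the oldest is dropped once the pool is full."""
    replay_pool.append((s, a, r, s_next))

def train_step(Q_state, Q_target, batch_size=32, alpha=0.1, gamma=0.9):
    """Update the state-value table from a random mini-batch drawn from the pool."""
    batch = random.sample(list(replay_pool), min(batch_size, len(replay_pool)))
    for s, a, r, s_next in batch:
        target = r + gamma * Q_target[s_next].max()        # target computed with omega'
        Q_state[s, a] += alpha * (target - Q_state[s, a])  # move Q(s, a; omega) toward it

def sync_target(Q_state, Q_target):
    """Every C updates: copy the state-value weights into the target-value weights (omega' <- omega)."""
    Q_target[...] = Q_state
```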
Step S106, judging whether a termination condition is met, and if the termination condition is met, obtaining a path according to a grid passed by the intelligent agent; and if the termination condition is not met, returning to execute the operation of determining the first corresponding relation to which the current grid state data of the agent belongs from the Q table.
The termination condition is that the model training reaches the convergence condition or that the model training time exceeds the maximum iteration period; the convergence condition is that the L value of the loss function no longer decreases, i.e., the Q value approaches the target Q value. If the termination condition is not met, steps S103 to S106 are repeated until the termination condition is met, at which point the path of the ship is obtained.
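Putting the pieces together, the following sketch (assumed; `env` and its methods `current_state`, `step` and `converged` are hypothetical stand-ins for the grid-map environment, and `select_action` / `update_q` are the helpers from the earlier sketches) shows the outer loop of steps S103 to S106 running until convergence or until the maximum iteration period Tmax is exceeded, after which the path is read off from the grids the agent passed through.

```python
import time

def plan_path(env, Q, max_hours=12.0):
    """Repeat steps S103-S106 until the termination condition is met, then return the path."""
    start = time.time()
    visited = [env.current_state()]              # grids passed through by the agent
    while True:
        s = visited[-1]
        a = select_action(Q, s)                  # step S103: action from the Q table
        s_next, r = env.step(a)                  # step S104: move to the next grid, get reward
        update_q(Q, s, a, r, s_next)             # step S105: update the Q value / Q table
        visited.append(s_next)
        timed_out = (time.time() - start) / 3600.0 > max_hours   # Tmax exceeded
        if env.converged() or timed_out:         # step S106: termination condition
            return visited                       # path obtained from the grids passed through
```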
According to this path planning method, the Q table is initialized by fusing prior knowledge, which reduces a large number of ineffective iterations in the initial stage of the model, and a deep reinforcement learning model is adopted; on the one hand, this gives stronger robustness and higher timeliness, and on the other hand, it removes the heavy manpower otherwise needed to plan ship routes and makes the planned routes more scientific.
In an example, an embodiment of the present disclosure further provides a path planning apparatus, as shown in fig. 3, the apparatus including:
a building module 201, configured to build a grid map of an agent, where each grid in the grid map corresponds to a grid state data;
an obtaining module 202, configured to obtain a Q table, where the Q table is used to record a correspondence between grid state data and action data, and a Q value representing the correspondence;
a determining module 203, configured to determine, from the Q table, a first corresponding relationship to which the current grid state data of the agent belongs;
the execution module 204 is configured to execute a corresponding action according to the action data in the first corresponding relationship to move the agent to a next grid, and return to next grid state data of the agent;
the execution module 204 is further configured to return a reward value corresponding to the next grid state data according to a reward function;
an updating module 205, configured to update the Q value of the first corresponding relationship according to the reward value and a parameter of the deep reinforcement learning model, so as to obtain an updated Q table;
a judging module 206, configured to judge whether a termination condition is met, and if the termination condition is met, obtain a path according to a grid through which the agent passes; and if the termination condition is not met, returning to execute the operation of determining the first corresponding relation to which the current grid state data of the agent belongs from the Q table.
In one example, the updating module 205 is further configured to update the weights of the target value Q network and the weights of the state value Q network.
In one example, the reward function is set as follows:
reward value = -b, if the agent touches an obstacle;
reward value = -c, if the agent travels normally;
reward value = d, if the agent reaches the destination;
wherein b, c and d are positive integers, and b > d > c.
In one example, the parameters of the deep reinforcement learning model include: the maximum iteration period, the learning rate, the discount factor and the exploration degree;
in one example, the parameters of the deep reinforcement learning model further include: the experience pool capacity and the target value Q network weight update period.
In an example, the update module 205 is specifically configured to:
determining a target Q value according to the learning rate, the discount factor, the reward value and the maximum Q value of the next state;
and updating the weight of the Q network according to the target Q value.
In one example, the termination condition is that the model training reaches the convergence condition or that the model training time exceeds the maximum iteration period, and the convergence condition is that the L value of the loss function no longer decreases, i.e., the Q value approaches the target Q value.
According to an embodiment of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon computer instructions for causing the computer to execute the path planning method.
According to an embodiment of the disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the path planning method.
According to an embodiment of the present disclosure, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the path planning method.
FIG. 4 shows a schematic block diagram of an example electronic device 300 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the apparatus 300 includes a computing unit 301 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)302 or a computer program loaded from a storage unit 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the device 300 can also be stored. The calculation unit 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
Various components in device 300 are connected to I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, or the like; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the device 300 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 301 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 301 performs the respective methods and processes described above, such as a path planning method. For example, in some embodiments, the path planning method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 308. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 300 via ROM 302 and/or communication unit 309. When the computer program is loaded into RAM 303 and executed by the computing unit 301, one or more steps of the path planning method described above may be performed. Alternatively, in other embodiments, the computing unit 301 may be configured to perform the path planning method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (16)

1. A path planning method, comprising:
constructing a grid map of an agent, wherein each grid in the grid map corresponds to one grid state data;
acquiring a Q table, wherein the Q table is used for recording the corresponding relation between the grid state data and the action data and representing the Q value of the corresponding relation;
determining a first corresponding relation to which the current grid state data of the agent belongs from the Q table;
executing corresponding action according to the action data in the first corresponding relation so that the intelligent agent moves to the next grid, and returning the next grid state data of the intelligent agent;
returning a reward value corresponding to the next grid state data according to the reward function;
updating the Q value of the first corresponding relation according to the reward value and the parameters of the deep reinforcement learning model to obtain an updated Q table;
judging whether a termination condition is met, and if the termination condition is met, obtaining a path according to a grid passed by the intelligent agent; and if the termination condition is not met, returning to execute the operation of determining the first corresponding relation to which the current grid state data of the agent belongs from the Q table.
2. The method of claim 1, after obtaining the updated Q-table, the method further comprising:
the weights of the target value Qnet and the weights of the state value Qnet are updated.
3. The method of claim 2, wherein the parameters of the deep reinforcement learning model comprise: maximum iteration period, learning rate, discount factor, and exploratory degree.
4. The method of claim 3, wherein the parameters of the deep reinforcement learning model further comprise: experience pool capacity and target value Q network weight update period.
5. The method of claim 4, wherein the updating the weights of the target value Q network and the weights of the state value Q network comprises:
determining a target Q value according to the learning rate, the discount factor, the reward value and the maximum Q value of the next state;
updating the weight of a state value Q network according to the target Q value;
updating the weights of the target value Q network once every target Q network weight update period, so that the weights of the target value Q network are equal to the weights of the state value Q network.
6. The method of claim 1, wherein the reward function is set to:
if the intelligent agent touches the obstacle, obtaining a reward value-b;
if the intelligent agent normally runs, obtaining a reward value-c;
if the agent arrives at the destination, obtaining a reward value d;
wherein b, c and d are positive integers, and b > d > c.
7. The method of claim 1, wherein the termination condition is that an iteration time of the deep reinforcement learning model is greater than a maximum iteration period or an agent arrives at a destination.
8. A path planning apparatus, comprising:
a construction module, which is used for constructing a grid map of an agent, wherein each grid in the grid map corresponds to one grid state data;
an acquisition module, which is used for acquiring a Q table, and the Q table is used for recording the corresponding relation between the grid state data and the action data and representing the Q value of the corresponding relation;
the determining module is used for determining a first corresponding relation to which the current grid state data of the intelligent agent belongs from the Q table;
the execution module is used for executing corresponding actions according to the action data in the first corresponding relation so as to enable the intelligent agent to move to the next grid and return the state data of the next grid of the intelligent agent;
the execution module is further used for returning a reward value corresponding to the next grid state data according to a reward function;
the updating module is used for updating the Q value of the first corresponding relation according to the reward value and the parameters of the deep reinforcement learning model to obtain an updated Q table;
the judging module is used for judging whether the termination condition is met or not, and if the termination condition is met, obtaining a path according to the grids passed by the intelligent agent; and if the termination condition is not met, returning to execute the operation of determining the first corresponding relation to which the current grid state data of the agent belongs from the Q table.
9. The apparatus of claim 8, wherein the update module is further configured to update the weights of the target value Q network and the weights of the state value Q network.
10. The apparatus of claim 9, wherein the parameters of the deep reinforcement learning model comprise: maximum iteration period, learning rate, discount factor, and exploratory degree.
11. The apparatus of claim 10, wherein the parameters of the deep reinforcement learning model further comprise: experience pool capacity and target value Q network weight update period.
12. The apparatus of claim 9, wherein the update module is specifically configured to:
determining a target Q value according to the learning rate, the discount factor, the reward value and the maximum Q value of the next state;
updating the weight of a state value Q network according to the target Q value;
updating the weights of the target value Q network once every target Q network weight update period, so that the weights of the target value Q network are equal to the weights of the state value Q network.
13. The apparatus of claim 8, wherein the reward function is arranged to:
if the intelligent agent touches the obstacle, obtaining a reward value-b;
if the intelligent agent normally runs, obtaining a reward value-c;
if the agent arrives at the destination, obtaining a reward value d;
wherein b, c and d are positive integers, and b > d > c.
14. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
15. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
16. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202111132464.4A 2021-09-26 2021-09-26 Path planning method and device, electronic equipment and storage medium Pending CN114021773A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111132464.4A CN114021773A (en) 2021-09-26 2021-09-26 Path planning method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111132464.4A CN114021773A (en) 2021-09-26 2021-09-26 Path planning method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114021773A true CN114021773A (en) 2022-02-08

Family

ID=80054892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111132464.4A Pending CN114021773A (en) 2021-09-26 2021-09-26 Path planning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114021773A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114676471A (en) * 2022-04-21 2022-06-28 北京航天飞行控制中心 Method and device for establishing mission planning model of mars vehicle, electronic equipment and medium
CN116519005A (en) * 2023-07-04 2023-08-01 上海云骥跃动智能科技发展有限公司 Path planning method and device
CN116519005B (en) * 2023-07-04 2023-10-03 上海云骥跃动智能科技发展有限公司 Path planning method and device

Similar Documents

Publication Publication Date Title
CN109141430B (en) Power inspection robot path planning method based on simulated annealing ant colony algorithm
CN112650237B (en) Ship path planning method and device based on clustering processing and artificial potential field
Li et al. Path planning of multiple UAVs with online changing tasks by an ORPFOA algorithm
CN108459503B (en) Unmanned surface vehicle track planning method based on quantum ant colony algorithm
Sun et al. AGV path planning based on improved Dijkstra algorithm
CN114021773A (en) Path planning method and device, electronic equipment and storage medium
CN112650229B (en) Mobile robot path planning method based on improved ant colony algorithm
Deng et al. Multi-obstacle path planning and optimization for mobile robot
CN108932876B (en) Express unmanned aerial vehicle flight path planning method introducing black area A and ant colony hybrid algorithm
Zhang et al. A multi-objective path planning method for the wave glider in the complex marine environment
CN114771572A (en) Automatic driving track prediction method, device, equipment and storage medium
CN114003059A (en) UAV path planning method based on deep reinforcement learning under kinematic constraint condition
Ma et al. Path planning of UUV based on HQPSO algorithm with considering the navigation error
Chen et al. Intelligent warehouse robot path planning based on improved ant colony algorithm
CN113325856A (en) UUV optimal operation path planning method based on countercurrent approximation strategy
Gong et al. A mutation operator self-adaptive differential evolution particle swarm optimization algorithm for USV navigation
CN113836661B (en) Time prediction method, model training method, related device and electronic equipment
CN116225046A (en) Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment
CN111912411B (en) Robot navigation positioning method, system and storage medium
CN114565170A (en) Pollutant tracing method and device, equipment, medium and product
Jia et al. Scheduling aircraft landing based on clonal selection algorithm and receding horizon control
CN116203964B (en) Method, equipment and device for controlling vehicle to run
Patle et al. Role of ant colony optimization in path planning of the robot: A review
Chen et al. A fast online planning under partial observability using information entropy rewards
Tan Urban Area End Logistics Drones Distribution Route Planning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination