CN114518758B - Indoor measurement robot multi-target point moving path planning method based on Q learning - Google Patents
Indoor measurement robot multi-target point moving path planning method based on Q learning
- Publication number
- CN114518758B CN114518758B CN202210118037.9A CN202210118037A CN114518758B CN 114518758 B CN114518758 B CN 114518758B CN 202210118037 A CN202210118037 A CN 202210118037A CN 114518758 B CN114518758 B CN 114518758B
- Authority
- CN
- China
- Prior art keywords
- measuring
- target point
- robot
- paths
- return
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0217—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with energy consumption, time reduction or distance reduction criteria
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/02—Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Manipulator (AREA)
- Feedback Control In General (AREA)
Abstract
The invention provides a Q-learning-based method for planning a multi-target-point movement path for an indoor measuring robot. The method is based on the model-free Q-learning algorithm from reinforcement learning: a movement strategy suited to the measuring robot is encoded in the return function, which the robot learns by practice. Using the three elements of state, action, and reward, the robot takes an action according to its current state, records the reward fed back, and decides on a better action to take in the next state, so that the measuring robot automatically obtains a reasonably ordered movement path when the number and positions of the measuring points are known. The proposed path planning method achieves global movement planning over multiple measuring points in an indoor environment, helps the measuring robot complete a measuring task as quickly as possible without unnecessary detours, and greatly improves measurement efficiency.
Description
Technical Field
The invention belongs to the technical field of mobile robots, and particularly relates to a multi-target point moving path planning method of an indoor measurement robot based on Q learning.
Background
Traditional house measurement requires considerable manual labor: the workload is heavy, and human factors introduce additional measurement errors during the process, making accuracy hard to guarantee. Automating the measurement with a robot can effectively improve measurement precision and working efficiency, reduce the workload of surveying staff and excess labor expenditure, and thereby cut costs and shorten the construction period.
In an indoor environment, the measuring robot must enter each room in turn to complete the measuring task. With existing measuring robots, the movement path is chosen manually: an operator selects the measuring points one by one, the points are ordered by selection sequence, and the robot moves in that order. This cannot guarantee the most efficient path, does not improve construction efficiency, and is not applicable once the robot automatically identifies multiple measuring points at once from a map. Therefore, for the robot to pass through all measuring points in a single trip without unnecessary detours, the most efficient path must be planned rationally.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a Q-learning-based multi-target-point movement path planning method for an indoor measuring robot, which effectively solves the problems of ordering the movement target points and planning the movement of the measuring robot during indoor measurement, and greatly improves measurement efficiency.
The present invention achieves the above technical object by the following means.
The method for planning the movement path of the multiple target points of the indoor measurement robot based on Q learning comprises the following steps:
step 1: the measuring robot identifies all measurement target points from the floor plan, where each room is one measuring point marked with a numeric serial number; a path schematic diagram is obtained from the positions of the measurement target points and the connectivity between them;
step 2: setting a return function. Taking the direction the robot faces at the entrance as the reference direction, a direction from the starting point toward a measurement target point is defined as positive if its included angle with the reference direction is less than or equal to 90 degrees, and as negative otherwise. All measurement target points are connected by their actual paths, where a route traversed in opposite directions counts as two paths, one positive and one negative; the length of a path in the positive direction is a positive distance, and otherwise a negative distance. The set of measurement target points is set as the state space S and the set of all paths as the action space A; a path return value (reward) is set for each measurement target point, and a return matrix matched to the floor plan is obtained from the return function;
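As an illustrative sketch (not part of the patent text), the positive/negative direction test in step 2 can be implemented with a dot product: the angle between the start-to-target vector and the reference direction is at most 90 degrees exactly when their dot product is non-negative, so no trigonometry is needed. The coordinate convention and the function name here are assumptions:

```python
def classify_direction(reference, start, target):
    """Classify a path direction per step 2: 'positive' if the angle between
    the start->target vector and the reference direction (the robot facing
    the entrance) is <= 90 degrees, otherwise 'negative'."""
    vx, vy = target[0] - start[0], target[1] - start[1]
    # dot >= 0  <=>  included angle <= 90 degrees
    dot = reference[0] * vx + reference[1] * vy
    return "positive" if dot >= 0 else "negative"

# Reference direction: facing the entrance along +x.
print(classify_direction((1, 0), (0, 0), (2, 1)))   # target ahead of the robot
print(classify_direction((1, 0), (0, 0), (-1, 5)))  # target behind the robot
```

Note that a path at exactly 90 degrees is classified as positive, matching the "less than or equal to" wording of step 2.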
step 3: the measuring robot explores by taking one action to move from one state to another until it reaches the target state. Each exploration counts as one experience, and each experience covers the robot's progress from the initial state to the target state; after the target state is reached, the robot turns to the next exploration;
a two-dimensional robot memory table Q_table is constructed from all the states and actions the measuring robot experiences. The Q_table is initialized to 0; since the environment is unknown, the Q_table initially has only one element, and whenever a new state is discovered a new row and a new column are added;
step 4: defining a Q function shown in the following formula (1) as a utility value function of the robot action for updating the Q_table:
Q(s,a) = R(s,a) + γV*;
wherein s is one of the states in S; a is one of the actions in A; R is the return function, used to calculate the immediate return obtained when the measuring robot selects an action in the current state and transfers to the next state; the immediate return is determined by the current state and the selected action; γ is a discount factor that determines the relative weight of delayed return versus immediate return; V* is the maximum return over all actions in the current state, V* = max[R(s,A)];
Step 5: updating the Q_table according to the following formula to obtain an optimal moving path:
Q(s_t, a_t) ← R(s_t, a_t) + γ·max[Q(s_{t+1}, A_{t+1})]
where t represents the current time.
Further, the return function satisfies the multi-target-point efficient movement strategy of the measuring robot: a measurement target point that has already been visited is no longer treated as a measurement target point, and the path previously divided by that point becomes one continuous path.
Further, the path return value (reward) of each measurement target point is set as follows:
a path with the shortest positive distance between measurement target points has a reward value of 70;
a path in the positive direction between measurement target points has a reward value of 90;
a path in the positive direction with the shortest positive distance has a reward value of 110;
a path with the shortest negative distance between measurement target points has a reward value of 10;
a path in the negative direction between measurement target points has a reward value of 30;
a path in the negative direction with the shortest negative distance has a reward value of 50;
a path meeting none of these conditions has a reward value of 0.
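The seven reward cases above can be sketched as a lookup function. This is an illustrative reading, not the patent's code; in particular, the precedence applied when several conditions hold (combined cases checked before single ones) is an assumption:

```python
def path_reward(positive_direction, shortest_positive,
                negative_direction, shortest_negative):
    """Return the reward value for a path per the listed strategy.

    Each argument is a boolean for one of the four listed conditions;
    combined conditions are checked before single ones (an assumption).
    """
    if positive_direction and shortest_positive:
        return 110
    if positive_direction:
        return 90
    if shortest_positive:
        return 70
    if negative_direction and shortest_negative:
        return 50
    if negative_direction:
        return 30
    if shortest_negative:
        return 10
    return 0  # no condition met
```

For example, a positive-direction path that is also the shortest yields 110, while a path satisfying no condition yields 0.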
Further, the steps for updating the Q_table are as follows:
s1: set the parameter γ and the environmental reward values in the return matrix;
s2: initialize the Q_table to 0;
s3: determine the initial state of the measuring robot;
s4: select one possible action from all possible actions in the current state;
s5: using this action, the measuring robot reaches the next state;
s6: for the next state, obtain the maximum Q value over all of its possible actions;
s7: replace the previous Q value according to the Q_table update formula;
s8: repeat steps S4 to S7 until the Q_table converges.
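Steps S1 to S8 can be sketched as a small training loop. This is an illustrative sketch under assumptions not stated in the patent: the return matrix is stored as a dict mapping each state to its reachable next states and rewards, each action is identified with the next state it leads to, and a fixed number of random episodes stands in for the convergence test of S8:

```python
import random

def train_q_table(R, gamma=0.8, episodes=500, seed=0):
    """Q-learning per steps S1-S8. R is the return matrix as a dict
    mapping each state to {next_state: reward}; terminal states map to {}."""
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in acts} for s, acts in R.items()}  # S2: Q_table all 0
    for _ in range(episodes):                                  # S8: repeat until stable
        s = rng.choice(list(R))                                # S3: initial state
        for _ in range(len(R)):                                # cap steps per episode
            if not R[s]:                                       # terminal: no actions left
                break
            a = rng.choice(list(R[s]))                         # S4: one possible action
            s_next = a                                         # S5: action leads there
            best_next = max(Q[s_next].values(), default=0.0)   # S6: max Q of next state
            Q[s][a] = R[s][a] + gamma * best_next              # S7: replace old Q value
            s = s_next
    return Q

# Hypothetical 3-room example: from room 0 one can move to 1 or 2,
# from room 1 to 2; arriving at room 2 from room 1 pays a reward of 100.
R = {0: {1: 0.0, 2: 0.0}, 1: {2: 100.0}, 2: {}}
Q = train_q_table(R, gamma=0.5)
```

After training, the greedy action from room 0 is room 1 (Q = 0 + 0.5·100 = 50) rather than moving straight to room 2 (Q = 0), reproducing the intended ordering behaviour on this toy graph.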
The invention has the following beneficial effects:
the invention utilizes a model-independent Q learning algorithm in reinforcement learning, sets a movement strategy conforming to the measuring robot in a return function for robot practice learning based on the learning capability of Q learning, takes action according to the current state of the measuring robot by using three elements of state (state), action and reward (reward), records the rewards fed back, and decides a better action which can be taken when the measuring robot is in the next state, thereby realizing that the measuring robot automatically acquires a movement path with reasonable sequence, namely a path with shortest movement time and least repeated route under the condition of known number and position of measuring points. The moving path planning method provided by the invention can realize global moving planning of multiple measuring points of the measuring robot in an indoor environment, help the measuring robot to complete work at the fastest speed without no need in the process of carrying out a measuring task, and greatly improve the measuring efficiency.
Drawings
FIG. 1 is a schematic diagram of the measurement target points identified from the floor plan;
FIG. 2 is a schematic diagram of the connectivity between measurement target points;
FIG. 3 is a flow chart of robot memory table training.
Detailed Description
The invention will be further described with reference to the drawings and the specific embodiments, but the scope of the invention is not limited thereto.
The invention relates to a method for planning a multi-target point moving path of an indoor measurement robot based on Q learning, which comprises the following steps:
step 1: the measuring robot identifies all the measurement target points shown in FIG. 1 from the floor plan, where each room is one measuring point marked with a numeric serial number; according to the positions of the measurement target points and the connectivity between them, FIG. 1 is simplified into the path schematic diagram shown in FIG. 2 (the numeric labels in FIGS. 1 and 2 denote room labels);
step 2: setting a return function that satisfies the multi-target-point efficient movement strategy of the measuring robot, namely, a measurement target point that has already been visited is no longer treated as a measurement target point, so the path previously divided by that point becomes one continuous path. The return function is set as follows: taking the direction the robot faces at the entrance as the reference direction, a direction from the starting point toward a measurement target point is defined as positive if its included angle with the reference direction is less than or equal to 90 degrees, and as negative otherwise; all measurement target points are connected by their actual paths (a route traversed in opposite directions counts as two paths, one positive and one negative), and the length of a path in the positive direction is a positive distance, otherwise a negative distance;
the set of measurement target points is set as the state space S and the set of all paths as the action space A; the path return value (reward) of each measurement target point is set as follows:
a path with the shortest positive distance between measurement target points has a reward value of 70;
a path in the positive direction between measurement target points has a reward value of 90;
a path in the positive direction with the shortest positive distance has a reward value of 110;
a path with the shortest negative distance between measurement target points has a reward value of 10;
a path in the negative direction between measurement target points has a reward value of 30;
a path in the negative direction with the shortest negative distance has a reward value of 50;
a path meeting none of these conditions has a reward value of 0;
and the return matrix matched to the floor plan is obtained from the return function.
Step 3: the measuring robot searches from one state (state) to another state (state) until reaching the target state; taking each exploration as one experience, wherein each experience comprises the process that the measuring robot reaches a target state (namely the last measuring target point position) from an initial state (namely the starting point position), and turning to the next exploration after each measuring robot reaches the target state;
a two-dimensional robot memory table Q_table is constructed according to all states and actions experienced by the measuring robot, the initial state of the Q_table is 0, under the condition that the environment is unknown, the Q_table is initially set to be only one element, and if a new state is found, the Q_table is added with a new row and a new column.
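The growth rule described above (start with a single element, then add a row and a column whenever a new state appears) can be sketched as a small wrapper class. This is an illustrative sketch; the class name and the list-of-lists layout are assumptions:

```python
class GrowingQTable:
    """Two-dimensional Q_table that starts with one element (one known state)
    and gains a new row and a new column whenever a new state is found."""

    def __init__(self, first_state):
        self.states = [first_state]
        self.q = [[0.0]]  # a single element initially, per the unknown-environment rule

    def _index(self, state):
        """Return the row/column index of a state, growing the table if new."""
        if state not in self.states:
            self.states.append(state)
            for row in self.q:
                row.append(0.0)                       # new column for every row
            self.q.append([0.0] * len(self.states))   # new row, all zeros
        return self.states.index(state)

    def get(self, s, a):
        return self.q[self._index(s)][self._index(a)]

    def set(self, s, a, value):
        self.q[self._index(s)][self._index(a)] = value
```

For example, constructing the table with one state and then writing a value for a second state grows the table from 1x1 to 2x2 automatically.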
Step 4: defining a Q function shown in the following formula (1) as a utility value function of the robot action for updating the Q_table:
Q(s,a) = R(s,a) + γV*    (1)
wherein S is a discrete bounded state space and s is a state in S; A is a discrete action space and a is an action in A; R is the return function, used to calculate the immediate return obtained when the measuring robot selects an action in the current state and transfers to the next state; the immediate return is determined by the current state and the selected action; γ is a discount factor that determines the relative weight of delayed return versus immediate return, where a larger γ gives delayed return greater importance; V* is the maximum return over all actions in the current state, V* = max[R(s,A)].
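Formula (1), with V* taken as the maximum Q value of the next state as in formula (2) below, amounts to a one-line update. The following is a minimal sketch, assuming a hypothetical dict-of-dicts layout (Q[state][action], R[state][action]) that is not in the patent:

```python
def q_update(Q, R, s, a, s_next, gamma=0.8):
    """One application of Q(s,a) <- R(s,a) + gamma * max Q(s_next, .).

    Q and R are dicts of dicts: Q[state][action] and R[state][action];
    gamma weights the delayed return against the immediate return R(s,a).
    """
    v_star = max(Q[s_next].values(), default=0.0)  # best delayed return available
    Q[s][a] = R[s][a] + gamma * v_star
    return Q[s][a]

# Example: immediate return 10 for moving 0 -> 1, and state 1 already
# holds Q(1, 2) = 50, so the update gives 10 + 0.5 * 50 = 35.
Q = {0: {1: 0.0}, 1: {2: 50.0}}
R = {0: {1: 10.0}}
q_update(Q, R, 0, 1, 1, gamma=0.5)
```

The `default=0.0` handles a next state with no recorded actions yet, consistent with the Q_table being initialized to 0.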
Step 5: the Q_table is updated according to the following formula (2):
Q(s_t, a_t) ← R(s_t, a_t) + γ·max[Q(s_{t+1}, A_{t+1})]    (2)
wherein t represents the current time;
updating the Q_table strengthens the measuring robot's judgment when moving: the more updates, the better the Q_table becomes. Once the Q_table has been sufficiently strengthened, the measuring robot can find the most appropriate target-point visiting order, i.e. the optimal movement path;
the update flow is shown in fig. 3, and the specific update steps are as follows:
s1: setting a parameter gamma and an environmental reward value in a return matrix;
s2: initializing Q_table to 0;
s3: determining an initial state of the measuring robot;
s4: selecting one possible action from all possible actions of the current state;
s5: using this possible action, the measuring robot reaches the next state;
s6: for the next state, obtaining the maximum Q value based on all possible actions thereof;
s7: replace the previous Q value according to the Q_table update formula;
s8: repeating the steps S4 to S7 until the Q_table converges.
The examples are preferred embodiments of the present invention, but the present invention is not limited to the above-described embodiments, and any obvious modifications, substitutions or variations that can be made by one skilled in the art without departing from the spirit of the present invention are within the scope of the present invention.
Claims (1)
1. The method for planning the multi-target point moving path of the indoor measurement robot based on Q learning is characterized by comprising the following steps:
step 1: the measuring robot identifies all measurement target points from the floor plan, and obtains a path schematic diagram from the positions of the measurement target points and the connectivity between them;
step 2: setting a return function. Taking the direction the robot faces at the entrance as the reference direction, a direction from the starting point toward a measurement target point is defined as positive if its included angle with the reference direction is less than or equal to 90 degrees, and as negative otherwise. All measurement target points are connected by their actual paths, where a route traversed in opposite directions counts as two paths, one positive and one negative; the length of a path in the positive direction is a positive distance, and otherwise a negative distance. The set of measurement target points is set as the state space S and the set of all paths as the action space A; a path return value (reward) is set for each measurement target point, and a return matrix matched to the floor plan is obtained from the return function;
step 3: the measuring robot explores by taking one action from one state to another until it reaches the target state, then turns to the next exploration; a two-dimensional robot memory table Q_table is constructed from all the states and actions the measuring robot experiences; the Q_table is initialized to 0 and initially has only one element, and whenever a new state is discovered a new row and a new column are added;
step 4: defining a Q function as a utility function of the robot action for updating the Q_table:
Q(s,a) = R(s,a) + γV*;
wherein s is one of the states in S; a is one of the actions in A; R is the return function, used to calculate the immediate return obtained when the measuring robot selects an action in the current state and transfers to the next state; the immediate return is determined by the current state and the selected action; γ is a discount factor that determines the relative weight of delayed return versus immediate return; V* is the maximum return over all actions in the current state, V* = max[R(s,A)];
Step 5: updating the Q_table according to the following formula to obtain an optimal moving path:
Q(s_t, a_t) ← R(s_t, a_t) + γ·max[Q(s_{t+1}, A_{t+1})]
wherein t represents the current time;
the return function satisfies the multi-target-point efficient movement strategy of the measuring robot, namely, a measurement target point that has already been visited is no longer treated as a measurement target point, and the path previously divided by that point becomes one continuous path;
the path return value (reward) of each measurement target point is set as follows:
a path with the shortest positive distance between measurement target points has a reward value of 70;
a path in the positive direction between measurement target points has a reward value of 90;
a path in the positive direction with the shortest positive distance has a reward value of 110;
a path with the shortest negative distance between measurement target points has a reward value of 10;
a path in the negative direction between measurement target points has a reward value of 30;
a path in the negative direction with the shortest negative distance has a reward value of 50;
a path meeting none of these conditions has a reward value of 0;
the Q_table updating step is as follows:
s1: setting a parameter gamma and an environmental reward value in a return matrix;
s2: initializing Q_table to 0;
s3: determining an initial state of the measuring robot;
s4: selecting one possible action from all possible actions of the current state;
s5: the measuring robot reaches the next state using the possible actions determined in S4;
s6: for the next state, obtaining the maximum Q value based on all possible actions thereof;
s7: replace the previous Q value according to the Q_table update formula;
s8: repeating the steps S4 to S7 until the Q_table converges.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210118037.9A CN114518758B (en) | 2022-02-08 | 2022-02-08 | Indoor measurement robot multi-target point moving path planning method based on Q learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114518758A CN114518758A (en) | 2022-05-20 |
CN114518758B true CN114518758B (en) | 2023-12-12 |
Family
ID=81596759
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210118037.9A Active CN114518758B (en) | 2022-02-08 | 2022-02-08 | Indoor measurement robot multi-target point moving path planning method based on Q learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114518758B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014211667A (en) * | 2013-04-17 | 2014-11-13 | 日本電信電話株式会社 | Robot cooperative conveyance planning device, method, and program |
CN105306176A (en) * | 2015-11-13 | 2016-02-03 | 南京邮电大学 | Realization method for Q learning based vehicle-mounted network media access control (MAC) protocol |
CN107610235A (en) * | 2017-08-21 | 2018-01-19 | 北京精密机电控制设备研究所 | A kind of mobile platform navigation method and apparatus based on deep learning |
CN109726866A (en) * | 2018-12-27 | 2019-05-07 | 浙江农林大学 | Unmanned boat paths planning method based on Q learning neural network |
CN110378439A (en) * | 2019-08-09 | 2019-10-25 | 重庆理工大学 | Single robot path planning method based on Q-Learning algorithm |
CN110673637A (en) * | 2019-10-08 | 2020-01-10 | 福建工程学院 | Unmanned aerial vehicle pseudo path planning method based on deep reinforcement learning |
CN111098852A (en) * | 2019-12-02 | 2020-05-05 | 北京交通大学 | Parking path planning method based on reinforcement learning |
CN112344944A (en) * | 2020-11-24 | 2021-02-09 | 湖北汽车工业学院 | Reinforced learning path planning method introducing artificial potential field |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11106211B2 (en) * | 2018-04-02 | 2021-08-31 | Sony Group Corporation | Vision-based sample-efficient reinforcement learning framework for autonomous driving |
Non-Patent Citations (2)
Title |
---|
Zhang Ning. Research on local path planning behavior of mobile robots based on reinforcement Q-learning and BP neural network. China Master's Theses Full-text Database, Information Science and Technology, 2021, pp. 1-62. *
Duan Jianmin. Research on a Q-Learning path planning algorithm using prior knowledge. Electronics Optics & Control, 2019, Vol. 26, No. 9, pp. 29-33. *
Also Published As
Publication number | Publication date |
---|---|
CN114518758A (en) | 2022-05-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||