CN112947412B

CN112947412B - Method for autonomously selecting advancing destination of vending robot

Info

Publication number: CN112947412B
Application number: CN202110104046.8A
Authority: CN
Inventors: 宋杰; 赵星辰; 冯晓月; 王蓓蕾
Original assignee: 东北大学
Priority date: 2021-01-26
Filing date: 2021-01-26
Publication date: 2023-09-26
Anticipated expiration: 2041-01-26
Also published as: CN112947412A

Abstract

The invention provides a method by which a vending robot autonomously selects its travel destination, and relates to the technical field of machine learning. In the method, the vending robot interacts with various scenes to obtain scene information, evaluates each destination it selects from the information returned, and, through repeated trial and error, finds the destination that brings it the maximum profit. The method applies a reinforcement learning algorithm to the destination-decision problem of a vending robot and has practical value: it overcomes the poor flexibility of fixed-position vending machines and also brings higher economic benefit.

Description

Method for autonomously selecting advancing destination of vending robot

Technical Field

The invention relates to the technical field of machine learning, in particular to a method for autonomously selecting a travel destination by a vending robot.

Background

With economic development and the rapid rise in living standards, people increasingly pursue a convenient way of life. Traditional markets and supermarkets rely on manual sales, which are slow and inefficient and struggle to meet the demands of fast-paced city life. The advent of vending machines has increased service efficiency and reduced labor intensity; during an epidemic they reduce contact between food and service staff, which improves food safety; and they can sell around the clock without occupying much space, which greatly facilitates daily life. According to statistics, in 2016 the United States had 690 thousand vending machines with an annual income of 251 million dollars, ranking first worldwide. Because of China's huge population base, the number of vending machines per capita in China is low. The Chinese vending machine market has nevertheless developed to a certain extent, and the variety of vending machines is growing. According to incomplete statistics, the domestic market has reached 3 million machines, with annual retail sales of 60 billion yuan. In summary, vending machines are set to become an industry with huge business opportunities and great market potential in China. In recent years many countries have devoted research to vending machines, but most designs are still traditional machines and the research still assumes a fixed position: the machines cannot move and cannot serve customers on the go, which greatly weakens their advantages of speed and convenience. A change in vending machines is therefore imperative.

In places that people frequent in daily life, such as schools, a vending robot that can predict where students often appear and plan a path to those destinations saves students the time spent walking to buy goods, which is a great convenience; it also sells more goods and brings higher profits to merchants.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a method by which a vending robot autonomously selects its travel destination. The method characterizes hot spots on the map and draws on the idea of the Q-learning reinforcement learning algorithm, so that the vending robot decides its destination autonomously.

In order to solve the technical problems, the invention adopts the following technical scheme: a method for autonomously selecting a travel destination by a vending robot, comprising the steps of:

Step 1: model and acquire the data related to the travel of the vending robot; these data comprise site data, pedestrian attribute data and vending robot attribute data;

Step 1.1: site data modeling; the attributes of the site comprise {length, width, Barrier, Charger}, where length and width are the length and width of the site; Barrier is the obstacle range in the site, i.e., an area where neither the robot nor pedestrians can walk, expressed as {$P_{top}$, $P_{down}$}, where $P_{top}$ and $P_{down}$ are the upper-left and lower-right corner coordinates of the Barrier, respectively; Charger is the position of the charging pile of the vending robot, which is also the robot's starting position, expressed as {x, y}, where x and y are the abscissa and ordinate of the Charger;

Step 1.2: pedestrian data modeling; the attributes of the pedestrians comprise {Acount, Aspeed, probability, Aview}, where Acount is the number of pedestrians in the site; Aspeed is the travel speed of a pedestrian; probability is the probability that a pedestrian makes a purchase at the vending robot; Aview is the field of view of a pedestrian, expressed as {radius, angle}, where radius and angle are the radius and central angle of the pedestrian's field of view, respectively;

Step 1.3: vending robot (Robot) data modeling; the attributes of the vending robot comprise {Rcount, Rspeed, Rview, Electricity}, where Rcount is the number of vending robots in the site; Rspeed and Rview are the travel speed and field of view of the vending robot, respectively; Electricity is the attribute related to the robot's battery, expressed as {consumeSpeed, chargeSpeed, electricity}, where consumeSpeed and chargeSpeed are the power consumption speed and charging speed of the vending robot, respectively, and electricity is its remaining charge;
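As an illustrative sketch only, the three attribute groups of steps 1.1 to 1.3 could be held in plain data classes such as the following. The class and field names are hypothetical snake_case renderings chosen for this example, the barrier corners are made-up values, and the central angle is assumed to be in degrees; only the 10×10 site and the robot view of radius 2 and angle 120 come from the embodiment.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[int, int]


@dataclass
class Barrier:
    top: Point   # upper-left corner of the obstacle range
    down: Point  # lower-right corner of the obstacle range


@dataclass
class View:
    radius: float
    angle: float  # central angle; degrees is an assumption


@dataclass
class Site:
    length: int
    width: int
    barrier: Barrier
    charger: Point  # charging pile, also the robot's start position


@dataclass
class Pedestrians:
    count: int
    speed: float
    probability: float  # probability of purchasing at the robot
    view: View


@dataclass
class Battery:
    consume_speed: float
    charge_speed: float
    remaining: float


@dataclass
class Robots:
    count: int
    speed: float
    view: View
    battery: Battery


# Hypothetical values, except the 10x10 site and the robot view (2, 120).
site = Site(length=10, width=10,
            barrier=Barrier(top=(3, 3), down=(6, 6)), charger=(0, 0))
robots = Robots(count=1, speed=1.0, view=View(radius=2, angle=120),
                battery=Battery(consume_speed=1.0, charge_speed=5.0,
                                remaining=100.0))
```

Keeping the battery and view attributes in their own small classes mirrors the nested {…} groups of the text, so each group can be passed around independently.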

Step 1.4: acquire site data; after all parameters related to the site have been set, the width and length of the site are input to the vending robot, which then walks randomly within the given site and continuously acquires site data, i.e., updates its cognition of the site; this cognition is expressed mathematically as shown in equation (1):

where -1 indicates that the coordinate point has not been explored and is a position of interest to the vending robot, and 0 and 1 indicate that the coordinate point has been explored: 0 is a non-walkable obstacle area and 1 is a walkable area; in equation (1), the left matrix represents the robot's initial cognition of the site, and the right matrix represents its updated cognition after walking for a period of time;

Step 1.5: judge whether the data acquisition stage is finished; specifically, calculate the proportion λ of the area the vending robot has explored to the total area of the site; if λ is greater than a set threshold, the vending robot enters the autonomous decision stage; otherwise, return to step 1.4 and continue acquiring data by random walk;
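The check of step 1.5 can be sketched as follows, assuming the site cognition is stored as a list of rows using the -1 / 0 / 1 encoding described for equation (1). The function names and the small 3×3 grid are hypothetical and serve only as an illustration.

```python
from typing import List

UNEXPLORED, OBSTACLE, WALKABLE = -1, 0, 1


def explored_ratio(grid: List[List[int]]) -> float:
    """Proportion (lambda) of cells that are no longer marked -1."""
    total = sum(len(row) for row in grid)
    explored = sum(1 for row in grid for cell in row if cell != UNEXPLORED)
    return explored / total


def acquisition_finished(grid: List[List[int]], threshold: float) -> bool:
    """True when the explored proportion exceeds the set threshold."""
    return explored_ratio(grid) > threshold


# Hypothetical 3x3 cognition: 7 of 9 cells explored.
grid = [
    [1, 1, -1],
    [1, 0, -1],
    [1, 1, 1],
]
print(acquisition_finished(grid, 0.5))  # 7/9 exceeds 0.5
print(acquisition_finished(grid, 0.8))  # 7/9 does not exceed 0.8
```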

Step 2: set the hot spot features, determine the hot spots in the site, and have the vending robot automatically discover and record them;

determining the hot spots in the site; a hot spot is a position that the robot tends to choose for selling, and a coordinate point with one of the following four features is called a hot spot: (1) the position of the charging pile, ChargePoint (CP); (2) a position where a pedestrian once purchased goods at the vending robot, SalePoint (SP); (3) a position a pedestrian has walked through, ActorPoint (AP); (4) a position the vending robot did not explore during the random walk, also called a position of interest to the robot, InterestPoint (IP); feature (1) has the highest priority, features (2) and (3) come next, and feature (4) has the lowest;

the vending robot automatically discovers and records hot spots; the position of the charging pile is known as the starting point of the vending robot; a position AP that a pedestrian has walked through is recorded as a coordinate point at which the camera of the vending robot detects a pedestrian; a position SP where a pedestrian purchased goods at the vending robot is recorded as the coordinate point at which the cabinet door of the vending robot was opened and the transaction completed; the positions of interest IP of the vending robot are known;
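The four hot spot features and their priorities can be illustrated with a small classifier. This is a sketch under stated assumptions: the names `HotSpot` and `classify` are hypothetical, and ranking SP strictly above AP is a choice made for this example (the text only places both between CP and IP).

```python
from enum import IntEnum
from typing import Optional, Set, Tuple

Point = Tuple[int, int]


class HotSpot(IntEnum):
    # Smaller value means higher priority.
    CP = 1  # ChargePoint: position of the charging pile
    SP = 2  # SalePoint: a pedestrian once purchased here
    AP = 3  # ActorPoint: a pedestrian has walked through here
    IP = 4  # InterestPoint: not explored during the random walk


def classify(point: Point, charger: Point, sale_points: Set[Point],
             actor_points: Set[Point],
             interest_points: Set[Point]) -> Optional[HotSpot]:
    """Return the highest-priority feature of a point, or None."""
    kinds = []
    if point == charger:
        kinds.append(HotSpot.CP)
    if point in sale_points:
        kinds.append(HotSpot.SP)
    if point in actor_points:
        kinds.append(HotSpot.AP)
    if point in interest_points:
        kinds.append(HotSpot.IP)
    return min(kinds) if kinds else None


# A point that is both SP and AP is reported as SP.
kind = classify((2, 2), charger=(0, 0), sale_points={(2, 2)},
                actor_points={(2, 2), (2, 3)}, interest_points=set())
print(kind.name)
```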

Step 3: establish a hot spot decision model, represented by the triplet $\langle P_{current}, P_{target}, T \rangle$, where $P_{current}$ is the sequence of coordinates $(x_{current}, y_{current})$ of all positions the vending robot can currently walk to, $P_{target}$ is the sequence of target-position coordinates $(x_{target}, y_{target})$, and $T$ is the tendency from the current position to the target position, with $T = P_{current} \times P_{target}$ (one tendency value for each pair of current and target positions); the dynamic process of the decision model is as follows: the vending robot is at a current position $p_{c_0} \in P_{current}$ and autonomously decides on a destination position $p_{t_0} \in P_{target}$; after the vending robot moves to the target position, the target position becomes the robot's current position $p_{c_1} \in P_{current}$, and the robot receives the tendency $T_0$ as feedback;

the tendency expresses how strongly the vending robot is inclined, when deciding at the current position, to go to the target position, and measures the profit the robot may obtain by moving from the current position to the target position; the tendency depends on the distance $L$ between $P_{current}$ and $P_{target}$, on whether $P_{target}$ is a sales history point ($H_{sp}$) or a crowd history point ($H_{ap}$), and on the number $H_{ip}$ of interest points around $P_{target}$, as shown in the following formula:

$$T = \eta_1 \cdot L + \eta_2 \cdot H + \eta_3 \cdot H_{ip} \tag{2}$$

where $H = \max\{H_{sp}, H_{ap}, 0\}$; $\eta_1$, $\eta_2$ and $\eta_3$ are the weights of the three influencing factors, and $\eta_2$ is the largest;

Step 4: establish and update a decision table; the vending robot trains on the tendency matrix to obtain a decision table that guides its actions;

Step 4.1: establish the tendency matrix and the decision table for the actions of the vending robot; the tendency matrix has $P_{current}$ as rows and $P_{target}$ as columns and describes the tendency of the vending robot from each current position to each target position; the decision table likewise has $P_{current}$ as rows and $P_{target}$ as columns, has the same order as the tendency matrix, is initialized to 0, and is updated according to the tendency matrix;

Step 4.2: update the decision table; each element of the decision table is called a decision value D; the optimal decision value equals the tendency from the current position to the next destination plus the maximum decision value from the next destination to the final destination; each decision value D in the decision table is updated as shown in the following formula:

$$D_t(p_{c_t}, p_{t_t}) = T_t(p_{c_t} \to p_{t_t}) + \gamma \max D_t(p_{c_{t+1}}, p_t) \tag{3}$$

where $D_t(p_{c_t}, p_{t_t})$ and $D_t(p_{c_{t+1}}, p_t)$ are, at the t-th update, the decision value of the vending robot from the current position coordinate to the destination coordinate and the decision value from the destination coordinate to the final destination coordinate, respectively; $p_{c_t}$ is the current position coordinate of the vending robot; $p_{t_t}$ is the destination coordinate decided at the current position coordinate; $p_{c_{t+1}}$ is the next destination coordinate after the decision; $p_t$ is the final destination coordinate of the vending robot's travel; $T_t$ is the tendency of the vending robot from the current position coordinate to the destination coordinate; $\gamma$ is the attenuation factor, $0 \le \gamma < 1$, which represents the importance of future decisions: when $\gamma = 0$ only the benefit of the current tendency is considered, and as $\gamma$ approaches 1 more weight is placed on the benefit of future decisions;

update formula (3) holds under the condition that the optimal decision obtains the maximum profit; while the decision table is still being built, however, the left and right sides of formula (3) are not equal, and there is an error given by the following formula:

$$\Delta D(p_{c_t}, p_{t_t}) = T_t(p_{c_t} \to p_{t_t}) + \gamma \max D_t(p_{c_{t+1}}, p_t) - D_t(p_{c_t}, p_{t_t}) \tag{4}$$

where $\Delta D(p_{c_t}, p_{t_t})$ is the update error of the decision value D;

thus, the process by which the vending robot iteratively updates the decision value, step by step, toward the optimal decision is expressed by equation (5):

$$D_{t+1}(p_{c_t}, p_{t_t}) = D_t(p_{c_t}, p_{t_t}) + \alpha \, \Delta D(p_{c_t}, p_{t_t}) \tag{5}$$

where $D_{t+1}(p_{c_t}, p_{t_t})$ is the decision value after the (t+1)-th iterative update and $\alpha$ is the learning rate, $0 < \alpha < 1$;

substituting equation (4) into equation (5) gives the final update rule for the decision table, as shown in equation (6):

$$D_{t+1}(p_{c_t}, p_{t_t}) = D_t(p_{c_t}, p_{t_t}) + \alpha \left[ T_t(p_{c_t} \to p_{t_t}) + \gamma \max D_t(p_{c_{t+1}}, p_t) - D_t(p_{c_t}, p_{t_t}) \right] \tag{6}$$
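The update rule of step 4.2 can be sketched in a few lines. The sketch assumes that current positions and target positions index the same set of walkable positions (so the tendency matrix is square), that arriving at target `a` makes row `a` the next current position, and that training sweeps over all entries until the table stops changing; the function names, the 2×2 tendency matrix and the parameter values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def update_once(D: Matrix, T: Matrix, s: int, a: int,
                alpha: float, gamma: float) -> float:
    """Update D[s][a] in place and return the absolute change."""
    # Error term: tendency plus discounted best value of the next row,
    # minus the current decision value.
    error = T[s][a] + gamma * max(D[a]) - D[s][a]
    D[s][a] += alpha * error
    return abs(alpha * error)


def train(T: Matrix, alpha: float = 0.5, gamma: float = 0.8,
          tol: float = 1e-9, max_sweeps: int = 10_000) -> Matrix:
    """Sweep over the whole table until no entry changes by more than tol."""
    n = len(T)
    D = [[0.0] * n for _ in range(n)]  # decision table initialized to 0
    for _ in range(max_sweeps):
        largest = 0.0
        for s in range(n):
            for a in range(n):
                largest = max(largest, update_once(D, T, s, a, alpha, gamma))
        if largest < tol:
            break
    return D


# Hypothetical 2x2 tendency matrix.
T_small = [[0.0, 1.0],
           [2.0, 0.0]]
D_small = train(T_small, alpha=0.5, gamma=0.5)
print([[round(v, 4) for v in row] for row in D_small])
```

For this toy matrix the table settles at the fixed point of equation (3), where D[0][1] = 8/3 and D[1][0] = 10/3; because γ < 1 the sweep is a contraction, which is why the loop terminates.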

Step 5: according to the decision table, the vending robot selects as its destination the column with the maximum decision value in the row corresponding to its current position; it then plans a walking path using the A* shortest-path algorithm;
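The selection in step 5 amounts to an arg-max over one row of the decision table. A minimal sketch follows; it assumes the table is a list of rows indexed by position, the name `select_destination` is hypothetical, and ties are resolved in favour of the first maximal column.

```python
from typing import List


def select_destination(D: List[List[float]], current: int) -> int:
    """Return the column index with the largest decision value in row `current`."""
    row = D[current]
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical 3x3 decision table.
D_example = [
    [0.0, 2.5, 1.0],
    [3.0, 0.0, 3.0],
    [0.5, 0.2, 0.0],
]
print(select_destination(D_example, 0))  # column 1 holds the row maximum 2.5
print(select_destination(D_example, 1))  # tie between columns 0 and 2, first wins
```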

Step 6: when the number of hot spots recorded by the vending robot has grown to μ times the original number, re-execute step 4 to recompute the tendency matrix, until the tendency matrix no longer changes or its change is smaller than a set threshold.
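The trigger of step 6 can be pictured as a small counter check. In this sketch the class name `HotSpotMonitor` is hypothetical, and resetting the baseline to the current count after each trigger is an assumption about what "the original number" refers to; the stopping condition on the tendency matrix is not modelled.

```python
class HotSpotMonitor:
    """Signals retraining when the hot spot count reaches mu times the baseline."""

    def __init__(self, baseline: int, mu: float) -> None:
        self.baseline = baseline
        self.mu = mu

    def should_retrain(self, recorded: int) -> bool:
        if self.baseline > 0 and recorded >= self.mu * self.baseline:
            # Assumption: the new count becomes the reference for the next round.
            self.baseline = recorded
            return True
        return False


monitor = HotSpotMonitor(baseline=4, mu=1.5)
print(monitor.should_retrain(5))  # 5 is below 1.5 * 4
print(monitor.should_retrain(6))  # 6 reaches 1.5 * 4
```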

The beneficial effects of the above technical solution are as follows: in the method provided by the invention, the vending robot establishes an interactive relationship with a complex, dynamic environment; the robot finds and records hot spots in the environment according to the hot spot features and receives the current tendency as feedback; it evaluates each selected destination according to the tendency; after trying each destination position many times it learns the optimal policy, and from then on it can decide, each time, on the destination that brings the maximum profit.

Drawings

Fig. 1 is a flowchart of the method for autonomously selecting a travel destination by a vending robot according to an embodiment of the present invention;

Fig. 2 shows the initial data related to the travel of the vending robot according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the site according to an embodiment of the present invention;

Fig. 4 shows the travel routes of the vending robot and the pedestrians according to an embodiment of the present invention;

Fig. 5 shows the set of positions with larger decision values in the decision table according to an embodiment of the present invention.

Detailed Description

The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.

In this embodiment, a method for autonomously selecting a travel destination by a vending robot, as shown in fig. 1, includes the following steps:

Step 1: model and acquire the data related to the travel of the vending robot;

Step 1.1: site (Map) data modeling; the attributes of the site comprise {length, width, Barrier, Charger}, where length and width are the length and width of the site; Barrier is the obstacle range in the site, i.e., an area where neither the robot nor pedestrians can walk, expressed as {$P_{top}$, $P_{down}$}, where $P_{top}$ and $P_{down}$ are the upper-left and lower-right corner coordinates of the Barrier, respectively; Charger is the position of the charging pile of the vending robot, which is also the robot's starting position, expressed as {x, y}, where x and y are the abscissa and ordinate of the Charger;

Step 1.2: pedestrian (Actor) data modeling; the robot's decisions are largely determined by the behavior of people, so the invention needs to simulate the attributes and travel trajectories of pedestrians. The attributes of the pedestrians comprise {Acount, Aspeed, probability, Aview}, where Acount is the number of pedestrians in the site; Aspeed is the travel speed of a pedestrian; probability is the probability that a pedestrian makes a purchase at the vending robot; Aview is the field of view of a pedestrian, expressed as {radius, angle}, where radius and angle are the radius and central angle of the pedestrian's field of view, respectively;

Step 1.5: judge whether the data acquisition stage is finished; specifically, calculate the proportion λ of the area the vending robot has explored to the total area of the site; if λ is greater than a set threshold, most of the site has been explored and the vending robot enters the autonomous decision stage; otherwise the known area is still too small, so return to step 1.4, and the vending robot continues acquiring data by random walk;

To demonstrate the decision and travel process of the vending robot more clearly, this embodiment models a specific vending scenario. The modeled objects are the site, the pedestrians and the vending robot, whose specific attributes are shown in Fig. 2. The embodiment uses a simple 10×10 terrain; the initial scene and coordinates are shown in Fig. 3, where the hatched part represents an area occupied by an obstacle, through which nothing can move; the white part represents the walkable area; and Start is the initial position of the robot, i.e., the position of the charging pile.

the vending robot automatically discovers and records hot spots; the position of the charging pile is known as the starting point of the vending robot; a position AP that a pedestrian has walked through is recorded as a coordinate point at which the camera of the vending robot detects a pedestrian; a position SP where a pedestrian purchased goods at the vending robot is recorded as the coordinate point at which the cabinet door of the vending robot was opened and the transaction completed; the positions of interest IP of the vending robot are known from step 1;

The reasons for these priorities are as follows: (1) keeping the robot alive is most important: at every decision the robot needs to calculate whether its remaining charge can cover the distance from its current position to the next decided destination and from there to the charging pile; otherwise it may stop midway and cause inconvenience; (2) positions with a past sales history or a history of crowds gathering have a higher probability of selling goods; (3) pedestrians may often appear at positions outside the robot's known range, so interest points should also be hot spots; however, since there is no evidence to support them, they have little influence on the result, and their priority is therefore the lowest.
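Reason (1) can be made concrete with a small feasibility check. The sketch below rests on several assumptions that the text does not state: travel distance is approximated by the Manhattan distance, the consumption speed is energy per unit time, and the travel speed is cells per unit time. The function names and the numbers are hypothetical.

```python
from typing import Tuple

Point = Tuple[int, int]


def manhattan(a: Point, b: Point) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def can_afford(pos: Point, dest: Point, charger: Point, remaining: float,
               consume_speed: float, speed: float) -> bool:
    """True if the charge covers pos -> dest -> charger (Manhattan estimate)."""
    distance = manhattan(pos, dest) + manhattan(dest, charger)
    energy_needed = distance / speed * consume_speed
    return energy_needed <= remaining


# From (8, 0) to (2, 2) and back to a charger at (0, 0): 8 + 4 = 12 cells.
print(can_afford((8, 0), (2, 2), (0, 0), remaining=20.0,
                 consume_speed=1.0, speed=1.0))
print(can_afford((8, 0), (2, 2), (0, 0), remaining=10.0,
                 consume_speed=1.0, speed=1.0))
```

A real site with obstacles would need the length of the planned path instead of the Manhattan estimate, which is only a lower bound.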

In this embodiment, to gain some knowledge of the terrain, the vending robot performs a random walk to acquire data. The routes are shown in Fig. 4, where the dotted lines are the walking routes of the pedestrians and the solid line is the walking route of the vending robot, with destinations denoted $D_i$, $i \in \{1, 2, 3, \ldots\}$. The robot's knowledge of the terrain during the walk is determined by its field of view; this embodiment sets the radius of the robot's field of view to 2 and the central angle to 120°. For example, when the vending robot is at (0, 1) and about to move to the right, its visible region comprises the coordinate points (1, 0), (1, 1), (1, 2) and (2, 1); all other positions are outside its field of view.
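The visible region in the example above can be reproduced with a simple sector test. This is a sketch under assumptions: the field of view is taken as a circular sector centred on the heading direction, the central angle is in degrees, a cell is visible when its centre lies inside the sector, and occlusion by obstacles is ignored. The function name and the mapping of x to width and y to length are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[int, int]


def visible_cells(pos: Point, heading: Tuple[int, int], radius: float,
                  angle_deg: float, width: int, length: int) -> List[Point]:
    """Cells whose centres lie within the view sector of a robot at `pos`."""
    half_angle = math.radians(angle_deg) / 2
    hx, hy = heading
    heading_norm = math.hypot(hx, hy)
    reach = int(math.floor(radius))
    cells = []
    for dx in range(-reach, reach + 1):
        for dy in range(-reach, reach + 1):
            if dx == 0 and dy == 0:
                continue  # the robot's own cell is not counted
            x, y = pos[0] + dx, pos[1] + dy
            if not (0 <= x < width and 0 <= y < length):
                continue  # outside the site
            dist = math.hypot(dx, dy)
            if dist > radius:
                continue
            cos_a = (dx * hx + dy * hy) / (dist * heading_norm)
            offset = math.acos(max(-1.0, min(1.0, cos_a)))
            if offset <= half_angle + 1e-9:
                cells.append((x, y))
    return sorted(cells)


# Robot at (0, 1) heading right (+x), radius 2, central angle 120 degrees.
print(visible_cells((0, 1), (1, 0), 2, 120, 10, 10))
# [(1, 0), (1, 1), (1, 2), (2, 1)]
```

With these assumptions the sketch returns exactly the four coordinate points listed in the embodiment.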

While the vending robot is moving, pedestrians are moving as well, and some of them purchase goods from the robot. This embodiment simulates three pedestrian walking routes, with starting points A1, A2 and A3 and destination Da. During this period the vending robot records the following hot spots: the AP coordinates include (2, 3) and (2, 5); the SP coordinates include (2, 2) and (8, 4). The movement trajectories of the pedestrians and the robot, together with the points of interest, are shown in Fig. 4.

Each time the vending robot reaches a destination, the area S it has explored so far is calculated, and it is determined whether the ratio of S to the total terrain area exceeds a threshold. The threshold should be at least 0.5 but not too large; otherwise the wandering phase becomes too long and may never end. This embodiment therefore sets the threshold to 0.8. On reaching destination D4, S = 86, i.e., 86/100 = 0.86 > 0.8 on the 10×10 terrain, and the random-walk phase ends. The robot's cognition of the terrain at this time is shown in the following matrix, and the IP hot spot coordinates are the positions of the matrix elements equal to -1.

the tendency expresses how strongly the vending robot is inclined, when deciding at the current position, to go to the target position, and measures the profit the robot may obtain by moving from the current position to the target position; the higher the tendency, the more likely the vending robot is to sell goods and the higher the resulting profit; the tendency depends on the distance $L$ between $P_{current}$ and $P_{target}$, on whether $P_{target}$ is a sales history point ($H_{sp}$) or a crowd history point ($H_{ap}$), and on the number $H_{ip}$ of interest points around $P_{target}$, as shown in the following formula:

$$T = \eta_1 \cdot L + \eta_2 \cdot H + \eta_3 \cdot H_{ip} \tag{2}$$

where $H = \max\{H_{sp}, H_{ap}, 0\}$; $\eta_1$, $\eta_2$ and $\eta_3$ are the weights of the three influencing factors, and $\eta_2$ is the largest, because hot spots are the robot's most direct basis for judging whether a position is one where people frequently appear and purchase goods; the influence of each kind of hot spot differs according to its priority: feature SP has the largest influence, feature AP the next, and feature IP the least, hence $H = \max\{H_{sp}, H_{ap}, 0\}$;

In this embodiment, $\eta_1 : \eta_2 : \eta_3 = 0.3 : 0.5 : 0.2$. Next, $L$, $H$ and $H_{ip}$ are determined. $L$ is based on the Manhattan distance, but since the tendency should fall as the distance grows, $L$ is calculated as the difference between the longest (diagonal) distance on the map and the Manhattan distance, i.e., $L = width + length - (|x_{current} - x_{target}| + |y_{current} - y_{target}|)$, where length is the length of the map, width is the width of the map, $x_{current}$ and $y_{current}$ are the abscissa and ordinate of the current position, and $x_{target}$ and $y_{target}$ are the abscissa and ordinate of the target position; $H_{ip}$ is the number of interest points, i.e., the number of positions around the destination that are still unexplored in the robot's cognition of the terrain, with a theoretical maximum of 8 (the 8 surrounding directions); $H$ is the tendency of the hot spot, and this embodiment sets the tendency of a sales history point $H_{sp}$ to 10 and that of a crowd history point $H_{ap}$ to 5.
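The tendency of formula (2) can be computed as sketched below, following the formula for $L$ exactly as written above and the embodiment's weights and hot spot values. The function names, the `grid[y][x]` storage layout and the fully explored 10×10 grid are assumptions made for this example.

```python
from typing import List, Set, Tuple

Point = Tuple[int, int]
UNEXPLORED = -1


def unexplored_neighbours(grid: List[List[int]], p: Point) -> int:
    """H_ip: number of the 8 surrounding cells still marked -1 (grid[y][x])."""
    x0, y0 = p
    count = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            x, y = x0 + dx, y0 + dy
            if 0 <= y < len(grid) and 0 <= x < len(grid[0]):
                if grid[y][x] == UNEXPLORED:
                    count += 1
    return count


def tendency(current: Point, target: Point, grid: List[List[int]],
             sale_points: Set[Point], crowd_points: Set[Point],
             eta=(0.3, 0.5, 0.2), h_sp: float = 10.0,
             h_ap: float = 5.0) -> float:
    """T = eta1 * L + eta2 * H + eta3 * H_ip."""
    length, width = len(grid), len(grid[0])
    dist = abs(current[0] - target[0]) + abs(current[1] - target[1])
    L = width + length - dist
    H = max(h_sp if target in sale_points else 0.0,
            h_ap if target in crowd_points else 0.0,
            0.0)
    H_ip = unexplored_neighbours(grid, target)
    return eta[0] * L + eta[1] * H + eta[2] * H_ip


# Assumed fully explored, obstacle-free 10x10 grid; hot spots from the embodiment.
grid = [[1] * 10 for _ in range(10)]
sp, ap = {(2, 2), (8, 4)}, {(2, 3), (2, 5)}
print(round(tendency((8, 0), (2, 2), grid, sp, ap), 2))  # 0.3*12 + 0.5*10
```

With offset width + length = 20 this yields 8.6 for the pair (8, 0) to (2, 2). The tendency row listed later in the embodiment gives 8 for the same pair, which appears consistent with an offset of 18 (the largest Manhattan distance between two cells of a 10×10 grid) rather than 20; the sketch follows the formula as written.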

Step 4: establish and update a decision table; the vending robot trains on the tendency matrix to obtain a decision table that guides its actions, i.e., the decision table is used to decide which destination is most beneficial to the vending robot;

$$D_t(p_{c_t}, p_{t_t}) = T_t(p_{c_t} \to p_{t_t}) + \gamma \max D_t(p_{c_{t+1}}, p_t) \tag{3}$$

where $D_t(p_{c_t}, p_{t_t})$ and $D_t(p_{c_{t+1}}, p_t)$ are, at the t-th update, the decision value of the vending robot from the current position coordinate to the destination coordinate and the decision value from the destination coordinate to the final destination coordinate, respectively; $p_{c_t}$ is the current position coordinate of the vending robot; $p_{t_t}$ is the destination coordinate decided at the current position coordinate; $p_{c_{t+1}}$ is the next destination coordinate after the decision; $p_t$ is the final destination coordinate of the vending robot's travel; $T_t$ is the tendency of the vending robot from the current position coordinate to the destination coordinate; $\gamma$ is the attenuation factor, $0 \le \gamma < 1$, which represents the importance of future decisions: when $\gamma = 0$ only the benefit of the current tendency is considered, and as $\gamma$ approaches 1 more weight is placed on the benefit of future decisions;

$$\Delta D(p_{c_t}, p_{t_t}) = T_t(p_{c_t} \to p_{t_t}) + \gamma \max D_t(p_{c_{t+1}}, p_t) - D_t(p_{c_t}, p_{t_t}) \tag{4}$$

where $\Delta D(p_{c_t}, p_{t_t})$ is the update error of the decision value D;

$$D_{t+1}(p_{c_t}, p_{t_t}) = D_t(p_{c_t}, p_{t_t}) + \alpha \, \Delta D(p_{c_t}, p_{t_t}) \tag{5}$$

where $D_{t+1}(p_{c_t}, p_{t_t})$ is the decision value after the iterative update and $\alpha$ is the learning rate, $0 < \alpha < 1$; the closer $\alpha$ is to 1, the less of the previously learned value is retained;

Because the robot's destination decision has no natural termination condition (the robot can keep operating indefinitely as long as it is not damaged), the condition chosen for ending the update is that the decision table no longer changes.

In this embodiment, a 100×100 tendency matrix is established from the robot's cognition of the map. Because the matrix is too large to display conveniently, this embodiment shows only the 1×100 tendency row from the robot's position (8, 0) at the end of the random walk to the other positions, as shown below; the tendency of areas the vending robot considers unreachable is 0, the other entries are calculated according to the formula of step 3, and a decision table of the same order is initialized to 0.

[3, 0, 0, 0, 4.2, 0, 0, 0, 0, 5.1, 2.9, 3.2, 3.3, 3.6, 3.9, 4.2, 4.5, 4.8, 5.1, 0, 0, 2.7, 8, 3.3, 3.6, 3.9, 4.2, 4.5, 4.8, 0, 0, 2.4, 5.2, 0, 0, 0, 0, 4.2, 4.5, 4.4, 0, 2.1, 2.4, 0, 0, 0, 0, 3.9, 9.2, 4.3, 1.5, 1.8, 4.6, 0, 0, 0, 0, 3.6, 3.9, 4, 0, 1.5, 1.8, 0, 0, 0, 0, 3.3, 3.6, 3.7, 0, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3, 3.3, 3.4, 0, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3, 3.3, 0.3, 0.6, 0, 1.4, 1.9, 2.2, 2.5, 0, 3.1, 2.8]

The decision table is then computed from the tendency matrix; because the destination decision places more weight on future trends, γ = 0.8 is used. The final result, reached when the decision table no longer changes, is as follows:

[39.8, -1, -1, -1, 41, -1, -1, -1, 36.8, 41.9, 39.7, 40, 40.1, 40.4, 40.7, 41, 41.3, 41.6, 41.9, -1, -1, 39.5, 47.8, 40.1, 40.4, 40.7, 41, 41.3, 41.6, -1, -1, 39.2, 44.5, -1, -1, -1, -1, 41, 41.3, 41.2, -1, 38.9, 41.1, -1, -1, -1, -1, 40.7, 46, 41.1, 38.3, 38.6, 42.9, -1, -1, -1, -1, 40.4, 40.7, 40.8, -1, 38.3, 38.6, -1, -1, -1, -1, 40.1, 40.4, 40.5, -1, 38, 38.3, 38.6, 38.9, 39.2, 39.5, 39.8, 40.1, 40.2, -1, 37.7, 38, 38.3, 38.6, 38.9, 39.2, 39.5, 39.8, 40.1, 37.1, 37.4, -1, 38.2, 38.7, 39, 39.3, -1, 39.9, 39.6].

Step 5: according to the decision table, the vending robot selects as its destination the column with the maximum decision value in the row corresponding to its current position; it then plans a walking path using the A* shortest-path algorithm, so as to save charge and increase the robot's effective running time;
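The path planning of step 5 can be illustrated with a compact A* search on the cognition grid. The sketch assumes 4-connected moves of unit cost, the Manhattan distance as heuristic, and a `grid[y][x]` layout in which only cells equal to 1 are walkable; the function name and the 3×3 example grid are hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Point = Tuple[int, int]


def a_star(grid: List[List[int]], start: Point,
           goal: Point) -> Optional[List[Point]]:
    """Shortest 4-connected path over cells equal to 1, or None if unreachable."""

    def h(p: Point) -> int:
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    def walkable(p: Point) -> bool:
        x, y = p
        return 0 <= y < len(grid) and 0 <= x < len(grid[0]) and grid[y][x] == 1

    if not (walkable(start) and walkable(goal)):
        return None
    open_heap = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    best: Dict[Point, int] = {start: 0}
    parent: Dict[Point, Point] = {}
    while open_heap:
        _, cost, cur = heapq.heappop(open_heap)
        if cur == goal:
            path = [cur]
            while cur in parent:
                cur = parent[cur]
                path.append(cur)
            return path[::-1]
        if cost > best[cur]:
            continue  # stale heap entry
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not walkable(nxt):
                continue
            new_cost = cost + 1
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = cur
                heapq.heappush(open_heap, (new_cost + h(nxt), new_cost, nxt))
    return None


# Hypothetical 3x3 grid with a wall in the middle row.
grid = [
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
]
path = a_star(grid, (0, 0), (0, 2))
print(path)
```

Because the Manhattan heuristic never overestimates the remaining cost on a 4-connected unit-cost grid, the first time the goal is popped the returned path is a shortest one.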

Step 6: when the number of hot spots recorded by the vending robot has grown to μ times the original number, re-execute step 4 to recompute the tendency matrix, until the tendency matrix no longer changes or its change is smaller than a set threshold. The reason is that, as the robot continues to explore the scene, more hot spot data are collected and the previous tendencies no longer apply. The decision table must also be retrained whenever the tendency matrix is updated.

In this embodiment, when the vending robot makes its first decision at destination D4, the column with the maximum value in the corresponding row of the decision table is column 22, so the vending robot autonomously decides on D5 (2, 2) as its destination. Judging from the tendency matrix alone, one might expect the robot to choose (8, 5), since this is the hot spot nearest to the robot; instead the robot chooses the distant point D5, because D5 is the position pedestrians visit most and yields higher profit in the long term.

Analysis of the decision table shows that the destinations with larger decision values share a common feature: they lie close to positions where pedestrians frequently appear or shop, or close to areas the robot has not yet explored; in other words, the robot gives more weight to positions that pay off in the long run when it makes a decision. The positions with larger decision values are the areas circled by thick lines in Fig. 5.

In this embodiment, after the vending robot has operated for a period of time, steps 4 and 5 are re-executed: the tendency matrix is updated and the decision table is iteratively recomputed, and the destination of the new decision is found to be more advantageous than the previous one.

Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, and some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope defined by the claims of the present invention.

Claims

1. A method for autonomously selecting a travel destination by a vending robot, characterized by comprising the following steps:

Step 3: establish a hot spot decision model, represented by the triplet $\langle P_{current}, P_{target}, T \rangle$, where $P_{current}$ is the sequence of coordinates $(x_{current}, y_{current})$ of all positions the vending robot can currently walk to, $P_{target}$ is the sequence of target-position coordinates $(x_{target}, y_{target})$, and $T$ is the tendency from the current position to the target position, with $T = P_{current} \times P_{target}$; the dynamic process of the decision model is as follows: the vending robot is at a current position $p_{c_0} \in P_{current}$ and autonomously decides on a destination position $p_{t_0} \in P_{target}$; after the vending robot moves to the target position, the target position becomes the robot's current position $p_{c_1} \in P_{current}$, and the robot receives the tendency $T_0$ as feedback;

$$D_t(p_{c_t}, p_{t_t}) = T_t(p_{c_t} \to p_{t_t}) + \gamma \max D_t(p_{c_{t+1}}, p_t) \tag{3}$$

where $D_t(p_{c_t}, p_{t_t})$ and $D_t(p_{c_{t+1}}, p_t)$ are, at the t-th update, the decision value of the vending robot from the current position coordinate to the destination coordinate and the decision value from the destination coordinate to the final destination coordinate, respectively; $p_{c_t}$ is the current position coordinate of the vending robot; $p_{t_t}$ is the destination coordinate decided at the current position coordinate; $p_{c_{t+1}}$ is the next destination coordinate after the decision; $p_t$ is the final destination coordinate of the vending robot's travel; $T_t$ is the tendency of the vending robot from the current position coordinate to the destination coordinate; $\gamma$ is the attenuation factor, $0 \le \gamma < 1$, which represents the importance of future decisions: when $\gamma = 0$ only the benefit of the current tendency is considered, and as $\gamma$ approaches 1 more weight is placed on the benefit of future decisions;

$$\Delta D(p_{c_t}, p_{t_t}) = T_t(p_{c_t} \to p_{t_t}) + \gamma \max D_t(p_{c_{t+1}}, p_t) - D_t(p_{c_t}, p_{t_t}) \tag{4}$$

where $\Delta D(p_{c_t}, p_{t_t})$ is the update error of the decision value D;

$$D_{t+1}(p_{c_t}, p_{t_t}) = D_t(p_{c_t}, p_{t_t}) + \alpha \, \Delta D(p_{c_t}, p_{t_t}) \tag{5}$$

where $D_{t+1}(p_{c_t}, p_{t_t})$ is the decision value after the (t+1)-th iterative update and $\alpha$ is the learning rate, $0 < \alpha < 1$;

$$D_{t+1}(p_{c_t}, p_{t_t}) = D_t(p_{c_t}, p_{t_t}) + \alpha \left[ T_t(p_{c_t} \to p_{t_t}) + \gamma \max D_t(p_{c_{t+1}}, p_t) - D_t(p_{c_t}, p_{t_t}) \right] \tag{6}$$

2. The method for autonomously selecting a travel destination by a vending robot according to claim 1, characterized in that the specific method of step 1 is as follows:

Step 1.5: judge whether the data acquisition stage is finished; specifically, calculate the proportion λ of the area the vending robot has explored to the total area of the site; if λ is greater than a set threshold, the vending robot enters the autonomous decision stage; otherwise, return to step 1.4, and the vending robot continues acquiring data by random walk.

3. The method for autonomously selecting a travel destination by a vending robot according to claim 2, characterized in that the specific method of step 2 is as follows:

the vending robot automatically discovers and records hot spots; the position of the charging pile is known as the starting point of the vending robot; a position AP that a pedestrian has walked through is recorded as a coordinate point at which the camera of the vending robot detects a pedestrian; a position SP where a pedestrian purchased goods at the vending robot is recorded as the coordinate point at which the cabinet door of the vending robot was opened and the transaction completed; the positions of interest IP of the vending robot are known.

4. The method for autonomously selecting a travel destination by a vending robot according to claim 3, characterized in that, in step 3, the tendency expresses how strongly the vending robot is inclined, when deciding at the current position, to go to the target position, and measures the profit the robot may obtain by moving from the current position to the target position; the tendency depends on the distance $L$ between $P_{current}$ and $P_{target}$, on whether $P_{target}$ is a sales history point ($H_{sp}$) or a crowd history point ($H_{ap}$), and on the number $H_{ip}$ of interest points around $P_{target}$, as shown in the following formula:

$$T = \eta_1 \cdot L + \eta_2 \cdot H + \eta_3 \cdot H_{ip} \tag{2}$$

where $H = \max\{H_{sp}, H_{ap}, 0\}$; $\eta_1$, $\eta_2$ and $\eta_3$ are the weights of the three influencing factors, and $\eta_2$ is the largest.