DESCRIPTION
SEMI-OPTIMAL PATH FINDING IN A WHOLLY UNKNOWN ENVIRONMENT
TECHNIQUE FIELD
The first task for our autonomous learning robot is searching a path like a human being in a wholly unknown environment. In other words, it is finding a semi-optimal path to a given goal in a wholly unknown environment. The task description for the robot is to follow the goal's direction and approach the goal. Thus there is only one problem, which is collision avoidance in a wholly unknown environment.
Therefore it demands a high technique of learning for environment modeling. Similar to the theory of A Learning Classifier System (MIT Press, 1998), our theory also has two basic algorithms: a genetic learning algorithm and a reinforcement learning algorithm.
A learning algorithm is an essential topic in artificial intelligence. Utilizing the theory of the Markov Decision Process (MDP), which is a discrete-time process, is a powerful approach for analyzing human decision problems in artificial intelligence. How to avoid obstacles is a kind of decision process. Moreover, the motion of a robot can be considered a discrete-time process. Therefore, we utilize MDP theory to describe and analyze the basis of our HiMBS theory.
Temporal abstraction is an essential topic in the artificial intelligence area for analyzing human decision processes. Options over MDP, proposed by R.S. Sutton based on MDP theory and semi-MDP theory, is a theory of temporal abstraction within the framework of reinforcement learning. We also utilize the Option concept to describe our powerful novel environment modeling method: the sub-goal task method.
A very important note is that although our HiMBS theory is also an approach of temporal abstraction in artificial intelligence, our approach is not an Action Replay Process (ARP) or a Synchronous Value Iteration (SVI) process like the theory of Options over MDP and Q-learning. Therefore, the most important prerequisites for understanding the MDP part of our HiMBS theory are several fundamental MDP concepts: MDP, MDP tree, State, Action, State value, Policy and Options.
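As an illustrative sketch only (not the HiMBS algorithm itself), the fundamental concepts listed above — states, actions, a transition, a policy and a state value — can be written down for a tiny deterministic corridor. All names and numbers here are our own invention for illustration.

```python
def state_value(state, policy, reward, transition, gamma=0.9, depth=10):
    """Expected discounted return of `state` under `policy` (finite horizon)."""
    if depth == 0:
        return 0.0
    total = 0.0
    for action, prob in policy(state):      # policy: state -> [(action, probability)]
        nxt = transition(state, action)     # deterministic transition for simplicity
        total += prob * (reward(state, action)
                         + gamma * state_value(nxt, policy, reward, transition,
                                               gamma, depth - 1))
    return total

# A 1-D corridor: states 0..3, goal at 3; action +1 moves right,
# reward 1 is received on the step that reaches the goal.
def transition(s, a):
    return min(3, s + a)

def reward(s, a):
    return 1.0 if transition(s, a) == 3 and s != 3 else 0.0

def always_right(s):
    return [(1, 1.0)]

print(state_value(0, always_right, reward, transition))   # 0.9**2 = 0.81
```

The value of the start state is the reward of reaching the goal discounted by the number of steps it takes, which is the quantity the later formulations (2)-(4) generalize.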
BACKGROUND ART
Our HiMBS algorithm is a new learning algorithm for a mobile robot; it is designed for making an authoritative autonomous learning mobile robot. A conventional opinion on the role of the autonomous learning robot is that it should be able to interact with a wholly unknown natural environment, which is unpredictable in both time and space and perhaps sometimes contains dynamic obstacles. This is one of the most difficult problems for most man-made machines all over the world.
Most conventional learning algorithms for mobile robots demand much human empirical knowledge to enable a robot to explore and exploit a wholly unknown environment. Mostly, the algorithm's designer not only gives the task question to a machine but also spends tremendous time implanting non-task-method answer keys into this machine; that is, these methods cannot be realized as the task description does, but greatly depend on humans' other non-task-method empirical knowledge to realize the task. Strictly speaking, the environment in most of these conventional learning theories is not exactly wholly unknown. Some representative algorithms are discussed later.
Once more, the humans' other empirical knowledge we mention here means that the answer keys of these kinds of knowledge do not utilize knowledge of the task to realize the task. For instance, for the general task of path finding, that is, the task of following the goal's direction from a start point and approaching this given goal, the method of pre-giving an environment map to a robot and the method of wall following are not considered task-method answer keys. They are humans' other non-task empirical knowledge answer keys to realize the task.
In our HiMBS algorithm, we utilize task knowledge, which is the task description of following the goal's direction and approaching the goal, to realize the task of finding a semi-optimal path from a start point to an arbitrary given goal.
Our Background of Some Conventional Learning Algorithm Theories and Path Finding Theories
The supervised learning algorithm is a famous conventional learning algorithm. However, it often wastes much time on designing a rarely occurring architecture in an actual experiment. It cannot deal with a wholly unknown environment exactly.
A learning classifier system affords a very good algorithm framework. It consists of a reinforcement learning algorithm and a genetic algorithm. However, it is impossible to make a robot avoid arbitrary obstacles by just constructing a blank algorithm structure that shows it avoiding several simple obstacles. We are not sure that this algorithm can deal with a wholly unknown environment and make the robot show more complex behaviors.
Jong Hwan Lim and Dong Woo Cho (Robotica, 1998) propose the method of employing the concept of a Quadtree to solve the path finding task. "Robust Sub-goal Planning and Motion Execution for Robots in a Fuzzy Environment", proposed by Jianwei Zhang and J. Raczkowsky in IROS, also realizes the same task as ours. Both of them also utilize the same sub-goal method as ours to solve the environment-modeling problem.
The environment in path finding can be described as a tree structure: a particular Quadtree or a general MDP tree for a whole environment (see Fig. 0). In the general MDP tree, each node represents a possible passing point for a robot. Every group of nodes that starts from the starting point to the final point of the task goal forms a path. An optimal path can be derived from all of the found paths. Therefore, how to find the nodes of a path is the first essential topic for every path finding theory.
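The tree view above can be sketched in a few lines: each node is a possible passing point, every root-to-goal node sequence is a candidate path, and the optimal path is derived from all found paths. The tree below and the "fewest passing points" criterion are invented for illustration.

```python
def all_paths(tree, node, goal, path=None):
    """Enumerate every node sequence from `node` down to `goal`."""
    path = (path or []) + [node]
    if node == goal:
        yield path
    for child in tree.get(node, []):
        yield from all_paths(tree, child, goal, path)

tree = {                      # adjacency: node -> possible next passing points
    "start": ["a", "b"],
    "a": ["goal"],
    "b": ["c"],
    "c": ["goal"],
}
paths = list(all_paths(tree, "start", "goal"))
optimal = min(paths, key=len)    # fewest passing points = shortest path here
print(paths)     # [['start', 'a', 'goal'], ['start', 'b', 'c', 'goal']]
print(optimal)   # ['start', 'a', 'goal']
```

The hard part, which the rest of this description addresses, is how the nodes of such a tree are discovered in the first place when the environment is wholly unknown.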
The method of employing a Quadtree to find nodes is not a task method; that is, it is additional human empirical knowledge beyond the task knowledge. This method assumes the robot has Quadtree knowledge and can decompose its workspace into several sub-nodes. These nodes are chosen as sub-goals to be reached successively. However, this decomposition method cannot solve the obstacle avoidance problem; in other words, this method cannot model the environment. In order to solve this problem, the designer implants another piece of human empirical knowledge into the robot, which is following the boundaries of objects; that means the robot has the empirical knowledge of wall following to pass obstacles. In a word, this theory employs much human non-task empirical knowledge to enable a robot to successfully find a path in an unknown environment.
The method of the robust sub-goal planning theory also does not utilize a task method to find nodes. This method assumes that obstacles are polygons and that these polygons are pre-known. Thus, it is possible for this method to utilize the method of searching a Tangent-graph to find nodes. Obviously, the environment in this method is not an actually unknown environment.
In a word, although these two methods also treat the nodes of a tree as sub-goals to find a path, they utilize too much human non-task empirical knowledge to find the nodes of an environment's Markov tree. However, our method of finding these nodes is novel. These nodes are found by a robot's own experience of learning and interacting with its environment, not by any pre-assumption of ours as human beings. The robot is guided by its task knowledge: following the goal's direction and approaching the goal.
Therefore, our invention here focuses on how our task method can find the nodes of a semi-optimal path by the robot itself. That is, how can our HiMBS robot realize the problem of environment modeling through learning? An optimal path can be derived after all semi-optimal paths are found in our HiMBS theory. Our invention here firstly considers how our robot can find a semi-optimal path.
Our Background of Options over MDP
The theory of Options over MDP is newer than Dyna-Q learning, which was also developed by R.S. Sutton. It introduces many methods on exploring and exploiting an environment and on how to calculate an optimal policy. These methods include: 1. The reinforcement learning (MDP) framework. 2. SMDP (option-to-option) methods. 3. Sub-goals for learning options. Etc. They have proposed many mathematical methods to describe and solve the human decision process as a Markov Decision Process. Collision avoidance is a kind of human decision process and a discrete-time process. Therefore, we utilize MDP to describe our method.
However, as mentioned by R.S. Sutton himself, the theory of Options over MDP offers a lot, but the most interesting cases are beyond it. We consider that this means the details of evaluating actions' credits, actual methods of how to interact with a large-scale wholly unknown environment, and one of the most difficult actual problems: the robot's position navigation system problem. They also say that many key issues remain incompletely understood, for example, the source of sub-goals (and of course the method of realizing them) and integration with state abstraction, etc. Although our approach is different from these ARP processes, a similar sub-goal method, which is the only essential method for solving our problem of environment modeling through learning, is proposed by us to successfully realize the task of environment modeling.
Benchmark Mobile Learning Robot Problems: Environment Problem, Learning Problem and Position Navigation Problem.
The benchmark mobile learning robot problem is how to make an authoritative autonomous learning robot. That is, whether a mobile robot is this kind of robot or not is judged according to its interacting environment's fuzzy degree. A mobile robot can be said to be an autonomous learning robot if its interacting environment is wholly unknown, as complex and arbitrary as an actual space will be, with every possibility of containing dynamic obstacles at every corner of it. However, being capable of interacting with a wholly unknown environment is not the only judging criterion of an autonomous learning robot. This environment capability has been a final pursued goal throughout long human history. That is, in order to enable a robot to interact with a really wholly unknown environment, we, as designers, should implant a certain high learning capability into a robot. What kind of learning capability we should implant into a robot is a great problem.
Moreover, there is one of the most difficult actual problems for a mobile robot. Setting up an accurate and robust position-localizing system is always arguably one of the essential necessities to meet the objectives of an autonomous guided mobile robot in these environments. There are two representative existing position navigation systems: the Global Position System (GPS) and the Inertial Navigation System (INS). Realizing these localization systems in a small-range area may be accurate and cost-reasonable. On the contrary, the accuracy of these navigation systems in a large-range arbitrary natural environment will greatly decrease, the cost will rise, and the correctness of the robot's interactions will be destroyed.
DESCRIPTION OF THE INVENTION
1. Our robot's conditions, improvements and the structure of this description of the invention.
Perhaps there exist many other methods to realize the same task as our system: the task of environment modeling through learning. However, our methods are realized under a robot's worst conditions: the robot only has one kind of external sensor, which is only capable of detecting obstacles one step around the robot. Here, one step refers to the length between the robot's two footmarks. The robot does not have any outer position navigation system. Furthermore, the robot should realize its task within the framework of the task knowledge, which is the least knowledge that we should give to a robot when it is put into an environment. However, the robot has to interact with a wholly unknown environment.
Therefore, what is the task description for our robot? We give an example in Fig. 1. In such an empty and lightless environment, except for the goal, the task is that the robot should approach the goal and reach the goal by following the goal's direction. One criterion for judging reaching the goal is that when the robot is stopped, in other words, when the robot can judge that there are two same states, we can inform the robot that it has reached its task goal. Corresponding to the theory of Options over MDP [R.S. Sutton], the task can be described in the following MDP policy model:

Py: S × A | goal = TG → [0,1], method = θ   — Task MDPP model (1)
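The stopping criterion above — the robot judging that "there are two same states" — can be sketched as detecting a repeated footmark state. Following the later parts of this description, a coordinate pair stands in for a state; the function name and data are our own.

```python
def is_deadlocked(footmarks):
    """True once any footmark state (here: a coordinate pair) repeats."""
    seen = set()
    for state in footmarks:
        if state in seen:
            return True
        seen.add(state)
    return False

print(is_deadlocked([(0, 0), (1, 0), (1, 1)]))   # no repeat -> False
print(is_deadlocked([(0, 0), (1, 0), (0, 0)]))   # oscillation -> True
```

The same repeated-state test serves double duty later on: near the task goal it signals arrival, and elsewhere it signals a deadlock at a dead corner.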
Here θ means following the goal's direction and approaching the goal. We will describe more details of this model later. The theory of Options over MDP is a framework of temporal abstraction in reinforcement learning. It is as general as a mathematical tool for analyzing methods of exploring an environment and exploiting it by calculating policy-values, action-values and state-values for a robot. We have made some important improvements over these two theories: 1. We touch on some of the most important actual issues: how to enable the robot to interact with a large-scale wholly unknown environment. We propose a particular method of the framework of the task knowledge to realize environment modeling through sub-goals learning. 2. We propose a Zhinan Zen Inertial Navigation System with only one inertial sensor, an encoder, to guide a robot and evaluate its actions' credits. 3. We propose the essential issue of the sub-goal method to enable the robot to overcome the complexity of an environment and to find a semi-optimal path quickly. 4. Our sub-goal method can absorb the actual great mobile robot problems of motor slipping (because the slipping-occurring state will be treated as a deadlock state) and the accumulative dead reckoning errors (because the robot moves according to orientation).
Our method is quite novel because we try to use the task knowledge to finish the task, and this method is as natural as humans' daily intelligent activities. We compare our HiMBS robot with a human being under the same task of finding a semi-optimal path to a given goal to show our robot's naturalness. (See Table 1.0.)
Table 1.0. The conditions of a natural autonomous robot.
A human | Our HiMBS robot
0. Can interact with a wholly unknown environment | 0. The same as a human
1. No need of any position navigation system | 1. The same as a human
2. Only needs to know the goal's direction, or some literal features of the goal | 2. The same as a human; better to have an eigenvalue or an estimated position of the goal
3. Has a strong memory problem | 3. Has a strong memory
4. Has empirical knowledge about the environment and methods, e.g. walking along a wall to pass | 4. No knowledge of the environment
5. Has knowledge of its front, back, etc. | 5. Does not have such knowledge
6. Can see further than one footstep around | 6. Can only detect an obstacle one step around it
The structure of this description of our invention is: 1. Brief introduction of our robot's conditions, improvements and the structure of this description of the invention. 2. Our HiMBS theory, its program procedure and an essential sub-goal theorem. 3. Experiment results and an error function. 4. Conclusions and our future work.
2. Our HiMBS theory, its program procedure and an essential sub-goal theorem.
An autonomous learning system should be able to endow a robot with human-level intelligence to interact with a natural environment. Mostly, the natural environment is wholly unknown, unpredictable in both time and space. Sometimes, it also contains dynamic objects, for instance an obstacle, an active predator or a locomotive goal. Of course, it consists of arbitrary obstacles, which have uncertain size and shape. Conventionally, if an autonomous robot cannot successfully realize this mission in such a natural environment, it is not considered an authoritative autonomous learning robot, since an autonomous learning robot interacts with an environment as uncertain as a natural creature does. Therefore, its interaction is a process from sub-optimization to optimization, a process from trials to planning, a process from errors to correctness. The genetic algorithm is quite suitable for such a process. We utilize it as the control center of the robot's actions. However, the environment is very complex. There are various different stimuli, whereas there is only one standard of credit for the genetic algorithm to run its rules (actions, etc.) process. The reinforcement learning algorithm is suitable for translating various stimuli into a single credit. It is a good evaluating tool. Thus, we use the reinforcement learning algorithm as a cooperative algorithm of the GA.
The architecture of our HiMBS system is shown in Fig. 3. Since the environment is wholly unknown to our robot, the genetic algorithm makes sure our robot explores and exploits the environment thoroughly. The genetic algorithm has four modules: reproduction, selection, crossover and mutation. Reproduction produces some fundamental strategies and, of course, primary actions at each footmark of the robot. For instance, the strategy of the framework of the task method knowledge is executed in this module, as is the primary strategy of the sub-goal method. Selection acts as an auction selector to select every action and every strategy to command the actuator or other component systems to produce actions or rules. The roles of both crossover and mutation are only realized in finding other paths in order to find an optimal path, which is not the central topic in this paper. They both change the genes of a path to find other paths. The genes of a path mean the important points of a path, not all the points of a path. The roles of the reinforcement learning algorithm are evaluating the credits of each action and each control rule in order to let the Selection module work, and evaluating the stimuli of a robot, for example the stimulus of encountering an obstacle, the stimulus of reaching the task goal, etc.
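A schematic sketch, entirely our own simplification, of the cooperation described above: the genetic-algorithm side produces candidate actions and selects one by auction, while a reinforcement-learning-style evaluator collapses several stimuli into the single credit the auction needs.

```python
def reproduction():
    """Produce the primitive candidate actions (here: four headings)."""
    return [(1, 0), (-1, 0), (0, 1), (0, -1)]

def credit(action, stimuli):
    """RL-style evaluator: translate various stimuli into a single credit."""
    return sum(weight * feature(action) for weight, feature in stimuli)

def selection(actions, stimuli):
    """Auction: the action with the highest credit commands the actuator."""
    return max(actions, key=lambda a: credit(a, stimuli))

# Illustrative stimuli: reward moving right (toward an assumed goal),
# mildly penalize moving up (an assumed obstacle stimulus).
stimuli = [(1.0, lambda a: a[0]), (0.5, lambda a: -a[1])]
print(selection(reproduction(), stimuli))   # (1, 0)
```

Crossover and mutation are omitted here because, as the text notes, they only come into play when searching for alternative paths.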
We use the sub-goal's MDP model to describe the most fundamental strategy, which is the framework of the task method knowledge. For instance, as we show in the figure, when a robot is stopped, the genetic algorithm transfers the current goal and the termination condition of the stopped process to the MDP sub-goal model; this model then produces sub-goals as the following current goal for the reinforcement learning algorithm. The reinforcement learning module uses the current goal to continue evaluating the robot's every action. The MDP sub-goal model judges the termination condition of the sub-goal process. The procedures of our whole system's programming are:
• In the 1st whole loop, the robot utilizes the task method knowledge to explore the environment; it can then find one quite short path, and can find the optimal path at once if there is no obstacle, because of our task method. Our robot utilizes an inertial navigation method to build a two-dimensional coordinate plane for its environment. Moreover, in the first whole loop, after the robot arrives at its task goal, it can correct the estimated coordinates of the task goal's position.
• During the 2nd whole loop, the robot will try to optimize the found path by calling the mutation module and calculate the path's policy value according to its learning experience of the environment's model.
We have mentioned that our HiMBS theory is a refined theory of Options over MDP. One of the biggest improvements is in the most fundamental problem of the credit model. In the Q-learning theory and the Dyna-Q learning theory, the credit models are only in a real number space; that is, they only consider the award part of a credit, which is c_t = r_t, t = 0,1,2,3,... However, according to the newest reinforcement learning theory, a credit consists of not only an award but also a punishment. Therefore, we upgrade the most primary credit model into a complex number space, which is c_t = r_t + i·ω_t, t = 0,1,2,3,... Here, ω_t represents the punishment part of a credit. Thus, from our credit model, we can have more accurate information on the robot's interaction, and it enables us to separate two paths that have the same policy-value under Q-learning but lie in two different environments. We can then find the optimal path. Therefore, all of the value calculations of policy-values (in the mobile robot case, simply speaking, path-values), state-values and action-values are changed. For example, the calculation of state-values is changed from formulation (2) to (3).
However, in the whole loop of exploring an environment, we do not utilize the SVI process (Synchronous Value Iteration process). In other words, we do not start with an arbitrary approximation calculation to find paths and then make path planning, because exploring every point of an environment by starting with an arbitrary approximation calculation demands numerous memories and takes plenty of time.
V^{\pi}(s) = \sum_{a \in A} \pi(s,a)\Big[r_s^a + \gamma \sum_{s'} P_{ss'}^a \,\mathrm{Re}\big(V^{\pi}(s')\big)\Big] + i \sum_{a \in A} \pi(s,a)\Big[\omega_s^a + \gamma \sum_{s'} P_{ss'}^a \,\mathrm{Im}\big(V^{\pi}(s')\big)\Big]   (3)
Our method of utilizing the framework of the task method knowledge has an immediate approximation. In addition, we utilize the cumulative calculation formulations, such as formulation (3), in the second whole loop, that is, soon after the robot has finished its task. Therefore, our calculation of state-values in the first whole loop of exploring an environment is simplified as formulation (4). Here, π(s,a) means the probability of taking action a in state s under the policy π (task MDPP model (1)); r_s^a means the one-step reward of taking action a in state s.
V^{\pi}(s) = \sum_{a \in A} \pi(s,a)\, r_s^a + i \sum_{a \in A} \pi(s,a)\, \omega_s^a   (4)
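The complex-number credit model and the simplified first-loop state value of formulation (4) map naturally onto Python's built-in complex type: the real part accumulates rewards, the imaginary part punishments. The probabilities and credit values below are illustrative only.

```python
def state_value(policy_probs, rewards, punishments):
    """Formulation (4): V(s) = sum_a pi(s,a)*r_s^a + i * sum_a pi(s,a)*w_s^a."""
    return (sum(p * r for p, r in zip(policy_probs, rewards))
            + 1j * sum(p * w for p, w in zip(policy_probs, punishments)))

# Two situations with the same reward total but different punishments can now
# be told apart -- which a purely real-valued credit model could not do.
v1 = state_value([0.5, 0.5], [2.0, 2.0], [0.0, 0.0])
v2 = state_value([0.5, 0.5], [2.0, 2.0], [1.0, 1.0])
print(v1, v2)   # (2+0j) (2+1j): equal real parts, different imaginary parts
```

This is exactly the separation the text describes: paths with the same Q-learning policy-value but different environments differ in the imaginary (punishment) part.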
Thus, how to evaluate actions' credits and how to realize our three objectives are described below. To realize the first objective of environment modeling, there is a new dead-corner theory that appears in the first whole loop.
As we mentioned before, the method condition of our robot is described in an empty and lightless environment. In the task MDPP model (1), Py means a policy. S means the state space of a robot's environment. A means the action set of a robot. [0,1] means the probability of a policy is between 0 and 1. θ represents a certain method to realize the policy. A policy means the mapping from states to the probabilities of taking each available primitive action. A brief and more accurate explanation is that a policy is a possible sequence of action-state pairs. In path finding theory, a policy can simply be called a path.
Therefore, what does the model (1) mean exactly? When a robot is put into a wholly unknown environment for the goal, it receives the only environment knowledge: the goal's direction and the estimated coordinates of the goal. (Strictly speaking, the coordinates of the goal can be obtained from the goal's direction.) The task for the robot, even in an empty and lightless environment (except for the robot and the goal), is described as: follow the goal's direction. If the robot is not stopped, the robot keeps on approaching. The initiating conditions are written as I. If the goal is treated as an obstacle, the robot will fall into an oscillating state, which is called a deadlock state. Thus, the deadlock state conditions can be a criterion to judge the task goal. The deadlock termination condition of a global task process is represented by β.
Thus, in order not to use an outer navigation system and to evaluate the actions' credits, we propose a Zhinan Zen Inertial Navigation System and Evaluation method (abbreviated as ZZINS EF, see Fig. 3).
Our robot has eight moving freedoms. We use the ZZINS EF to evaluate the credits of these eight possible actions according to whether the direction of an action makes the robot go farther from or nearer to the task goal. An action has the highest credit among the eight actions if its direction is the same as the current direction of the task goal. Because of our ZZINS EF method, our robot has an immediate efficiency of finding the shortest path at once in an empty and lightless environment and does not need any outer position navigation system. Our robot is initiated by the direction of the current goal, not the coordinates of the robot. Therefore, our robot can absorb a quite high level of motor slipping errors.
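The eight-direction crediting above can be sketched as follows. The exact ZZINS EF formula is not given in the text, so the cosine-of-angle scoring here is our assumption; it merely realizes the stated property that the action aimed at the goal earns the highest credit.

```python
import math

# The eight moving freedoms of the robot (unit grid directions).
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

def credits(robot, goal):
    """Credit of each action: alignment (cosine) of its direction with the goal's."""
    gx, gy = goal[0] - robot[0], goal[1] - robot[1]
    gnorm = math.hypot(gx, gy)
    out = []
    for dx, dy in DIRECTIONS:
        dnorm = math.hypot(dx, dy)
        out.append((gx * dx + gy * dy) / (gnorm * dnorm))
    return out

c = credits((0, 0), (10, 0))
best = DIRECTIONS[c.index(max(c))]
print(best)   # (1, 0): the action aimed straight at the goal wins
```

Note that only the goal's direction enters the score, consistent with the claim that the robot is driven by direction rather than by its own coordinates.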
Thus, the eight possible moving directions have their own corresponding credits.
Therefore, what do the initiating conditions I and the deadlock termination conditions β mean exactly in this empty and lightless environment? The most simplified meaning of I is that if the initiating conditions are not equivalent to the deadlock termination conditions β at the goal, the robot can keep on going. We define the state of the robot's i-th footmark as s_i, i = 0,1,2,3,4,... Therefore, its next state is s_{i+1}. The deadlock termination condition β means that there exist at least two same footmark states at the same position around the goal in this empty and lightless environment. Mostly, a wholly unknown environment is quite complex. Plenty of particular literal characteristics of an object (obstacle) can represent one unique state element to a human; this, however, is difficult for a computer. Whereas coordinates of a position are not difficult for a computer, so we consider the coordinates of a footmark as a state element of the state at that footmark.
Thus, the value of the state at the i-th footmark is written as V(s_i), i = 0,1,... V(s_i) is the same as formulation (4). It summarizes all of the possible action values at state s_i. C(s_i) = (x_i, y_i) is the particular coordinates state element of state s_i.
Therefore, corresponding to the theory of Options over MDP, the Options-MDP (abbreviated as OMDP) model of our framework of the task method knowledge can be written as:
Here, the task can be described by the option O. It consists of the initiating conditions I at the start point and the deadlock termination conditions β at the position of the task goal. CG means the current goal, TG means the task goal, and Py means the MDP policy model of (1).
O(I, Py, β)
Py: S × A | CG = TG → [0,1], method = θ
β: V(s_i) = V(s_{i+k}) | CG = TG
i = 0,1,2,3,4,...; k = 2,3,4,5,6,...   Task OMDP model (TOMDP) (5)
We assume the state now is s_{i+1}. Because of the method condition of our OMDP model framework, there appear Dead-Corner problems. However, in order to try to utilize the task knowledge to solve the Dead-Corners and to realize the given task, we find that it is still possible to use the OMDP model to finish the task. In a following paper, we will introduce our dead-corner theory.
To solve these dead-corner problems, in other words, to model the dead-corner environment, we propose a sub-goal theorem, which is mentioned as an essential issue by R.S. Sutton. In order to construct a sub-goal OMDP model, we only change the task OMDP model a little. We assume that s_i is the state at the position of judging a deadlock state. When a robot is stopped by a dead corner at the point (x1, y1), it will set up sub-goals near its possible moving directions at a changeable distance len. Normally, a length of 80 steps for len is enough for passing most arbitrary dead corners and obstacles if the environment does not contain a particularly big obstacle. We give an example of a sub-goal in Fig. 4.2. The following (6) is our sub-goal's OMDP model.
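The sub-goal setup just described can be sketched directly: at the deadlock point (x1, y1), one sub-goal is placed along each candidate moving direction at distance len (about 80 step-lengths, per the text). The function name and the direction list are our own.

```python
import math

def make_subgoals(x1, y1, directions, length=80.0):
    """One sub-goal per candidate moving direction, `length` away from (x1, y1)."""
    goals = []
    for dx, dy in directions:
        n = math.hypot(dx, dy)                       # normalize the direction
        goals.append((x1 + length * dx / n, y1 + length * dy / n))
    return goals

subgoals = make_subgoals(0.0, 0.0, [(0, 1), (1, 0)])
print(subgoals)   # [(0.0, 80.0), (80.0, 0.0)]
```

A sub-goal placed this way may of course land inside an obstacle's body — which, as the text explains below, is one reason the distance parameter k must remain dynamically changeable.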
O(β, Py, I)
Py: S × A | CG = SG ≠ TG → [0,1], method = θ
I: V(s_i) ≠ V(s_{i+k}) | CG = SG
i = 0,1,2,3,4,...; k = dynamic const ∈ {2,3,4,5,6,7,...}   Sub-goal OMDP model (6)
We only switch the current goal between TG (task goal) and SG, switch the positions of β and I, and limit k to a dynamically changeable constant rather than the arbitrary variable k of the task OMDP model (5). We do not change the essential part of the task OMDP model, which is the θ method.
An important note is why k is dynamically changeable. There are several reasons: one is that the environment is wholly unknown; one is that the robot has a non-zero size; one is that there exist arbitrary-size dead corners; one is that sub-goals may be set up inside the body of an obstacle, which means the sub-goals are unreachable. The final reason is that we want to enable the robot to learn an accurate dead-corner model.
We summarize a sub-goal theorem as below:
In order to use the natural sub-goal method and to use the task knowledge to solve the Environment Modeling problem, we set up a Sub-goal OMDP model (SOMDP) from the task OMDP model by changing little of it.

When a robot is falling into a deadlock state, some sub-goals will be set up near the possible moving directions at a certain distance. We assume the state at the footmark of judging a deadlock state is s_i. The SOMDP model will begin to be set up according to the termination condition β of the TOMDP. That means the current goal will be changed to a sub-goal (SG), and then the robot will begin to move for k steps after s_i under the sub-goal policy. Afterwards, if the state-value V(s_{i+k}) satisfies the termination condition I, the current goal will be switched back to the task goal (TG); that is, the option O of the SOMDP is finished. However, the option O will be set up again at the 2k-th step after s_i if the state value V(s_{i+2k}) satisfies the initial condition β under the same task goal policy Py. If at the 2k-th step footmark β is incorrect, the SOMDP will be over; that means the robot has passed the dead corner and this dead corner's environment modeling has been finished. If β is correct, k will increase to k + 1. The same SOMDP is set up again, and the sub-goal process will begin as the former sub-goal process did.
Therefore, there occurs an interesting learning feature of TRIAL. The robot usually tries many times to make an accurate dead-corner model, which is shown by our experiment.
Thus, the first objective of our environment modeling through learning by the robot itself is realized. A model of a dead corner normally consists of three points. Please see the example in Fig. 4.
We treat the model points of a dead corner as the genes of a path. Therefore, we activate the Mutation module of the Genetic Algorithm to set every gene of a path to be the current goal for some time, in different orders, during the robot's approach to the task goal. With different orders of the genes treated as the current goal, different paths will be found. Additionally, our sub-goal method in path finding also guarantees exploring the environment thoroughly and finding an optimal path. More details of these topics will be presented in our following papers.
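The gene-reordering idea above can be sketched as follows: the model points of a dead corner act as the genes, and visiting them as current goals in different orders yields different candidate paths. The points and the path builder are invented for illustration.

```python
from itertools import permutations

def candidate_paths(start, genes, task_goal):
    """One candidate path per ordering of the genes (dead-corner model points)."""
    return [[start, *order, task_goal] for order in permutations(genes)]

genes = [(2, 3), (4, 1)]                       # model points of one dead corner
paths = candidate_paths((0, 0), genes, (9, 9))
print(len(paths))   # 2 orderings of 2 genes -> 2 candidate paths
```

Each candidate path can then be scored with the complex-valued policy calculation of the second whole loop to pick the optimal one.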
3. Experiment results and an error function.
In order to show the correctness of most issues of our sub-goal learning theory, we present a dynamic experiment result here. In this experiment, the robot encounters a dynamic obstacle firstly, a theoretical dead corner secondly and a particular physical dead corner finally. (See Fig. 6.)
Because the interacting environment is wholly unknown to the robot and our robot has to learn to get a correct model of its environment, the robot cannot get a correct model of this dynamic obstacle without trying. Fig. 5 is the TRIAL graphic of this dynamic experiment.
In Table 3.0, DO means the dynamic obstacle. The red coordinates (97, 440) mean the 1st sub-goal, which is chosen firstly, is unreachable: it is in the body of Obstacle 1. DC1 is the first Dead Corner. DL means the state of Dead Lock. Under DL, S means the start step of the Dead Lock and E is the last step of this Dead Lock. TRIALs 1(3) means the first TRIAL takes 3 steps. SG1 means the 1st sub-goal of this Dead Corner. SP means the position of the Dead Lock. MP means the last point of this Dead Corner's model.
Table 3.0. A dynamic experiment result
From this experiment, we can get a TRIAL function on Environment Modeling through learning as formulation (5), which, according to the sense of the modeling task, is an error function. Here, S_n means the steps of the robot from its initial position to the position where the robot encounters a Dead Corner; t means time. A process is considered a TRIAL when the current goal is a sub-goal, not the task goal. A TRIAL consists of several steps.

S(t) = S_n + 2·S_min·T(t) + P_min·T(t)·[1 + T(t)]   (5)
/' means the minimal moving steps duπng one TRIAL / (/ ) means the number of trying times at
time / . that is the number of TRIALs nm means the minimal step-size of a dead comer, it can be anv
integer number In this dynamic expeπment. we let Smm = 3 steps to fast the process
According to this error function, we can obtain the length of one side of a Dead Corner, which is

LengthDC = (Smin + Pmin) * T(t)    (6)
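As a minimal sketch of Formulas (5) and (6) (function names and default parameter values are our own assumptions), the error function and the dead-corner side length can be computed as:

```python
def trial_steps(s_n, trials, s_min=3, p_min=1):
    """Formula (5): total steps S(t) after T(t) = `trials` TRIALs.

    s_n   -- steps from the initial position to the Dead Corner
    s_min -- minimal moving steps during one TRIAL (3 in the experiment)
    p_min -- minimal step-size of a dead corner (any integer)
    """
    return s_n + 2 * s_min * trials + p_min * trials * (1 + trials)

def dead_corner_length(trials, s_min=3, p_min=1):
    """Formula (6): length of one side of a Dead Corner after T(t) TRIALs."""
    return (s_min + p_min) * trials
```

With no TRIALs the cost reduces to the approach steps alone; each additional TRIAL adds the fixed out-and-back cost plus a growing exploration term.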
We compare the steps in the first loop of exploring and exploiting the environment with the steps in the second loop of original path optimization in Fig. 5.1 and Fig. 5.2. The results show the correctness of the learned dynamic obstacle's model and the soundness of our sub-goal OMDP theorem.
4. Conclusions and our future work.
Our HiMBS algorithm is a novel learning approach to Environment Modeling in a wholly unknown environment. It enables a robot to learn the obstacle model independently and truly autonomously. It thoroughly frees a mobile robot from the restrictions of its environment.
We built an animation simulator to simulate what an actual wholly unknown environment would be, and successfully tested our HiMBS algorithm in it. We tried to pre-design all of the possible problems of an actual mobile robot experiment in its two-dimensional plane. For instance, the robot has a non-zero size, and indeed an arbitrary size and shape; obstacles also have arbitrary sizes and shapes, and sometimes an obstacle is moving dynamically; moreover, there is one of the most difficult problems, a hard dead-reckoning problem, etc. These problems have been successfully overcome by our HiMBS algorithm. Therefore, although there may still be some defects in HiMBS, we believe the HiMBS algorithm is a powerful and successful algorithm for a mobile robot to find a semi-optimal path to a given goal in a wholly unknown environment. This animation test platform has also verified our Zhinan Zen INS navigation theorem and evaluation theory. In a word, the industrial program experience enabled us to find the essential practical issue of the sub-goal method and the new Environment Modeling learning method of TRYING for mobile robots.
A three-dimensional simulation environment will make our algorithm's behavior more interesting and complex. Our algorithm may fail more times in the first loop; however, it will quickly correct its errors and learn successful models. Additionally, in order to build an authoritative autonomous learning robot, it is necessary for us to test our algorithm on other tasks; otherwise we cannot be sure that it is a general algorithm.
How can our sub-goal method in path planning enable a robot to find paths other than the first found path? Is our sub-goal method in path planning guaranteed to explore and exploit a wholly unknown environment thoroughly? What are the details of our path optimizing theory? Etc. These topics will be made clear in our following papers.
Finally, many essential concepts in robotics are left for us to develop continuously, such as: the definition of intelligence and the least knowledge for a learning algorithm; the state concept in a computer; how to judge whether one algorithm is better than another; etc. This paper of ours is a beginning of the research approach of discussing the least knowledge for a learning algorithm and the state concept in a computer. This invention of ours will be the start of a new artificial intelligence approach.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 The Markov tree structure of every path-finding environment
Fig. 2 The task description in an empty and lightless environment
Fig. 3 The architecture of our HiMBS
Fig. 4 The Zhinan Zen Inertial Navigation System and Evaluation model (abbreviated as ZZINS EF)
Fig. 5 The model of a dead corner
Fig. 6 Learning to model a dynamic obstacle
Fig. 7 A successful dynamic experiment result
BEST MODE FOR CARRYING OUT THE INVENTION
The minimal hardware conditions of our robot are: one step-around-robot external sensor, or a better one, for detecting obstacles; motors and wheels, or other instruments, for moving; an inertial encoder sensor, with no need for a gyro sensor, accelerometer, or tachometer; and a controller for running our algorithm. The essential software condition of the robot is our HiMBS executable file, etc.
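The hardware conditions above can be summarized as a minimal control interface; the class and method names below are illustrative assumptions of ours, not the actual HiMBS interface:

```python
class MinimalRobot:
    """Illustrative minimal interface implied by the hardware conditions above."""

    def sense_around(self):
        """One-step-around-robot external sensor: report, for each
        neighbouring cell, whether it is blocked by an obstacle."""
        raise NotImplementedError

    def move(self, direction):
        """Motors and wheels (or other moving instruments): take one step."""
        raise NotImplementedError

    def read_encoder(self):
        """Inertial encoder reading; no gyro, accelerometer, or
        tachometer is required."""
        raise NotImplementedError
```

A controller running the HiMBS executable would then depend only on these three capabilities.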
INDUSTRIAL APPLICABILITY
If a one-step-around-robot external sensor, or a better one, for detecting obstacles is available, our algorithm can be utilized in industrial applications: for example, a waiter robot in restaurants, a mine-searching machine, a deep-marine-object searching submarine, etc.
References
[1] R.S. Sutton, "Options over MDP," Artificial Intelligence, 1998.
[2] Marco Dorigo and Marco Colombetti, "Robot Shaping."
[3] L.P. Kaelbling, M.L. Littman and A.W. Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, Vol. 4, pp. 237-285, 1996.
[4] K. Miyazaki, M. Yamamura and S. Kobayashi, "k-Certainty Exploration Method: An Action Selector on Reinforcement Learning to Identify the Environment," Journal of Japanese Society for Artificial Intelligence, Vol. 10, No. 3, pp. 454-463, 1995.
[5] J.J. Grefenstette, "Credit Assignment in Rule Discovery Systems Based on Genetic Algorithms," Machine Learning, Vol. 3, pp. 225-245, 1988.
[6] Cynthia Breazeal, "A Motivational System for Regulating Human-Robot Interaction," Proceedings of the National Conference on Artificial Intelligence, AAAI, Menlo Park, CA, USA, 1998, pp. 54-61.
[7] T. Poggio and D. Beymer, "Learning to See," IEEE Spectrum, 1996.