CN109343532A

CN109343532A - A kind of paths planning method and device of dynamic random environment

Info

Publication number: CN109343532A
Application number: CN201811329446.3A
Authority: CN
Inventors: 黄兵明; 廖军; 王泽林
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2018-11-09
Filing date: 2018-11-09
Publication date: 2019-02-15

Abstract

The embodiment of the present invention provides the paths planning method and device of a kind of dynamic random environment, is related to computer information processing field, can find optimal path under dynamic random environment.This method comprises: defined feature vector space, assign the state value of start node to initial intermediate quantity, according to the initial intermediate quantity, obtain the run action of start node, the state value and run action of advance node, simultaneously according to the recurrence least square Q nitrification enhancement based on CMAC, intermediate parameters are updated；Then above-mentioned process is repeated after assigning the state value for the node that advances to initial intermediate quantity, until repeating the above-mentioned process since the state value of start node assigns initial intermediate quantity when initial intermediate quantity is identical with the state value of terminal node；Determining for weight row vector is calculated according to recurrence least square solution formula to be worth, and to obtain target feature vector space, final Q value table is obtained according to the determining value of target feature vector space and right vector, to obtain optimal path.

Description

A kind of paths planning method and device of dynamic random environment

Technical field

The present invention relates to computer information processing field more particularly to a kind of paths planning method of path random environment and Device.

Background technique

Barrier avoidance is an indispensable ring in the optimizing of path, it may be said that the path optimizing in dynamic random environment It is exactly to find the shortest path from initial point to target point under the premise of avoiding obstacles.Range in existing pathfinding algorithm The paths optimizing algorithms such as first search algorithm, ant group algorithm, genetic algorithm and A* algorithm, need to know the specific of environmental model Information, that is to say, that very high to the required precision of environmental model and route searching space.But large-scale role playing game scene In the barriers such as other players, monster and intrinsic mountain, water, the forest that occur at random so that environmental model and route searching Space is dynamic, is random.Therefore to a certain extent for, for the barrier avoidance problem in the optimizing of path, tradition Path optimizing algorithm be not applicable.

Intensified learning belongs to searching algorithm, can traverse all paths in the case where state and unknown environment, according to giving Fixed money reward function acquires the value of the objective function of each paths, the maximum path of target function value is therefrom chosen, in conjunction with mind The avoidance and path optimizing purpose under dynamic random ring scene may be implemented through network.But neural network is approached due to the overall situation and is led to Normal training speed is slower, and the computing resource (memory etc.) and cost (time etc.) in large-scale scene of game needed for it are not to be inconsistent Share family experience requirements.Therefore partial approximation neural network is usually taken, and the most important potential limitation of partial approximation is exactly Feature unit required for increase with input space dimension is increased with exponential form, and partial approximation cannot achieve The planning in global optimum path.

Summary of the invention

The embodiment of the present invention provides the paths planning method and device of a kind of dynamic random environment, for saving calculating On the basis of resource, to the optimum route search in dynamic random environment between two nodes.

In order to achieve the above objectives, the embodiment of the present invention adopts the following technical scheme that

In a first aspect, providing a kind of paths planning method of dynamic random environment, comprising:

Obtain the initial value of eligibility trace, the state of the construction initial value of column vector, the initial value of structural matrix, start node The state value of value and terminal node；The state value of start node includes the space coordinate of start node, the state value of terminal node Space coordinate including terminal node；

According to the activation primitive of the initial value of the weight row vector of CMAC Neural Network CMAC hidden layer and CMAC, building is dynamic The characteristic vector space of state random environment；

Assign the state value of start node to initial intermediate quantity；

According to initial intermediate quantity, the fortune of the run action of start node, the state value of advance node and advance node is obtained Action is made；

According to initial intermediate quantity, the initial value of eligibility trace, characteristic vector space, the initial value for constructing column vector, construction square The initial value of battle array, the run action of start node, the run action of the state value of advance node and advance node, foundation are based on The recurrence least square Q nitrification enhancement of CMAC to the initial value of eligibility trace, constructs the initial value and structural matrix of column vector Initial value be updated；

After assigning the state value for the node that advances to initial intermediate quantity, according to initial intermediate quantity, the operation of start node is obtained It acts, the run action of the state value of advance node and advance node；The run action of start node and the state of advance node Value corresponds；

When the initial intermediate quantity of determination is identical with the state value of terminal node, the state value of start node is assigned in initial Behind the area of a room, according to initial intermediate quantity, the operation of the run action of start node, the state value of advance node and advance node is obtained Movement；

When determining that there are the initial intermediate quantity of predetermined number is identical as the state value of terminal node in all initial intermediate quantities When, it is minimum according to recurrence according to the initial value of the initial value of the structural matrix at current time and the construction column vector at current time Two multiply the determining value that solution formula calculates weight row vector；

Determine that value is updated characteristic vector space according to weight row vector, to obtain target feature vector space； Value, target feature vector space are determined according to weight row vector, calculate final Q value table according to default value of Q calculation formula；According to Final Q value table determines the optimal path in dynamic random environment between start node and terminal node.

Above-described embodiment provides technical solution, first by the weight row vector initial value of CMAC and activation primitive to entire The space of dynamic random environment is defined, and obtains characteristic vector space, assigns the state value of start node to a centre Value is initial intermediate quantity, according to the initial intermediate quantity, obtains the run action of start node, the next node of start node is advanced The state value of node and the run action of advance node, while recurrence least square Q nitrification enhancement of the foundation based on CMAC, Eligibility trace relevant to the final determining value of weight row vector, structural matrix and construction column vector are updated；Then will before Into node state value assign repeat after initial intermediate quantity it is above-mentioned from the state value of start node assign initial intermediate quantity after stream Journey, until repeating to assign from the state value of start node initial intermediate when initial intermediate quantity is identical with the state value of terminal node Measuring the process started, initially intermediate quantity is identical as the state value of terminal node until there is predetermined number；Then most according to recurrence Small two multiply solution formula calculate the weight row vector determine value, with to characteristic vector space be updated acquisition target signature to Quantity space can get according to the determining value in target feature vector space and right vector and obtain by multiple intensified learning Final Q value table, according to the final Q value table can obtain start node to terminal node optimal path.Because of the invention The technical solution that embodiment provides, recurrent least square method and multistep Q nitrification enhancement and CMAC are combined, and form three The algorithm to recirculate, both with the advantage of the small and global extremely excellent stable convergence of recurrent least square method calculation amount, but also with CMAC The fast advantage of velocity of approch is also equipped with the advantage of the optimum search of multistep Q nitrification enhancement, so that the algorithm is in elephant Final Q value table can be rapidly obtained in the dynamic randoms environment such as topic figure of multiplayer online games while saving computing resource And the optimal path obtained according to final Q value table.

Second aspect provides a kind of path planning apparatus of dynamic random environment, comprising: obtains module, establishes module, sentences Disconnected module, node processing module, update module, loop module, weight computing module, feature calculation module, Q value table computing module And path selection module；

Obtain module, for obtain the initial value of eligibility trace, construct the initial value of column vector, structural matrix initial value, The state value of start node and the state value of terminal node；The state value of start node includes the space coordinate of start node, eventually Only the state value of node includes the space coordinate of terminal node；

Module is established, for according to the initial value of the weight row vector of CMAC Neural Network CMAC hidden layer and swashing for CMAC Function living, constructs the characteristic vector space of dynamic random environment；

Loop module assigns initial intermediate quantity for will acquire the state value of start node of module acquisition；

Node processing module is used for the initial intermediate quantity that generates according to loop module, obtain start node run action, The state value of advance node and the run action of advance node；

Update module, initial intermediate quantity for being generated according to loop module obtain the initial of the eligibility trace that module obtains Be worth, establish module building characteristic vector space, obtain module obtain construction column vector initial value, obtain module acquisition The advance that the run action for the start node that initial value, the node processing module of structural matrix obtain, node processing module obtain The run action for the advance node that the state value and node processing module of node obtain, according to the recurrence least square based on CMAC Q nitrification enhancement is updated the initial value of eligibility trace, the initial value of the initial value and structural matrix that construct column vector；

The state value imparting for the advance node that node processing module is also used to obtain node processing module in loop module After initial intermediate quantity, according to the initial intermediate quantity that loop module generates, the run action of start node, the shape of advance node are obtained The run action of state value and advance node；The run action of start node and the state value of advance node correspond；

When judgment module determines the initial intermediate quantity that loop module generates and obtains the state for the terminal node that module obtains When being worth identical, node processing module is also used to assign in the state value that loop module will acquire the start node of module acquisition initial After intermediate quantity, according to the initial intermediate quantity that loop module generates, the run action of start node, the state value of advance node are obtained With the run action of advance node；

In all initial intermediate quantities that judgment module determines loop module generation, there are the initial intermediate quantities of predetermined number When identical as the state value of terminal node for obtaining module acquisition, weight computing module is current for being updated according to update module The initial value of the initial value of the structural matrix at moment and construction column vector, according to recurrence least square solution formula calculate weight row to Amount determines value；

Feature calculation module, weight row vector for being calculated according to weight computing module determine value to establishing module structure The characteristic vector space built is updated, to obtain target feature vector space；

Q value table computing module, the determining value and feature calculation of the weight row vector for being calculated according to weight computing module The target feature vector space that module obtains calculates final Q value table according to default value of Q calculation formula；

Path selection module, the final Q value table for being calculated according to Q value table computing module determine in dynamic random environment just Optimal path between beginning node and terminal node.

The third aspect provides a kind of path planning apparatus of dynamic random environment, comprising: memory, processor, bus and Communication interface；For storing computer executed instructions, processor is connect with memory by bus memory；When dynamic random ring When the path planning apparatus operation in border, processor executes the computer executed instructions of memory storage, so that dynamic random environment Path planning apparatus execute as first aspect provide dynamic random environment paths planning method.

Fourth aspect provides a kind of computer storage medium, including computer executed instructions, when computer executed instructions exist When being run on computer, so that computer executes the paths planning method of the dynamic random environment provided such as first aspect.

The paths planning method and device of dynamic random environment provided in an embodiment of the present invention, this method comprises: obtaining money The initial value of lattice mark constructs the initial value of column vector, the initial value of structural matrix, the state value of start node and terminal node State value；The state value of start node includes the space coordinate of start node, and the state value of terminal node includes terminal node Space coordinate；According to the activation primitive of the initial value of the weight row vector of CMAC Neural Network CMAC hidden layer and CMAC, building The characteristic vector space of dynamic random environment；Assign the state value of start node to initial intermediate quantity；According to initial intermediate quantity, obtain Take the run action of the run action of start node, the state value of advance node and advance node；According to initial intermediate quantity, qualification The initial value of mark, characteristic vector space, the construction initial value of column vector, the initial value of structural matrix, the operation of start node are dynamic Make, the run action of the state value of advance node and advance node, is calculated according to the recurrence least square Q intensified learning based on CMAC Method is updated the initial value of eligibility trace, the initial value of the initial value and structural matrix that construct column vector；To advance node State value assign initial intermediate quantity after, according to initial intermediate quantity, obtain the run action of start node, the state of advance node The run action of value and advance node；The run action of start node and the state value of advance node correspond；It is first when determining When beginning intermediate quantity is identical with the state value of terminal node, after assigning the state value of start node to initial intermediate quantity, according to initial Intermediate quantity obtains the run action of the run action of start node, the state value of advance node and advance node；It is all when determining When identical as the state value of terminal node there are the initial intermediate quantity of predetermined number in initial intermediate quantity, according to the structure at current time The initial value of the initial value of matrix and the construction column vector at current time is made, calculates weight row according to recurrence least square solution formula Vector determines value；Determine that value is updated characteristic vector space according to weight row vector, to obtain target feature vector Space；Value, target feature vector space are determined according to weight row vector, calculate final Q value according to default value of Q calculation formula Table；The optimal path in dynamic random environment between start node and terminal node is determined according to final Q value table.So of the invention Embodiment provides technical solution, can be first by the weight row vector initial value of CMAC and activation primitive to entire dynamic random The space of environment is defined, and obtains characteristic vector space, assigns the state value of start node to a median, that is, initial Intermediate quantity obtains the shape of the run action of start node, the next node advance node of start node according to the initial intermediate quantity The run action of state value and advance node, while according to the recurrence least square Q nitrification enhancement based on CMAC, to weight The relevant eligibility trace of value, structural matrix and construction column vector are updated for row vector final determining；Then by the node that advances State value assign repeat after initial intermediate quantity it is above-mentioned from the state value of start node assign initial intermediate quantity after process, until just When beginning intermediate quantity is identical with the state value of terminal node, the stream since the state value of start node assigns initial intermediate quantity is repeated It is identical as the state value of terminal node that the initial intermediate quantity of predetermined number occurs in Cheng Zhizhi；Then public according to recurrence least square solution Formula calculates the determining value of the weight row vector, obtains target feature vector space, root to be updated to characteristic vector space The final Q value obtained by multiple intensified learning can be got according to the determining value of target feature vector space and right vector Table, according to the final Q value table can obtain start node to terminal node optimal path.Because the embodiment of the present invention provides Technical solution, recurrent least square method and multistep Q nitrification enhancement and CMAC are combined, three calculations recirculated are formed Method, it is fast but also with CMAC velocity of approch both with the advantage of the small and global extremely excellent stable convergence of recurrent least square method calculation amount Advantage, the advantage of the optimum search of multistep Q nitrification enhancement is also equipped with, so that the algorithm is swum online in the more people of elephant Final Q value table can be rapidly obtained in the dynamic randoms environment such as topic figure of play while saving computing resource and according to most The optimal path that whole Q value table obtains.

Detailed description of the invention

In order to more clearly explain the embodiment of the invention or the technical proposal in the existing technology, to embodiment or will show below There is attached drawing needed in technical description to be briefly described, it should be apparent that, the accompanying drawings in the following description is only this Some embodiments of invention for those of ordinary skill in the art without creative efforts, can be with It obtains other drawings based on these drawings.

Fig. 1 is a kind of flow diagram of the paths planning method of dynamic random environment provided in an embodiment of the present invention；

Fig. 2 is the idiographic flow schematic diagram of 104 steps in Fig. 1；

Fig. 3 is the idiographic flow schematic diagram of 10412 steps in Fig. 2；

Fig. 4 is the idiographic flow schematic diagram of 10422 steps in Fig. 2；

Fig. 5 is the idiographic flow schematic diagram of 105 steps in Fig. 1；

Fig. 6 is a kind of instantiation figure of the paths planning method of dynamic random environment provided in an embodiment of the present invention；

Fig. 7 is that two kinds of path optimizing algorithms provided in an embodiment of the present invention are imitative in 40 × 40 game gridding environment True comparison；

Fig. 8 is that two kinds of path optimizing algorithms provided in an embodiment of the present invention are imitative in 50 × 50 game gridding environment True comparison；

Fig. 9 is the corresponding average learning curve comparison diagram of Fig. 8；

Figure 10 is the corresponding calculating time comparison diagram of Fig. 8；

Figure 11 provides a kind of structural schematic diagram of the path planning apparatus of dynamic random environment for the embodiment of the present invention；

Figure 12 provides the structural schematic diagram of the path planning apparatus of another dynamic random environment for the embodiment of the present invention.

Specific embodiment

Following will be combined with the drawings in the embodiments of the present invention, and technical solution in the embodiment of the present invention carries out clear, complete Site preparation description, it is clear that described embodiments are only a part of the embodiments of the present invention, instead of all the embodiments.It is based on Embodiment in the present invention, it is obtained by those of ordinary skill in the art without making creative efforts every other Embodiment shall fall within the protection scope of the present invention.

It should be noted that in the embodiment of the present invention, " illustrative " or " such as " etc. words make example, example for indicating Card or explanation.Be described as in the embodiment of the present invention " illustrative " or " such as " any embodiment or design scheme do not answer It is interpreted than other embodiments or design scheme more preferably or more advantage.Specifically, " illustrative " or " example are used Such as " word is intended to that related notion is presented in specific ways.

It should also be noted that, in the embodiment of the present invention, " (English: of) ", " corresponding (English: Corresponding, relevant) " it sometimes can be mixed with " corresponding (English: corresponding) ", it should be pointed out that It is that, when not emphasizing its difference, meaning to be expressed is consistent.

For the ease of clearly describing the technical solution of the embodiment of the present invention, in an embodiment of the present invention, use " the One ", the printed words such as " second " distinguish function and the essentially identical identical entry of effect or similar item, and those skilled in the art can To understand that the printed words such as " first ", " second " are not to be defined to quantity and execution order.

In computer game, especially in the online role playing game of the more people of elephant or more people sports class game, pathfinding process One of vital task when always, the path optimizing algorithm in game algorithm itself will directly affect the game experiencing of player.And with The development of technology, scene of game will become increasingly complex, resource needed for traditional path optimizing algorithm (memory, time) also refers to Number property increases, and if using traditional path optimizing algorithm always, it will greatly occupy needed for remaining functional task of game Computing resource, seriously affects the usage experience of user, so needing one kind can be on the basis of saving computing resource rapidly The algorithm for finding global optimum path is replaced traditional path optimizing algorithm.

Inventive concept of the invention is introduced below:

In traditional path optimizing algorithm, BFS (Breadth First Search, breadth-first search) algorithm is a kind of Blind search algorithm all nodes can all scan in map, until finding result until, consuming computing resource it is more and Obtained path is not necessarily optimal；

Heuristic search algorithm A* algorithm is to solve the most effective direct search side of shortest path in a kind of static road network Method, and solve the problems, such as the efficient algorithm of many search, the range estimation value and actual value in algorithm are closer, final search speed Degree is faster, but for dynamic random environment, and be not suitable for.

Although nitrification enhancement can traverse all paths in the case where state and unknown environment, according to given reward Golden function acquires the value of the objective function of each paths, the maximum path of target function value is therefrom chosen, in conjunction with neural network The avoidance and path optimizing purpose under dynamic random ring scene may be implemented.But neural network is approached due to the overall situation and is usually trained Speed is slower, and the computing resource (memory etc.) and cost (time etc.) in large-scale scene of game needed for it are not meet user Experience requirements.Therefore partial approximation neural network is usually taken, and the most important potential limitation of partial approximation is exactly with defeated It is increased with exponential form for entering feature unit required for the increase of Spatial Dimension, and partial approximation cannot achieve the overall situation most The planning of shortest path.

CMAC (Cerebellar Model Articulation Controller, Cerebellar Model Articulation Controller) is a kind of The local very strong neural network of generalization ability, therefore it has compared to other neural network advantages: the modified weight of CMAC network Algorithm is simple, stores the information in partial structurtes, and under the premise of guaranteeing function approximation capability, pace of learning is fast, very suitable Close on-line study；CMAC structure is simple, is easy to hardware realization and software realization.It is therefore contemplated that can be by itself and traditional intensified learning The task of automatic pathfinding is realized among algorithm connected applications to online game.But also due to this characteristic, it the shortcomings that be exactly only It can be realized local optimum, and ' optimal ' path under scene of game in path planning is global optimum；And recurrence minimum two Multiplication (Recur sive Least Square, RLS) is that a kind of calculation amount is small and can guarantee that global extremely excellent stable convergence is calculated Method, so technical staff expects obtaining this triplicity the recurrence least square Q nitrification enhancement based on CMAC come to dynamic State random environment carries out path planning.

Based on above-mentioned thought, shown referring to Fig.1, the embodiment of the present invention provides a kind of path planning side of dynamic random environment Method, comprising:

101, initial value, the construction initial value of column vector, the initial value of structural matrix, start node of eligibility trace are obtained The state value of state value and terminal node.

Wherein, the state value of start node includes the space coordinate of start node, and the state value of terminal node includes terminating The space coordinate of node.

102, according to the activation primitive of the initial value of the weight row vector of CMAC Neural Network CMAC hidden layer and CMAC, structure Build the characteristic vector space of dynamic random environment.

Specifically, feature space vector for representing dynamic random environment, is specifically used as sample space, example in the algorithm Property, feature space vector are as follows:

Wherein, s is start node obtained in algorithm operational process to any node between terminal node in path Described initial intermediate quantity in state value, that is, subsequent step, a are the run action of s, ω₁To ω_NFor in weight row vector first to N-th element, f are the activation primitive of CMAC.

103, the state value of start node is assigned to initial intermediate quantity.

It, in practice can be with specifically, propose the definition of initial intermediate quantity here merely to apparent in statement There is no the initial intermediate quantities, as long as the circulation in completing technology scheme.

104, according to initial intermediate quantity, the run action of start node, the state value of advance node and advance node are obtained Run action.

Wherein, the state value of the run action of start node and advance node corresponds.

When start node and terminal node are same node, 111 are executed after 104 steps.

Illustratively, referring to shown in Fig. 2,104 steps are specifically included:

10411, determine that the executable execution movement of the corresponding node of initial intermediate quantity is the first movement of start node.

Specifically, executable execution movement here refers to that the corresponding node of initial intermediate quantity executes this and executable holds After action is made, the node of arrival is not present barrier, the step for purpose tail avoidance；In actual algorithm, there are the sections of barrier The state value of point is set as 1, and there is no the state values of the node of barrier to be set as 0.

Illustratively, execution movement includes any one of following: up, down, left and right.

10412, according to initial intermediate quantity and initial Q value table, start node is chosen from the first movement according to greedy algorithm Run action.

Illustratively, referring to shown in Fig. 3,10412 are specifically included:

104121, it is acted according to initial intermediate quantity and first, determines the state value of first node；First movement and first segment The state value of point corresponds.

104122, it is selected from initial Q value table according to the state value of the second movement and first node corresponding with the second movement Take the first Q value；Second movement is any first movement.

104123, maximum second movement of the first Q value is determined as to the run action of start node.

10413, according to the run action of initial intermediate quantity and start node, the state value of advance node is determined.

10414, according to the state value of advance node and initial Q value table, the operation of advance node is obtained according to greedy algorithm Movement.

In the run action acquisition process of advance node, do not need to judge that the corresponding node of its forward motion whether there is Barrier.

10421, determine that the execution movement that can be performed according to the corresponding node of initial intermediate quantity is start node first is dynamic Make.

10422, according to the state value of initial intermediate quantity and terminal node, according to selecting bad principle heuristic search algorithm, from the The run action of start node is chosen in one movement.

Specifically, heuristic search algorithm, each position searched for is commented in the search exactly in sample space Estimate, obtains best position, then scan for until target from this position.Here, heuristic factor, which follows, selects bad principle；It selects bad Gesture original is exactly to select a worst track to pass through the environment that can learn according to certain indexs (such as the priori knowledge in the field) Model (map model), to obtain worst feedback i.e. money reward value.In practice, do so makes algorithm than any instead Optimal solution i.e. optimal path is quickly found using the method for other priori knowledges.

Illustratively, referring to shown in Fig. 4,10422 steps are specifically included:

104221, it is acted according to initial intermediate quantity and first, determines the state value of first node；First movement and first segment The state value of point corresponds.

104222, according to the state value of the state value of first node and terminal node, the is calculated according to heuristic factor formula The heuristic factor value of one node.

Illustratively, heuristic factor formula are as follows:

W (s, a)=| | s '-Goal | |²；

Wherein, (s a) is heuristic factor to W, and s ' is the state value of first node, and Goal is the state value of terminal node, and s is Initial intermediate quantity, a are corresponding first movement of s '.

104223, corresponding first movement of the state value of the maximum first node of heuristic factor value is determined as start node Run action.

10423, according to the run action of initial intermediate quantity and start node, the state value of advance node is determined.

10424, according to the state value of the state value of advance node and terminal node, foundation selects bad principle heuristic search algorithm Obtain the run action of advance node.

105, according to initial intermediate quantity, the initial value of eligibility trace, characteristic vector space, initial value, the structure for constructing column vector The run action of the initial value of matrix, the run action of start node, the state value of advance node and advance node is made, according to base In the recurrence least square Q nitrification enhancement of CMAC, to the initial value of eligibility trace, the initial value of construction column vector and construction square The initial value of battle array is updated.

Specifically, being stored in the initial value of the eligibility trace of pre-set space in the circulation of entire algorithm, constructing column vector The initial value of initial value and structural matrix can update again and again as the cycle progresses.

Illustratively, referring to Figure 5,105 steps specifically include:

1051, according to initial intermediate quantity and characteristic vector space, according to default eligibility trace more new formula to the first of eligibility trace Initial value is updated, to obtain the initial value of the eligibility trace updated.

Illustratively, eligibility trace more new formula is preset are as follows:

Wherein, e' is the initial value of the eligibility trace updated, and e is the initial value of eligibility trace, and λ is mark decay factor, and γ is folding The factor is detained, s is initial intermediate quantity, and a is the run action according to the s start node obtained,For the corresponding feature of s and a Vector space.

1052, according to the initial value of the initial value of construction column vector and the eligibility trace of update, according to pre-set configuration column vector More new formula is updated the initial value of construction column vector, to obtain the initial value of the construction column vector updated.

Illustratively, pre-set configuration column vector more new formula are as follows:

B'=e'r+b；

Wherein, b' is the initial value of the construction column vector updated, and r is money reward value, and b is the initial value for constructing column vector.

1053, according to the initial value of the eligibility trace of update, initial intermediate quantity, the run action of start node, advance node State value, the run action of advance node, characteristic vector space and structural matrix initial value, more according to pre-set configuration matrix New formula is updated the initial value of structural matrix, to obtain the initial value of the structural matrix updated.

Illustratively, pre-set configuration matrix update formula are as follows:

Wherein, A~for update structural matrix initial value, A be structural matrix initial value, s' is to be obtained according to s The state value of advance node, a' are the run action according to the s advance node obtained,For the corresponding feature of s' and a' to Quantity space, I be unit matrix, the order of I andThe quantity of middle feature vector is equal.

106, the state value for the node that advances is assigned to initial intermediate quantity.

After 106 steps, 104 are executed.

107, when the initial intermediate quantity of determination is identical with the state value of terminal node, the state value of start node is assigned just Beginning intermediate quantity.

104 are executed after 107 steps.

Specifically, the circulation of 104 steps to 107 steps is the innermost loop of algorithm provided in an embodiment of the present invention, often One cycle is all to find a start node to the path of terminal node.

108, when determining that there are the state values of predetermined number initial intermediate quantity and terminal node in all initial intermediate quantities When identical, according to the initial value of the initial value of the structural matrix at current time and the construction column vector at current time, according to recurrence Least square solution formula calculates determining for weight row vector and is worth.

Specifically, the circulation of 104 steps to 107 steps is the path in order to look for different start nodes to terminal node, But there is preset upper limit in this searching process, i.e., stop after finding predetermined number path, so in practice If being in order to avoid will cause the waste of computing resource having found optimal path and go back continuous service algorithm；Predetermined number It is identical as the state value of terminal node that path will also have the initial intermediate quantity of predetermined number.

Illustratively, recurrence least square solution formula are as follows:

θ=A~b'；

Wherein, θ be weight row vector determine value, A~for current time structural matrix initial value, b' be it is current when The initial value of the construction column vector at quarter.

109, determine that value is updated characteristic vector space according to weight row vector, to obtain target feature vector sky Between.

Specifically, 109 steps can be according to the newest of acquisition referring to the expression formula of the characteristic vector space in 102 steps Weight row vector is to the ω in feature space vector₁To ω_NIt is replaced, to obtain target feature vector space.

110, value, target feature vector space are determined according to weight row vector, is calculated most according to default value of Q calculation formula Whole Q value table.

Illustratively, default value of Q calculation formula are as follows:

Wherein, Q^πFor final Q value table,For target feature vector space, s is any initial intermediate quantity, and a is according to s The run action of the start node of acquisition.

111, according to the initial value and characteristic vector space of weight row vector, final Q is calculated according to default value of Q calculation formula It is worth table.

112, the optimal path in dynamic random environment between start node and terminal node is determined according to final Q value table.

Illustratively, referring to shown in Fig. 6, by taking simple 5 × 8 grid experiment scene as an example, each grid represents one A node, S hardship face represent start node, and G smiling face represents terminal node, and each state point has 4 movements it can be selected that Upper (↑), under (↓), left (←), right (→), thing representing fault present in grid, the embodiment of the present invention may finally obtain figure Shown in path.

Scene of game is thought into grid as shown in FIG. 6, wall is set by barrier similar to labyrinth, by intelligent body Grid locating for Agent, that is, arrow or position regard status point at this time as, and the barrier in each grid is according to trip It plays occurring at random with situation at that time and is set as black rectangle.Before not reaching home, it is transferred to from a state next The consuming cost of a state is set as r=-1, and instantaneous money reward is considered as in intensified learning, therefore, finds optimal path The problem transfer strategy minimum for cost spent from Initial condition to terminal state, and technical side provided in an embodiment of the present invention Case may finally obtain the final Q value table obtained by repetition test, in table there is since start node to terminal node The available money reward value of each node in each paths terminated, therefore can also obtain as shown in institute in Fig. 6 most Shortest path.

In order to become apparent from the advantage for showing technical solution provided in an embodiment of the present invention, carried out below with two specific examples Illustrate:

Simulation comparison in the game gridding environment of example one, 40 × 40: referring to shown in Fig. 7, in the embodiment of the present invention Introduce the recurrence least square Q nitrification enhancement CMAC-wRLSQ (λ) based on CMAC for selecting the bad factor and in practice based on Multistep least square Q nitrification enhancement RBF-LSQ (λ) ((the Radial Basis Function-Least of radial basis function SquaresQ) optimal path searched in one 40 × 40 grid environment is compared, the intensified learning that two kinds of algorithms use Rate α=0.1, Greedy strategy parameter ε=0.1 and eligibility trace parameter lambda=0.5 of greedy algorithm and regularization factors g= 10^-4.The learning curve of two kinds of algorithms is intellectual Agent required step number in each path optimizing event, that is, is spent Cost, program fetch runs 30 average value, and every operation be once just randomly generated one 40 × 40 initial point be S (1, It 4) is the labyrinth of G (35,34) with target point, the probability that each grid in the labyrinth of generation generates barrier uses identical mark Quasi normal distribution, and have following formula expression:

I.e.Tiletype=1 indicates herein to be barrier, tiletype=0 Indicate that the free time can be searched herein；

As shown in Figure 7, two kinds of algorithms take 50 times (value of the predetermined number provided in above-described embodiment) of study respectively Path optimizing result is obtained based on the Q value table obtained afterwards.Matrix is exactly the scene of game of gridding in figure, and black square indicates The barriers such as other players, scenery with hills and waters, animal, two kinds of the track algorithm search that when the 50th study that light color broken line represents generates The optimal path arrived.It can be seen from the figure that two kinds of algorithms all do not search real optimal path, but this present invention is implemented Algorithm expression effect provided by the technical solution that example provides is more preferable, shows as directly advancing to target point from initial point, almost Do not take step, closer to real optimal path.

Simulation comparison in the game gridding environment of example two, 50 × 50: referring to shown in Fig. 8, in order to increase experimental situation Complexity and randomness are generated in barrier and are also become larger on probability not only by increasing in scale, other all parameters and example One is consistent, excellent under this extensive dynamic random environment of Massively Multiplayer Online role playing game to verify this patent algorithm Gesture, wherein initial point is S (Isosorbide-5-Nitrae) and target point is the labyrinth of (45,44) G, and each grid in the labyrinth of generation generates obstacle The probability of object is expressed by following formula:

Referring to shown in Fig. 8, RBF-LSQ (λ) (left side) and this patent are proposed in 50 × 50 game gridding environment CMAC-wRLSQ (λ) (right side) two kinds of algorithms respectively taking study number be 50 when a path optimizing result.It can from figure To find out, as the scale of the game environment complexity that becomes larger is got higher, traditional RBF-LSQ (λ) algorithm pathfinding effect is declined, And the mentioned algorithm of this patent not only still has excellent effect, and can be seen that, as environment difficulty increases, advantage is also more next It is more obvious.

Specifically, curve graph as shown in Figure 9 be exactly in example two two kinds of algorithms run 30 times in 50 × 50 environment Average learning curve comparison diagram, abscissa is study number, and ordinate is step number required for reaching target point, and in game In can be quickly found in the short time optimal path be the key that save game running cost；As can be seen from the figure RBF-LSQ (λ) algorithm learning curve convergence rate is very slow and in fact the algorithm does not search optimal policy, initial step number always It is 2200, is constantly reduced in 100 study by study, be reduced to about 500 when study number is 100.This patent Mentioned CMAC-wR LSQ (λ) algorithm is greatly improved on learning rate compared to traditional Q algorithm, and required step number is by first 1200 to begin are quickly reduced to about 200 steps through overfitting, and gradually restrain when learning number and being 20, to a certain extent may be used To say being to have a qualitative leap.It is opened in figure from Fig. 9 and Figure 10 two as can be seen that still being searched for either in terms of learning rate In terms of the optimal path arrived, all it is greatly improved compared to legacy paths optimizing algorithm.

To sum up, the paths planning method of dynamic random environment provided in an embodiment of the present invention, this method comprises: obtaining qualification Initial value, the construction initial value of column vector, the initial value of structural matrix, the state value of start node and the shape of terminal node of mark State value；The state value of start node includes the space coordinate of start node, and the state value of terminal node includes the sky of terminal node Between coordinate；According to the activation primitive of the initial value of the weight row vector of CMAC Neural Network CMAC hidden layer and CMAC, building is dynamic The characteristic vector space of state random environment；Assign the state value of start node to initial intermediate quantity；According to initial intermediate quantity, obtain The run action of the run action of start node, the state value of advance node and advance node；According to initial intermediate quantity, eligibility trace Initial value, characteristic vector space, construction the initial value of column vector, the initial value of structural matrix, start node run action, The state value of advance node and the run action of advance node, recurrence least square Q nitrification enhancement of the foundation based on CMAC, The initial value of eligibility trace, the initial value of the initial value and structural matrix that construct column vector are updated；By the node that advances After state value assigns initial intermediate quantity, according to initial intermediate quantity, the run action of start node, the state value of advance node are obtained With the run action of advance node；The run action of start node and the state value of advance node correspond；It is initial when determining When intermediate quantity is identical with the state value of terminal node, after assigning the state value of start node to initial intermediate quantity, in initial The area of a room obtains the run action of the run action of start node, the state value of advance node and advance node；It is all first when determining When identical as the state value of terminal node there are the initial intermediate quantity of predetermined number in beginning intermediate quantity, according to the construction at current time The initial value of the construction column vector of the initial value and current time of matrix, according to recurrence least square solution formula calculate weight row to Amount determines value；Determine that value is updated characteristic vector space according to weight row vector, to obtain target feature vector sky Between；Value, target feature vector space are determined according to weight row vector, calculate final Q value table according to default value of Q calculation formula； The optimal path in dynamic random environment between start node and terminal node is determined according to final Q value table.So the present invention is real It applies example and technical solution is provided, it can be first by the weight row vector initial value of CMAC and activation primitive to entire dynamic random ring The space in border is defined, and obtains characteristic vector space, and assigning the state value of start node to a median is in initial The area of a room obtains the state of the run action of start node, the next node advance node of start node according to the initial intermediate quantity The run action of value and advance node, while according to the recurrence least square Q nitrification enhancement based on CMAC, to weight row The relevant eligibility trace of value, structural matrix and construction column vector are updated for vector final determining；Then by the shape for the node that advances State value assign repeat after initial intermediate quantity it is above-mentioned from the state value of start node assign initial intermediate quantity after process, until initial When intermediate quantity is identical with the state value of terminal node, the process since the state value of start node assigns initial intermediate quantity is repeated Until it is identical as the state value of terminal node the initial intermediate quantity of predetermined number occur；Then according to recurrence least square solution formula Calculate the weight row vector determines value, obtains target feature vector space to be updated to characteristic vector space, according to The determining value of target feature vector space and right vector can get the final Q value obtained by multiple intensified learning Table, according to the final Q value table can obtain start node to terminal node optimal path.Because the embodiment of the present invention provides Technical solution, recurrent least square method and multistep Q nitrification enhancement and CMAC are combined, three calculations recirculated are formed Method, it is fast but also with CMAC velocity of approch both with the advantage of the small and global extremely excellent stable convergence of recurrent least square method calculation amount Advantage, the advantage of the optimum search of multistep Q nitrification enhancement is also equipped with, so that the algorithm is swum online in the more people of elephant Final Q value table can be rapidly obtained in the dynamic randoms environment such as topic figure of play while saving computing resource and according to most The optimal path that whole Q value table obtains.

Referring to Fig.1 shown in 1, the embodiment of the present invention also provides a kind of path planning apparatus 01 of dynamic random environment, comprising: It obtains module 21, establish module 22, judgment module 23, node processing module 24, update module 25, loop module 26, weight meter Calculate module 27, feature calculation module 28, Q value table computing module 29 and path selection module 30；

Obtain module 21, for obtain the initial value of eligibility trace, construct the initial value of column vector, structural matrix it is initial The state value of value, the state value of start node and terminal node；The state value of start node includes the space coordinate of start node, The state value of terminal node includes the space coordinate of terminal node；

Module 22 is established, for according to the initial value of the weight row vector of CMAC Neural Network CMAC hidden layer and CMAC Activation primitive constructs the characteristic vector space of dynamic random environment；

Loop module 26 assigns initial intermediate quantity for will acquire the state value of start node of the acquisition of module 21；

Node processing module 24 is used for the initial intermediate quantity generated according to loop module 26, and the operation for obtaining start node is dynamic Make, the run action of the state value of advance node and advance node；

Update module 25, the eligibility trace that initial intermediate quantity, acquisition module 21 for being generated according to loop module 26 obtain Initial value, establish module 22 building characteristic vector space, obtain module 21 obtain construction column vector initial value, acquisition Run action, the node processing for the start node that initial value, the node processing module 24 for the structural matrix that module 21 obtains obtain The run action for the advance node that the state value and node processing module 24 for the advance node that module 24 obtains obtain, foundation are based on The recurrence least square Q nitrification enhancement of CMAC to the initial value of eligibility trace, constructs the initial value and structural matrix of column vector Initial value be updated；

The state for the advance node that node processing module 24 is also used to obtain node processing module 24 in loop module 26 After value assigns initial intermediate quantity, according to the initial intermediate quantity that loop module 26 generates, obtains the run action of start node, advances The state value of node and the run action of advance node；The run action of start node and the state value one of advance node are a pair of It answers；

When judgment module 23 determines the initial intermediate quantity that loop module 26 generates and obtains the terminal node of the acquisition of module 21 State value it is identical when, node processing module 24 be also used to loop module 26 will acquire module 21 acquisition start node shape After state value assigns initial intermediate quantity, according to the initial intermediate quantity that loop module 26 generates, the run action, preceding of start node is obtained Into the state value of node and the run action of advance node；

When judgment module 23 determines in all initial intermediate quantities that loop module 26 generates, there are predetermined number it is initial in When the area of a room is identical as the state value of terminal node for obtaining the acquisition of module 21, weight computing module 27 is used for according to update module 25 Initial and construction column vector the initial value of the structural matrix of update, calculates weight row vector according to recurrence least square solution formula Determine value；

Feature calculation module 28, weight row vector for being calculated according to weight computing module 27 determine value to establishing mould The characteristic vector space that block 22 constructs is updated, to obtain target feature vector space；

Q value table computing module 29, the determining value and feature of the weight row vector for being calculated according to weight computing module 27 The target feature vector space that computing module 28 obtains calculates final Q value table according to default value of Q calculation formula；

Path selection module 30, the final Q value table for being calculated according to Q value table computing module 29 determine dynamic random environment Optimal path between middle start node and terminal node.

Optionally, node processing module 24 is specifically used for:

Determine that the executable execution movement of the corresponding node of initial intermediate quantity that loop module 26 generates is start node First movement；

According to initial intermediate quantity and initial Q value table, the operation of start node is chosen from the first movement according to greedy algorithm Movement；

According to the run action of initial intermediate quantity and start node, the state value of advance node is determined；

According to the state value of advance node and initial Q value table, the run action of advance node is obtained according to greedy algorithm；

Execution movement includes any one of following: up, down, left and right.

Optionally, node processing module 24 is acted according to greedy algorithm from first according to initial intermediate quantity and initial Q value table The process of the middle run action for choosing start node specifically includes:

It is acted according to initial intermediate quantity and first, determines the state value of first node；The shape of first movement and first node State value corresponds；

The first Q is chosen from initial Q value table according to the state value of the second movement and first node corresponding with the second movement Value；Second movement is any first movement；

Maximum second movement of first Q value is determined as to the run action of start node.

Optionally, node processing module 24 is specifically used for:

According to the state value of initial intermediate quantity and terminal node, according to bad principle heuristic search algorithm is selected, from the first movement The middle run action for choosing start node；

According to the state value of the state value of advance node and terminal node, before selecting bad principle heuristic search algorithm acquisition Into the run action of node；

Execution movement includes any one of following: up, down, left and right.

Optionally, node processing module 24 is opened according to the state value of initial intermediate quantity and terminal node according to bad principle is selected Searching algorithm is sent out, the process for the run action for choosing start node from the first movement specifically includes:

According to the state value of the state value of first node and terminal node, first node is calculated according to heuristic factor formula Heuristic factor value；

Corresponding first movement of the state value of the maximum first node of heuristic factor value is determined as to the operation of start node Movement.

Optionally, update module 25 is specifically used for:

The initial intermediate quantity generated according to loop module 26 and the characteristic vector space for establishing the building of module 22, according to default Eligibility trace more new formula is updated the initial value for obtaining the eligibility trace that module 21 obtains, to obtain the first of the eligibility trace updated Initial value；

According to the updated value for the initial value and eligibility trace for obtaining the construction column vector that module 21 obtains, arranged according to pre-set configuration Vector more new formula is updated the initial value of construction column vector, to obtain the initial value of the construction column vector updated；

It is obtained according to the updated value of eligibility trace, the initial intermediate quantity of the generation of loop module 26, node processing module 24 first Before state value, the node processing module 24 for the advance node that the run action of beginning node, node processing module 24 obtain obtain Into the initial of the run action of node, the characteristic vector space for establishing the building of module 22 and the structural matrix of the acquisition acquisition of module 21 Value is updated the initial value of structural matrix according to pre-set configuration matrix update formula, to obtain the structural matrix updated Initial value.

The path planning apparatus of dynamic random environment provided in an embodiment of the present invention, because the device includes: acquisition module, For obtain eligibility trace initial value, construction the initial value of column vector, the initial value of structural matrix, start node state value and The state value of terminal node；The state value of start node includes the space coordinate of start node, and the state value of terminal node includes The space coordinate of terminal node；Module is established, for according to the initial of the weight row vector of CMAC Neural Network CMAC hidden layer The activation primitive of value and CMAC constructs the characteristic vector space of dynamic random environment；Loop module is obtained for will acquire module The state value of start node assign initial intermediate quantity；Node processing module is used for the initial centre generated according to loop module Amount, obtains the run action of the run action of start node, the state value of advance node and advance node；Update module is used for The initial value for the eligibility trace that initial intermediate quantity, the acquisition module generated according to loop module obtains, the feature for establishing module building Vector space, the initial value for obtaining the construction column vector that module obtains, initial value, the node for obtaining the structural matrix that module obtains The state value and node processing for the advance node that the run action for the start node that processing module obtains, node processing module obtain The run action for the advance node that module obtains, according to the recurrence least square Q nitrification enhancement based on CMAC, to eligibility trace Initial value, construct the initial value of column vector and the initial value of structural matrix is updated；Node processing module is also used to following After the state value for the advance node that ring moulds block is also used to obtain node processing module assigns initial intermediate quantity, according to loop module The initial intermediate quantity generated, obtains the run action of the run action of start node, the state value of advance node and advance node； The run action of start node and the state value of advance node correspond；When judgment module determines initial intermediate quantity and terminal node When the state value of point is identical, node processing module is also used to will acquire the state value of the start node of module acquisition in loop module After assigning initial intermediate quantity, according to the initial intermediate quantity that loop module generates, run action, the advance node of start node are obtained State value and advance node run action；Exist when in all initial intermediate quantities that judgment module determines loop module generation The initial intermediate quantity of predetermined number, when identical as the state value of terminal node for obtaining module acquisition, weight computing module is used for According to the updated value of the updated value of structural matrix and construction column vector, weight row vector is calculated according to recurrence least square solution formula Determine value；Feature calculation module, for determining that value is updated characteristic vector space according to weight row vector, to obtain Target feature vector space；Q value table computing module, the determining value of the weight row vector for being calculated according to weight computing module, At the state value and node of the advance node in target feature vector space, node processing module acquisition that feature calculation module obtains The run action for managing the start node that module obtains calculates final Q value table according to default value of Q calculation formula；Path selection module, Final Q value table for being calculated according to Q value table computing module determines in dynamic random environment between start node and terminal node Optimal path.

So the embodiment of the present invention provides technical solution, the weight row vector initial value of CMAC and activation can be passed through first The space of entire dynamic random environment is defined in function, obtains characteristic vector space, and the state value of start node is assigned The i.e. initial intermediate quantity of a median is given, according to the initial intermediate quantity, is obtained under the run action of start node, start node The state value of one node advance node and the run action of advance node, at the same it is strong according to the recurrence least square Q based on CMAC Change learning algorithm, eligibility trace relevant to the final determining value of weight row vector, structural matrix and construction column vector are carried out more Newly；Then the state value for the node that advances is assigned and repeats after initial intermediate quantity above-mentioned to assign in initial from the state value of start node Process behind the area of a room, until repeating to assign from the state value of start node when initial intermediate quantity is identical with the state value of terminal node Giving the process that initial intermediate quantity starts, initially intermediate quantity is identical as the state value of terminal node until there is predetermined number；Then The determining value of the weight row vector is calculated, according to recurrence least square solution formula to be updated acquisition to characteristic vector space Target feature vector space can be got according to the determining value of target feature vector space and right vector by repeatedly strong The final Q value table that chemical acquistion is arrived, according to the final Q value table can obtain start node to terminal node optimal path. Because of technical solution provided in an embodiment of the present invention, by recurrent least square method and multistep Q nitrification enhancement and CMAC phase In conjunction with, three algorithms to recirculate are formed, both there is the advantage of the small and global extremely excellent stable convergence of recurrent least square method calculation amount, But also with the fast advantage of CMAC velocity of approch, it is also equipped with the advantage of the optimum search of multistep Q nitrification enhancement, so that should Algorithm rapid while saving computing resource can obtain in the dynamic randoms environment such as the topic figure of elephant multiplayer online games To final Q value table and the optimal path obtained according to final Q value table.

Referring to Fig.1 shown in 2, the embodiment of the present invention also provides the path planning apparatus of another dynamic random environment, including Memory 41, processor 42, bus 43 and communication interface 44；Memory 41 is for storing computer executed instructions, processor 42 It is connect with memory 41 by bus 43；When the operation of the path planning apparatus of dynamic random environment, processor 42 executes storage The computer executed instructions that device 41 stores, so that the path planning apparatus of dynamic random environment is executed as provided by the above embodiment The paths planning method of dynamic random environment.

In concrete implementation, as one embodiment, processor 42 (42-1 and 42-2) may include one or more CPU, such as CPU0 and CPU1 shown in Figure 12.And as one embodiment, the path planning apparatus of dynamic random environment can To include multiple processors 42, such as processor 42-1 and processor 42-2 shown in Figure 12.It is every in these processors 42 One CPU can be a single core processor (Single-CPU), be also possible to a multi-core processor (Multi-CPU).This In processor 42 can refer to one or more equipment, circuit, and/or for handling data (such as computer program instructions) Handle core.

Memory 41 can be read-only memory 41 (Read-Only Memory, ROM) or can store static information and refer to The other kinds of static storage device enabled, random access memory (Random Access Memory, RAM) or can store The other kinds of dynamic memory of information and instruction, is also possible to Electrically Erasable Programmable Read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), CD-ROM (Compact Disc Read-Only Memory, CD-ROM) or other optical disc storages, optical disc storage (including compression optical disc, laser disc, light Dish, Digital Versatile Disc, Blu-ray Disc etc.), magnetic disk storage medium or other magnetic storage apparatus or can be used in carry or Store have instruction or data structure form desired program code and can by any other medium of computer access, but It is without being limited thereto.Memory 41, which can be, to be individually present, and is connected by communication bus 43 with processor 42.Memory 41 can also be with It is integrated with processor 42.

In concrete implementation, memory 41, for storing the data in the application and executing the software program of the application Corresponding computer executed instructions.Processor 42 can by running or executing the software program being stored in memory 41, with And call the data being stored in memory 41, the various functions of the path planning apparatus of dynamic random environment.

Communication interface 44 is used for and other equipment or communication, such as control using the device of any transceiver one kind System processed, wireless access network (Radio Access Network, RAN), WLAN (Wireless Local Area Networks, WLAN) etc..Communication interface 44 may include that receiving unit realizes that receive capabilities and transmission unit realize transmission Function.

It is total to can be industry standard architecture (Industry Standard Architecture, ISA) for bus 43 Line, external equipment interconnection (Peripheral Component Interconnect, PCI) bus or extension Industry Standard Architecture Structure (Extended Industry Standard Architecture, EISA) bus etc..The bus 43 can be divided into address Bus, data/address bus, control bus etc..Only to be indicated with a thick line in Figure 12, it is not intended that only one convenient for indicating Bus or a type of bus.

The embodiment of the present invention also provides a kind of computer storage medium, and computer storage medium includes that computer execution refers to It enables, when computer executed instructions are run on computers, so that computer executes such as dynamic random provided by the above embodiment The paths planning method of environment.

The embodiment of the present invention also provides a kind of computer program, which can be loaded directly into memory, and Containing software code, the computer program be loaded into via computer and can be realized after executing dynamic provided by the above embodiment with The paths planning method of machine environment.

Those skilled in the art are it will be appreciated that in said one or multiple examples, function described in the invention It can be realized with hardware, software, firmware or their any combination.It when implemented in software, can be by these functions Storage in computer-readable medium or as on computer-readable medium one or more instructions or code transmitted. Computer-readable medium includes computer storage media and communication media, and wherein communication media includes convenient for from a place to another Any medium of one place transmission computer program.Storage medium can be general or specialized computer can access it is any Usable medium.

Through the above description of the embodiments, it is apparent to those skilled in the art that, for description It is convenienct and succinct, only the example of the division of the above functional modules, in practical application, can according to need and will be upper It states function distribution to be completed by different functional modules, i.e., the internal structure of device is divided into different functional modules, to complete All or part of function described above.

In several embodiments provided herein, it should be understood that disclosed device and method can pass through it Its mode is realized.For example, the apparatus embodiments described above are merely exemplary, for example, the module or unit It divides, only a kind of logical function partition, there may be another division manner in actual implementation.Such as multiple units or components It may be combined or can be integrated into another device, or some features can be ignored or not executed.Another point, it is shown or The mutual coupling, direct-coupling or communication connection discussed can be through some interfaces, the indirect coupling of device or unit It closes or communicates to connect, can be electrical property, mechanical or other forms.Unit can be or can also as illustrated by the separation member Not to be physically separated, component shown as a unit can be a physical unit or multiple physical units, it can It is in one place, or may be distributed over multiple and different places.Can select according to the actual needs part therein or Person's whole unit achieves the purpose of the solution of this embodiment.

It, can also be in addition, the functional units in various embodiments of the present invention may be integrated into one processing unit It is that each unit physically exists alone, can also be integrated in one unit with two or more units.Above-mentioned integrated list Member both can take the form of hardware realization, can also realize in the form of software functional units.If integrated unit with The form of SFU software functional unit is realized and when sold or used as an independent product, can store and storage Jie can be read at one In matter.Based on this understanding, the portion that the technical solution of the embodiment of the present application substantially in other words contributes to the prior art Divide or all or part of the technical solution can be embodied in the form of software products, which is stored in one In storage medium, including some instructions are used so that an equipment (can be single-chip microcontroller, chip etc.) or processor (processor) it performs all or part of the steps of the method described in the various embodiments of the present invention.And storage medium above-mentioned includes: The various media that can store program code such as USB flash disk, mobile hard disk, ROM, RAM, magnetic or disk.

The above description is merely a specific embodiment, but scope of protection of the present invention is not limited thereto, any In the technical scope disclosed by the present invention, any changes or substitutions that can be easily thought of by those familiar with the art, all answers It is included within the scope of the present invention.Therefore, protection scope of the present invention should be subject to the protection scope in claims.

Claims

1. a kind of paths planning method of dynamic random environment characterized by comprising

Obtain eligibility trace initial value, construction the initial value of column vector, the initial value of structural matrix, start node state value and The state value of terminal node；The state value of the start node includes the space coordinate of the start node, the terminal node State value include the terminal node space coordinate；

According to the activation primitive of the initial value of the weight row vector of CMAC Neural Network CMAC hidden layer and the CMAC, institute is constructed State the characteristic vector space of dynamic random environment；

Assign the state value of the start node to initial intermediate quantity；

According to the initial intermediate quantity, the run action of the start node, the state value of advance node and advance node are obtained Run action；

According to the initial intermediate quantity, the initial value of the eligibility trace, described eigenvector space, it is described construction column vector just Initial value, the initial value of the structural matrix, the run action of the start node, the advance node state value and it is described before Into the run action of node, according to the recurrence least square Q nitrification enhancement based on CMAC, to the initial of the eligibility trace The initial value of value, the initial value of the construction column vector and the structural matrix is updated；

After assigning the state value of the advance node to the initial intermediate quantity, according to the initial intermediate quantity, obtain described first The run action of the run action of beginning node, the state value of advance node and advance node；The run action of the start node It is corresponded with the state value of the advance node；

When determining that the initial intermediate quantity is identical with the state value of the terminal node, the state value of the start node is assigned After giving the initial intermediate quantity, according to the initial intermediate quantity, the run action of the start node, the shape of advance node are obtained The run action of state value and advance node；

When determining that there are the state values of predetermined number initial intermediate quantity and the terminal node in all initial intermediate quantities When identical, according to the initial value of the initial value of the structural matrix at current time and the construction column vector at current time, The determining value of the weight row vector is calculated according to recurrence least square solution formula；

Determine that value is updated described eigenvector space according to the weight row vector, to obtain target feature vector sky Between；

Value and the target feature vector space are determined according to the weight row vector, are calculated according to default value of Q calculation formula Final Q value table；

It is determined described in the dynamic random environment between start node and the terminal node most according to the final Q value table Shortest path.

2. the paths planning method of dynamic random environment according to claim 1, which is characterized in that described according to described first Beginning intermediate quantity, the run action for obtaining the run action of the start node, the state value of advance node and advance node include:

Determine that the executable execution movement of the corresponding node of the initial intermediate quantity is the first movement of the start node；

According to the initial intermediate quantity and initial Q value table, the initial section is chosen from first movement according to greedy algorithm The run action of point；

According to the run action of the initial intermediate quantity and the start node, the state value of advance node is determined；

According to the state value of the advance node and the initial Q value table, the fortune of the advance node is obtained according to greedy algorithm Action is made；

The execution movement includes any one of following: up, down, left and right.

3. the paths planning method of dynamic random environment according to claim 2, which is characterized in that described according to described first Beginning intermediate quantity and initial Q value table choose the run action packet of the start node according to greedy algorithm from first movement It includes:

According to the initial intermediate quantity and first movement, the state value of first node is determined；It is described first movement and it is described The state value of first node corresponds；

The is chosen from the initial Q value table according to the state value of the second movement and first node corresponding with second movement One Q value；Second movement is any first movement；

Maximum second movement of first Q value is determined as to the run action of the start node.

4. the paths planning method of dynamic random environment according to claim 1, which is characterized in that described according to described first Beginning intermediate quantity, the run action for obtaining the run action of the start node, the state value of advance node and advance node include:

Determine be according to the executable execution movement of the corresponding node of the initial intermediate quantity start node the first movement；

According to the state value of the initial intermediate quantity and the terminal node, according to selecting bad principle heuristic search algorithm, from described The run action of the start node is chosen in first movement；

According to the state value of the state value of the advance node and the terminal node, obtained according to bad principle heuristic search algorithm is selected Take the run action of the advance node；

The execution movement includes any one of following: up, down, left and right.

5. the paths planning method of dynamic random environment according to claim 4, which is characterized in that described according to described first The state value of beginning intermediate quantity and the terminal node is chosen from first movement according to bad principle heuristic search algorithm is selected The run action of the start node includes:

According to the state value of the state value of the first node and the terminal node, described the is calculated according to heuristic factor formula The heuristic factor value of one node；

Corresponding first movement of the state value of the maximum first node of heuristic factor value is determined as to the operation of the start node Movement.

6. the paths planning method of dynamic random environment according to claim 1, which is characterized in that described according to described first Beginning intermediate quantity, the initial value of the eligibility trace, described eigenvector space, initial value, the construction for constructing column vector The operation of the initial value of matrix, the run action of the start node, the state value of the advance node and the advance node Movement arranges initial value, the construction of the eligibility trace according to the recurrence least square Q nitrification enhancement based on CMAC The initial value of the initial value of vector and the structural matrix, which is updated, includes:

According to the initial intermediate quantity and described eigenvector space, according to default eligibility trace more new formula to the eligibility trace Initial value is updated, to obtain the initial value of the eligibility trace updated；

According to it is described construction column vector initial value and the update the eligibility trace initial value, according to pre-set configuration arrange to Amount more new formula is updated the initial value of the construction column vector, to obtain the initial of the construction column vector updated Value；

According to the initial value of the eligibility trace of the update, the initial intermediate quantity, the run action of the start node, institute State the first of the state value of advance node, the run action of the advance node, described eigenvector space and the structural matrix Initial value is updated the initial value of the structural matrix according to pre-set configuration matrix update formula, to obtain described in update The initial value of structural matrix.

7. the paths planning method of dynamic random environment according to claim 1, which is characterized in that the recurrence minimum two Multiply solution formula are as follows:

θ=A^~b'；

Wherein, θ is that determining for the weight row vector is worth, A^~For the initial value of the structural matrix at current time, b' is to work as The initial value of the construction column vector at preceding moment；

The default value of Q calculation formula are as follows:

Wherein, Q^πFor the final Q value table,For target feature vector space, s is any initial intermediate quantity, and a is according to s The run action of the start node of acquisition.

8. the paths planning method of dynamic random environment according to claim 5, which is characterized in that the heuristic factor is public Formula are as follows:

W (s, a)=| | s '-Goal | |²；

Wherein, (s a) is heuristic factor to W, and s ' is the state value of the first node, and Goal is the state of the terminal node Value, s are the initial intermediate quantity, and a is corresponding first movement of s '.

9. the paths planning method of dynamic random environment according to claim 6, which is characterized in that the default eligibility trace More new formula are as follows:

Wherein, e' be the update the eligibility trace initial value, e be the eligibility trace initial value, λ be mark decay because Son, γ are discount factor, and s is the initial intermediate quantity, and a is the run action according to the s start node obtained,For s and The corresponding characteristic vector space of a；

The pre-set configuration column vector more new formula are as follows:

B'=e'r+b；

Wherein, b' is the initial value of the construction column vector of the update, and r is money reward value, and b is the first of the construction column vector Initial value；

The pre-set configuration matrix update formula are as follows:

Wherein, A^~For the initial value of the structural matrix of the update, A is the initial value of the structural matrix, and s' is according to s The state value of the advance node of acquisition, a' are the run action according to the s advance node obtained,It is corresponding for s' and a' Characteristic vector space, I be unit matrix, the order of I andThe quantity of middle feature vector is equal.

10. a kind of path planning apparatus of dynamic random environment characterized by comprising obtain module, establish module, judgement Module, node processing module, update module, loop module, weight computing module, feature calculation module, Q value table computing module and Path selection module；

The acquisition module, for obtain the initial value of eligibility trace, construct the initial value of column vector, structural matrix initial value, The state value of start node and the state value of terminal node；The state value of the start node includes the space of the start node Coordinate, the state value of the terminal node include the space coordinate of the terminal node；

It is described to establish module, for the initial value and the CMAC according to the weight row vector of CMAC Neural Network CMAC hidden layer Activation primitive, construct the characteristic vector space of the dynamic random environment；

The state value of the loop module, the start node for obtaining the acquisition module assigns initial intermediate quantity；

The node processing module is used for the initial intermediate quantity generated according to the loop module, obtains the start node Run action, the state value of advance node and the run action of advance node；

The update module, the initial intermediate quantity, the acquisition module for being generated according to the loop module obtain The initial value of the eligibility trace, it is described establish module building described eigenvector space, it is described obtain module obtain it is described Construct the initial value of column vector, initial value, the node processing module for obtaining the structural matrix that module obtains obtains The state value and the section for the advance node that the run action of the start node taken, the node processing module obtain The run action for the advance node that point processing module obtains, is calculated according to the recurrence least square Q intensified learning based on CMAC Method is updated the initial value, the initial value of the construction column vector and the initial value of the structural matrix of the eligibility trace；

The node processing module is also used to the advance node for obtaining the node processing module in the loop module State value assign the initial intermediate quantity after, according to the initial intermediate quantity that the loop module generates, obtain it is described just The run action of the run action of beginning node, the state value of advance node and advance node；The run action of the start node It is corresponded with the state value of the advance node；

When the judgment module determines the initial intermediate quantity that the loop module generates and the institute for obtaining module and obtaining State terminal node state value it is identical when, the node processing module is also used to obtain the acquisition module in the loop module After the state value of the start node taken assigns the initial intermediate quantity, according to the loop module generate it is described it is initial in The area of a room obtains the run action of the run action of the start node, the state value of advance node and advance node；

In all initial intermediate quantities that the judgment module determines the loop module generation, there are at the beginning of predetermined number When beginning intermediate quantity is identical as the state value of the terminal node that the acquisition module obtains, the weight computing module is used for root According to the update module update current time the structural matrix initial value and it is described construction column vector initial value, according to The determining value of the weight row vector is calculated according to recurrence least square solution formula；

The feature calculation module, the determining for the weight row vector for being calculated according to the weight computing module are worth to institute State establish module building described eigenvector space be updated, to obtain target feature vector space；

The Q value table computing module, the weight row vector for being calculated according to the weight computing module determine value and The target feature vector space that the feature calculation module obtains calculates final Q value table according to default value of Q calculation formula；

The path selection module, the final Q value table for being calculated according to the Q value table computing module determine the dynamic Optimal path between start node described in random environment and the terminal node.

11. the path planning apparatus of dynamic random environment according to claim 10, which is characterized in that the node processing Module is specifically used for:

The executable execution movement of the corresponding node of the initial intermediate quantity for determining that the loop module generates is described initial First movement of node；

The execution movement includes any one of following: up, down, left and right.

12. the path planning apparatus of dynamic random environment according to claim 11, which is characterized in that the node processing Module chooses the initial section from first movement according to the initial intermediate quantity and initial Q value table, according to greedy algorithm The process of the run action of point specifically includes:

13. the path planning apparatus of dynamic random environment according to claim 10, which is characterized in that the node processing Module is specifically used for:

The execution movement includes any one of following: up, down, left and right.

14. the path planning apparatus of dynamic random environment according to claim 13, which is characterized in that the node processing Module is according to the state value of the initial intermediate quantity and the terminal node, according to selecting bad principle heuristic search algorithm, from described The process for the run action for choosing the start node in first movement specifically includes:

15. the path planning apparatus of dynamic random environment according to claim 10, which is characterized in that the update module It is specifically used for:

The initial intermediate quantity generated according to the loop module and the described eigenvector space for establishing module building, The initial value for the eligibility trace that the acquisition module obtains is updated according to eligibility trace more new formula is preset, to obtain more The initial value of the new eligibility trace；

According to the updated value of the initial value of the construction column vector of the acquisition module acquisition and the eligibility trace, according to default Construction column vector more new formula is updated the initial value of the construction column vector, to obtain the construction column vector updated Initial value；

According to the updated value of the eligibility trace, the initial intermediate quantity of loop module generation, the node processing module It is the state value for the advance node that the run action of the start node that obtains, the node processing module obtain, described Node processing module obtain the advance node run action, it is described establish module building described eigenvector space and The initial value for obtaining the structural matrix that module obtains, according to pre-set configuration matrix update formula to the structural matrix Initial value be updated, with obtain update the structural matrix initial value.

16. a kind of path planning apparatus of dynamic random environment, which is characterized in that including memory, processor, bus and communication Interface；For storing computer executed instructions, the processor is connect with the memory by the bus memory； When the operation of the path planning apparatus of the dynamic random environment, the processor executes the calculating of the memory storage Machine executes instruction, so that the path planning apparatus of the dynamic random environment is executed as claim 1-9 is described in any item dynamic The paths planning method of state random environment.

17. a kind of computer storage medium, which is characterized in that the computer storage medium includes computer executed instructions, when When the computer executed instructions are run on computers, so that the computer is executed as described in claim any one of 1-9 Dynamic random environment paths planning method.