CN109343532A - Path planning method and device for a dynamic random environment - Google Patents
Path planning method and device for a dynamic random environment
- Publication number
- CN109343532A (application number CN201811329446.3A)
- Authority
- CN
- China
- Prior art keywords
- node
- value
- initial
- state value
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS › G05—CONTROLLING; REGULATING › G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES › G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots › G05D1/02—Control of position or course in two dimensions › G05D1/021—specially adapted to land vehicles
- G05D1/0231—using optical position detecting means
- G05D1/0238—using obstacle or wall sensors
- G05D1/024—using obstacle or wall sensors in combination with a laser
- G05D1/0212—with means for defining a desired trajectory
- G05D1/0221—involving a learning process
- G05D1/0223—involving speed control of the vehicle
- G05D1/0276—using signals provided by a source external to the vehicle
Abstract
An embodiment of the present invention provides a path planning method and device for a dynamic random environment, relating to the field of computer information processing and capable of finding an optimal path in a dynamic random environment. The method comprises: defining a feature vector space; assigning the state value of the start node to an initial intermediate quantity; obtaining from that quantity the action of the start node and the state value and action of the advance node, while updating the intermediate parameters according to a CMAC-based recursive least squares Q reinforcement learning algorithm; then assigning the state value of the advance node to the initial intermediate quantity and repeating the above process, restarting the process from the state value of the start node whenever the initial intermediate quantity is identical to the state value of the terminal node; calculating the determined value of the weight row vector according to the recursive least squares solution formula to obtain the target feature vector space; and obtaining the final Q-value table from the target feature vector space and the determined value of the weight row vector, thereby obtaining the optimal path.
Description
Technical field
The present invention relates to the field of computer information processing, and in particular to a path planning method and device for a dynamic random environment.
Background art
Obstacle avoidance is an indispensable part of path optimization; path optimization in a dynamic random environment essentially means finding the shortest path from an initial point to a target point while avoiding obstacles. Existing path-finding algorithms, such as breadth-first search, ant colony algorithms, genetic algorithms, and the A* algorithm, require specific information about the environment model; that is, they place high demands on the precision of the environment model and the path search space. In large-scale role-playing game scenes, however, randomly appearing obstacles such as other players and monsters, together with intrinsic mountains, water, and forests, make the environment model and the path search space dynamic and random. To that extent, traditional path optimization algorithms are not suitable for the obstacle avoidance problem in path optimization.
Reinforcement learning belongs to the class of search algorithms: it can traverse all paths even when the state and the environment are unknown, obtain the objective-function value of each path according to a given reward function, and select the path with the largest objective-function value; combined with a neural network, it can achieve obstacle avoidance and path optimization in a dynamic random environment. However, globally approximating neural networks usually train slowly, and the computing resources (memory, etc.) and cost (time, etc.) they require in large game scenes do not meet user experience requirements. Locally approximating neural networks are therefore usually adopted, but the most important potential limitation of local approximation is that the number of required feature units grows exponentially with the dimension of the input space, so local approximation cannot achieve globally optimal path planning.
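As an illustration of why a CMAC helps with the feature-growth problem described above, the following sketch shows CMAC-style tile coding: each tiling activates exactly one unit, so a 2-D state is encoded by a handful of active features rather than a dense grid. All parameters here (four tilings, a 12 × 12 grid per tiling, the tile width) are illustrative assumptions, not values from the patent.

```python
import numpy as np

def cmac_features(state, n_tilings=4, tile_width=5.0, grid=12, seed=0):
    """Map a continuous 2-D state to a sparse binary feature vector
    using overlapping, randomly offset tilings (CMAC-style coarse coding)."""
    rng = np.random.default_rng(seed)
    offsets = rng.uniform(0.0, tile_width, size=(n_tilings, 2))  # fixed per-tiling offsets
    features = np.zeros(n_tilings * grid * grid)
    for t in range(n_tilings):
        ix = int((state[0] + offsets[t, 0]) // tile_width) % grid
        iy = int((state[1] + offsets[t, 1]) // tile_width) % grid
        features[t * grid * grid + ix * grid + iy] = 1.0  # one active tile per tiling
    return features

phi = cmac_features(np.array([3.2, 7.9]))
assert phi.sum() == 4  # exactly one active unit per tiling
```

Because only `n_tilings` units are active for any state, a linear value function over these features can be evaluated and updated cheaply, which is the "fast approximation speed" the patent attributes to the CMAC.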
Summary of the invention
Embodiments of the present invention provide a path planning method and device for a dynamic random environment, for searching the optimal path between two nodes in a dynamic random environment while saving computing resources.
To achieve the above objectives, the embodiments of the present invention adopt the following technical solutions:
In a first aspect, a path planning method for a dynamic random environment is provided, comprising:
obtaining the initial value of the eligibility trace, the initial value of the construction column vector, the initial value of the construction matrix, the state value of the start node, and the state value of the terminal node; the state value of the start node comprises the space coordinates of the start node, and the state value of the terminal node comprises the space coordinates of the terminal node;
constructing the feature vector space of the dynamic random environment according to the initial value of the weight row vector of the hidden layer of a cerebellar model articulation controller (CMAC) neural network and the activation function of the CMAC;
assigning the state value of the start node to the initial intermediate quantity;
obtaining, according to the initial intermediate quantity, the action of the start node, the state value of the advance node, and the action of the advance node;
updating the initial value of the eligibility trace, the initial value of the construction column vector, and the initial value of the construction matrix according to the CMAC-based recursive least squares Q reinforcement learning algorithm, using the initial intermediate quantity, the initial value of the eligibility trace, the feature vector space, the initial value of the construction column vector, the initial value of the construction matrix, the action of the start node, the state value of the advance node, and the action of the advance node;
after assigning the state value of the advance node to the initial intermediate quantity, obtaining, according to the initial intermediate quantity, the action of the start node, the state value of the advance node, and the action of the advance node; the actions of the start node correspond one-to-one with the state values of the advance node;
when it is determined that the initial intermediate quantity is identical to the state value of the terminal node, assigning the state value of the start node to the initial intermediate quantity and again obtaining, according to the initial intermediate quantity, the action of the start node, the state value of the advance node, and the action of the advance node;
when it is determined that a predetermined number of the initial intermediate quantities are identical to the state value of the terminal node, calculating the determined value of the weight row vector according to the recursive least squares solution formula from the current value of the construction matrix and the current value of the construction column vector;
updating the feature vector space according to the determined value of the weight row vector to obtain the target feature vector space;
calculating the final Q-value table according to the preset Q-value calculation formula from the determined value of the weight row vector and the target feature vector space; and determining the optimal path between the start node and the terminal node in the dynamic random environment according to the final Q-value table.
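The steps above can be pictured as a toy three-loop least-squares Q-learning procedure: an inner step loop that updates the eligibility trace, construction matrix, and construction column vector; an episode loop that restarts from the start node whenever the terminal node is reached; and a least-squares solve for the weight vector. This is a hedged approximation for illustration only, not the patented algorithm: one-hot features stand in for the CMAC feature vector space, and the 5 × 5 grid, discount, trace decay, and episode count are all invented.

```python
import numpy as np

N = 5                                     # 5 x 5 grid of nodes (invented example)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
START, GOAL = (0, 0), (4, 4)
GAMMA, LAM = 0.9, 0.6                     # discount and trace decay
D = N * N * len(ACTIONS)

def phi(s, a):
    """One-hot feature vector for a (node, action) pair."""
    v = np.zeros(D)
    v[(s[0] * N + s[1]) * len(ACTIONS) + a] = 1.0
    return v

def step(s, a):
    """Deterministic move on the grid, clipped at the borders."""
    nx = min(max(s[0] + ACTIONS[a][0], 0), N - 1)
    ny = min(max(s[1] + ACTIONS[a][1], 0), N - 1)
    s2 = (nx, ny)
    return s2, (10.0 if s2 == GOAL else -1.0)

rng = np.random.default_rng(7)
A = 1e-3 * np.eye(D)                      # "construction matrix" accumulator
b = np.zeros(D)                           # "construction column vector"
w = np.zeros(D)                           # weight row vector
for episode in range(300):
    s, z = START, np.zeros(D)             # assign the start node; reset the trace
    for _ in range(120):
        q = [w @ phi(s, i) for i in range(len(ACTIONS))]
        a = int(rng.integers(len(ACTIONS))) if rng.random() < 0.3 else int(np.argmax(q))
        s2, r = step(s, a)
        a2 = int(np.argmax([w @ phi(s2, i) for i in range(len(ACTIONS))]))
        z = GAMMA * LAM * z + phi(s, a)                    # eligibility trace update
        A += np.outer(z, phi(s, a) - GAMMA * phi(s2, a2))  # construction matrix update
        b += z * r                                         # construction vector update
        s = s2
        if s == GOAL:                     # terminal node reached
            break
    w = np.linalg.solve(A, b)             # least-squares solution for the weights

# Read a path off the learned Q-values by acting greedily from the start node.
s, path = START, [START]
while s != GOAL and len(path) < 20:
    a = int(np.argmax([w @ phi(s, i) for i in range(len(ACTIONS))]))
    s, _ = step(s, a)
    path.append(s)
```

The periodic matrix solve plays the role of the recursive least squares solution formula, and the episode restarts mirror the "assign the start node again at the terminal node" step of the method.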
In the technical solution provided by the above embodiment, the space of the entire dynamic random environment is first defined by the initial value of the CMAC weight row vector and the activation function to obtain the feature vector space. The state value of the start node is assigned to an intermediate value, the initial intermediate quantity; from it, the action of the start node and the state value and action of the advance node (the next node after the start node) are obtained, while the eligibility trace, construction matrix, and construction column vector relevant to the final determined value of the weight row vector are updated according to the CMAC-based recursive least squares Q reinforcement learning algorithm. The state value of the advance node is then assigned to the initial intermediate quantity and the above flow, starting from assigning the initial intermediate quantity, is repeated; whenever the initial intermediate quantity is identical to the state value of the terminal node, the flow restarts from assigning the state value of the start node, until a predetermined number of initial intermediate quantities have been identical to the state value of the terminal node. The determined value of the weight row vector is then calculated according to the recursive least squares solution formula and used to update the feature vector space into the target feature vector space; from the target feature vector space and the determined value of the weight row vector, the final Q-value table obtained through repeated reinforcement learning can be acquired, and from that table the optimal path from the start node to the terminal node. Because the technical solution provided by this embodiment of the invention combines the recursive least squares method, the multi-step Q reinforcement learning algorithm, and the CMAC into a triple-loop algorithm, it has the small computation and stable global convergence of the recursive least squares method, the fast approximation speed of the CMAC, and the optimal search of the multi-step Q reinforcement learning algorithm, so that in dynamic random environments such as the maps of massively multiplayer online games the algorithm can quickly obtain the final Q-value table, and hence the optimal path derived from it, while saving computing resources.
In a second aspect, a path planning device for a dynamic random environment is provided, comprising: an acquisition module, a building module, a judgment module, a node processing module, an update module, a loop module, a weight calculation module, a feature calculation module, a Q-value-table calculation module, and a path selection module.
The acquisition module is configured to obtain the initial value of the eligibility trace, the initial value of the construction column vector, the initial value of the construction matrix, the state value of the start node, and the state value of the terminal node; the state value of the start node comprises the space coordinates of the start node, and the state value of the terminal node comprises the space coordinates of the terminal node.
The building module is configured to construct the feature vector space of the dynamic random environment according to the initial value of the weight row vector of the CMAC hidden layer and the activation function of the CMAC.
The loop module is configured to assign the state value of the start node obtained by the acquisition module to the initial intermediate quantity.
The node processing module is configured to obtain, according to the initial intermediate quantity generated by the loop module, the action of the start node, the state value of the advance node, and the action of the advance node.
The update module is configured to update the initial value of the eligibility trace, the initial value of the construction column vector, and the initial value of the construction matrix according to the CMAC-based recursive least squares Q reinforcement learning algorithm, using the initial intermediate quantity generated by the loop module, the initial value of the eligibility trace obtained by the acquisition module, the feature vector space built by the building module, the initial values of the construction column vector and the construction matrix obtained by the acquisition module, and the action of the start node, the state value of the advance node, and the action of the advance node obtained by the node processing module.
The node processing module is further configured, after the loop module assigns the state value of the advance node obtained by the node processing module to the initial intermediate quantity, to obtain, according to the initial intermediate quantity generated by the loop module, the action of the start node, the state value of the advance node, and the action of the advance node; the actions of the start node correspond one-to-one with the state values of the advance node.
When the judgment module determines that the initial intermediate quantity generated by the loop module is identical to the state value of the terminal node obtained by the acquisition module, the node processing module is further configured, after the loop module assigns the state value of the start node obtained by the acquisition module to the initial intermediate quantity, to obtain, according to the initial intermediate quantity, the action of the start node, the state value of the advance node, and the action of the advance node.
When the judgment module determines that a predetermined number of the initial intermediate quantities generated by the loop module are identical to the state value of the terminal node obtained by the acquisition module, the weight calculation module is configured to calculate the determined value of the weight row vector according to the recursive least squares solution formula from the current values of the construction matrix and the construction column vector updated by the update module.
The feature calculation module is configured to update the feature vector space built by the building module according to the determined value of the weight row vector calculated by the weight calculation module, to obtain the target feature vector space.
The Q-value-table calculation module is configured to calculate the final Q-value table according to the preset Q-value calculation formula from the determined value of the weight row vector calculated by the weight calculation module and the target feature vector space obtained by the feature calculation module.
The path selection module is configured to determine the optimal path between the start node and the terminal node in the dynamic random environment according to the final Q-value table calculated by the Q-value-table calculation module.
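The module decomposition above is essentially a thin orchestration layer around the first-aspect method. A minimal, hypothetical skeleton of that wiring follows; every name and stub here is invented for illustration and does not appear in the patent.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Node = Tuple[int, int]

@dataclass
class PathPlanner:
    # Each field stands in for one or more modules of the second aspect.
    acquire: Callable[[], dict]               # acquisition module
    build: Callable[[dict], object]           # building module (feature space)
    step: Callable[[dict, object], dict]      # node processing + update modules
    done: Callable[[dict], bool]              # judgment module
    solve: Callable[[dict], object]           # weight / Q-value-table calculation
    extract: Callable[[object], List[Node]]   # path selection module

    def run(self) -> List[Node]:
        state = self.acquire()
        features = self.build(state)
        while not self.done(state):           # loop module: repeated episodes
            state = self.step(state, features)
        return self.extract(self.solve(state))

# Stub wiring: count "episodes" and hand back a fixed path once enough are done.
planner = PathPlanner(
    acquire=lambda: {"episodes": 0},
    build=lambda s: None,
    step=lambda s, f: {"episodes": s["episodes"] + 1},
    done=lambda s: s["episodes"] >= 3,
    solve=lambda s: [(0, 0), (0, 1), (1, 1)],
    extract=lambda q: q,
)
assert planner.run() == [(0, 0), (0, 1), (1, 1)]
```

The point of the sketch is only the control flow: the judgment module gates the loop, and path selection runs once, after the weights have been solved.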
In a third aspect, a path planning device for a dynamic random environment is provided, comprising: a memory, a processor, a bus, and a communication interface. The memory is used to store computer-executable instructions, and the processor is connected to the memory through the bus. When the path planning device runs, the processor executes the computer-executable instructions stored in the memory, causing the device to perform the path planning method for a dynamic random environment provided in the first aspect.
In a fourth aspect, a computer storage medium is provided, comprising computer-executable instructions which, when run on a computer, cause the computer to perform the path planning method for a dynamic random environment provided in the first aspect.
The path planning method and device for a dynamic random environment provided by the embodiments of the present invention operate as follows. The initial values of the eligibility trace, the construction column vector, and the construction matrix, and the state values of the start node and the terminal node (each comprising that node's space coordinates), are obtained. The feature vector space of the dynamic random environment is constructed according to the initial value of the weight row vector of the CMAC hidden layer and the activation function of the CMAC. The state value of the start node is assigned to the initial intermediate quantity, from which the action of the start node and the state value and action of the advance node are obtained; the eligibility trace, the construction column vector, and the construction matrix are updated according to the CMAC-based recursive least squares Q reinforcement learning algorithm. The state value of the advance node is then assigned to the initial intermediate quantity and the flow is repeated, the actions of the start node corresponding one-to-one with the state values of the advance node; whenever the initial intermediate quantity is identical to the state value of the terminal node, the flow restarts from assigning the state value of the start node, until a predetermined number of initial intermediate quantities have been identical to the state value of the terminal node. The determined value of the weight row vector is then calculated according to the recursive least squares solution formula from the current construction matrix and construction column vector; the feature vector space is updated into the target feature vector space; the final Q-value table is calculated according to the preset Q-value calculation formula from the determined value of the weight row vector and the target feature vector space; and the optimal path between the start node and the terminal node in the dynamic random environment is determined from the final Q-value table. Because this technical solution combines the recursive least squares method, the multi-step Q reinforcement learning algorithm, and the CMAC into a triple-loop algorithm, it has the small computation and stable global convergence of the recursive least squares method, the fast approximation speed of the CMAC, and the optimal search of the multi-step Q reinforcement learning algorithm, so that in dynamic random environments such as the maps of massively multiplayer online games it can quickly obtain the final Q-value table, and hence the optimal path, while saving computing resources.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a schematic flowchart of a path planning method for a dynamic random environment provided by an embodiment of the present invention;
Fig. 2 is a detailed flowchart of step 104 in Fig. 1;
Fig. 3 is a detailed flowchart of step 10412 in Fig. 2;
Fig. 4 is a detailed flowchart of step 10422 in Fig. 2;
Fig. 5 is a detailed flowchart of step 105 in Fig. 1;
Fig. 6 is a concrete example of a path planning method for a dynamic random environment provided by an embodiment of the present invention;
Fig. 7 is a simulation comparison of two path optimization algorithms in a 40 × 40 game grid environment, provided by an embodiment of the present invention;
Fig. 8 is a simulation comparison of two path optimization algorithms in a 50 × 50 game grid environment, provided by an embodiment of the present invention;
Fig. 9 is a comparison of the average learning curves corresponding to Fig. 8;
Fig. 10 is a comparison of the computation times corresponding to Fig. 8;
Fig. 11 is a structural diagram of a path planning device for a dynamic random environment provided by an embodiment of the present invention;
Fig. 12 is a structural diagram of another path planning device for a dynamic random environment provided by an embodiment of the present invention.
Specific embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
It should be noted that in the embodiments of the present invention, words such as "illustrative" or "for example" are used to indicate an example or illustration. Any embodiment or design described as "illustrative" or "for example" should not be interpreted as preferable to, or more advantageous than, other embodiments or designs; rather, these words are intended to present the related concept in a concrete way.
It should also be noted that in the embodiments of the present invention, "of", "relevant", and "corresponding" may sometimes be used interchangeably; when the difference is not emphasized, the intended meanings are consistent.
To describe the technical solutions of the embodiments clearly, words such as "first" and "second" are used to distinguish items with essentially identical functions or effects. Those skilled in the art will understand that such words do not limit quantity or execution order.
In computer games, especially massively multiplayer online role-playing games and multiplayer sports games, pathfinding has always been one of the vital tasks, and the path optimization algorithm itself directly affects the player's game experience. As technology develops, game scenes become increasingly complex, and the resources (memory, time) required by traditional path optimization algorithms grow exponentially. Continuing to use traditional path optimization algorithms would heavily occupy the computing resources needed by the game's other functional tasks and seriously affect the user experience, so an algorithm that can quickly find the globally optimal path while saving computing resources is needed to replace them.
The inventive concept of the present invention is introduced below:
Among traditional path optimization algorithms, the BFS (Breadth First Search) algorithm is a blind search algorithm: it scans all nodes in the map until a result is found, consumes considerable computing resources, and the resulting path is not necessarily optimal.
The heuristic search algorithm A* is the most effective direct search method for solving the shortest path in a static road network, and an efficient algorithm for many search problems; the closer the estimated distance is to the actual value, the faster the final search. For a dynamic random environment, however, it is not suitable.
A reinforcement learning algorithm can traverse all paths even when the states and the environment are unknown, compute the objective-function value of each path from a given reward function, and select the path with the maximum objective-function value; combined with a neural network, it can achieve obstacle avoidance and path optimization in a dynamic random scene. However, a globally approximating neural network usually trains slowly, and the computing resources (memory, etc.) and cost (time, etc.) it requires in a large-scale game scene do not meet user-experience requirements. A locally approximating neural network is therefore usually adopted, but its most important potential limitation is that the number of required feature units grows exponentially with the dimension of the input space, and local approximation cannot achieve planning of the globally optimal path.
CMAC (Cerebellar Model Articulation Controller) is a neural network with very strong local generalization ability, and thus has advantages over other neural networks: its weight-update algorithm is simple, it stores information in local structures, and its learning speed is fast while function-approximation capability is still guaranteed, making it well suited to online learning; moreover, its structure is simple and easy to realize in hardware and software. It was therefore contemplated to combine it with a traditional reinforcement learning algorithm and apply the combination to the task of automatic path-finding in online games. Due to that same characteristic, however, its shortcoming is that it can only achieve local optima, while the 'optimal' path in game-scene path planning is the global optimum. Recursive Least Squares (RLS), in turn, is an algorithm with a small computational load that can guarantee stable convergence to the global optimum. The inventors therefore conceived of combining the three into a CMAC-based recursive least squares Q reinforcement learning algorithm for path planning in a dynamic random environment.
Based on the above idea, referring to Fig. 1, an embodiment of the present invention provides a path planning method for a dynamic random environment, comprising:
101. Obtain the initial value of the eligibility trace, the initial value of the construction column vector, the initial value of the construction matrix, the state value of the start node and the state value of the terminal node.
The state value of the start node includes the space coordinates of the start node, and the state value of the terminal node includes the space coordinates of the terminal node.
102. Construct the feature vector space of the dynamic random environment according to the initial value of the weight row vector of the hidden layer of the cerebellar model articulation controller (CMAC) neural network and the activation function of the CMAC.
Specifically, the feature vector space is used to represent the dynamic random environment, and in particular serves as the sample space in the algorithm. Illustratively, the feature vector space is as follows:
where s is the state value of any node in the path between the start node and the terminal node obtained during operation of the algorithm, i.e., the initial intermediate quantity described in the subsequent steps; a is the run action of s; ω1 to ωN are the first to N-th elements of the weight row vector; and f is the activation function of the CMAC.
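A sparse CMAC feature vector of this kind can be pictured with a small sketch. The tiling counts, offsets and index layout below are illustrative assumptions; the patent does not specify the activation function f or the weights ω1 to ωN used here.

```python
import numpy as np

def cmac_features(s, a, n_tilings=8, n_tiles=16, n_actions=4):
    """Binary CMAC feature vector for state s = (x, y) and action a.

    Each of the n_tilings overlapping tilings activates exactly one
    tile, so the vector is sparse: n_tilings ones out of
    n_tilings * n_tiles * n_tiles * n_actions entries.
    """
    x, y = s
    phi = np.zeros(n_tilings * n_tiles * n_tiles * n_actions)
    for t in range(n_tilings):
        # Each tiling is shifted by a fraction of a tile width.
        off = t / n_tilings
        ix = int(x + off) % n_tiles
        iy = int(y + off) % n_tiles
        idx = ((t * n_tiles + ix) * n_tiles + iy) * n_actions + a
        phi[idx] = 1.0
    return phi

phi = cmac_features((3, 5), a=2)
print(int(phi.sum()))  # one active tile per tiling -> 8
```

Because only a handful of entries are nonzero, weight updates touch a few local cells, which is the local-generalization property the background section attributes to the CMAC.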
103. Assign the state value of the start node to the initial intermediate quantity.
Specifically, the definition of the initial intermediate quantity is introduced here merely for clarity of statement; in practice the initial intermediate quantity may be dispensed with, as long as the loop in the technical solution is completed.
104. Obtain the run action of the start node, the state value of the advance node and the run action of the advance node according to the initial intermediate quantity.
The run action of the start node corresponds one-to-one to the state value of the advance node.
When the start node and the terminal node are the same node, step 111 is executed after step 104.
Illustratively, referring to Fig. 2, step 104 specifically includes:
10411. Determine the executable execution actions of the node corresponding to the initial intermediate quantity as the first actions of the start node.
Specifically, an executable execution action here is one after whose execution by the node corresponding to the initial intermediate quantity no obstacle exists at the node reached; the purpose of this step is obstacle avoidance. In the actual algorithm, the state value of a node where an obstacle exists is set to 1, and the state value of a node where no obstacle exists is set to 0.
Illustratively, the execution actions include any of the following: up, down, left and right.
10412. Choose the run action of the start node from the first actions according to a greedy algorithm, based on the initial intermediate quantity and the initial Q value table.
Illustratively, referring to Fig. 3, step 10412 specifically includes:
104121. Determine the state value of a first node according to the initial intermediate quantity and a first action; the first actions and the state values of the first nodes correspond one to one.
104122. Choose a first Q value from the initial Q value table according to a second action and the state value of the first node corresponding to the second action; the second action is any of the first actions.
104123. Determine the second action with the maximum first Q value as the run action of the start node.
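Steps 104121 to 104123 can be sketched as follows. The Q table is modeled as a plain dict, and an ε-greedy exploration term is added as an assumption — the patent only names "a greedy algorithm", so the exploration probability and helper names are illustrative.

```python
import random

def greedy_run_action(s, q_table, first_actions, epsilon=0.1):
    """Among the executable first actions of node s, look up the Q value
    of each (state, action) pair in the initial Q table and return the
    action with the maximum Q value; with probability epsilon a random
    action is explored instead (an assumed epsilon-greedy variant).
    """
    if random.random() < epsilon:
        return random.choice(first_actions)
    return max(first_actions, key=lambda a: q_table.get((s, a), 0.0))

UP, DOWN, LEFT, RIGHT = range(4)
q = {((0, 0), RIGHT): 1.0, ((0, 0), UP): 0.2}
print(greedy_run_action((0, 0), q, [UP, DOWN, LEFT, RIGHT], epsilon=0.0))  # -> 3
```

With epsilon set to 0 the choice is purely greedy, matching the maximum-Q selection of step 104123.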
10413. Determine the state value of the advance node according to the initial intermediate quantity and the run action of the start node.
10414. Obtain the run action of the advance node according to a greedy algorithm, based on the state value of the advance node and the initial Q value table.
In the process of obtaining the run action of the advance node, it is not necessary to judge whether an obstacle exists at the node corresponding to its forward action.
10421. Determine the executable execution actions of the node corresponding to the initial intermediate quantity as the first actions of the start node.
Illustratively, the execution actions include any of the following: up, down, left and right.
10422. Choose the run action of the start node from the first actions according to a worst-selection heuristic search algorithm, based on the initial intermediate quantity and the state value of the terminal node.
Specifically, a heuristic search algorithm evaluates each position searched in the sample space, obtains the best position, and then searches onward from that position until the target is reached. Here, the heuristic factor follows the worst-selection principle: according to certain indicators (such as prior knowledge of the field), the worst trajectory is selected, so that the environment model (map model) can be learned and the worst feedback, i.e., the reward value, obtained. In practice, doing so instead allows the algorithm to find the optimal solution, i.e., the optimal path, faster than methods using any other prior knowledge.
Illustratively, referring to Fig. 4, step 10422 specifically includes:
104221. Determine the state value of a first node according to the initial intermediate quantity and a first action; the first actions and the state values of the first nodes correspond one to one.
104222. Calculate the heuristic factor value of the first node according to a heuristic factor formula, based on the state value of the first node and the state value of the terminal node.
Illustratively, the heuristic factor formula is:
W(s, a) = ||s' − Goal||²;
where W(s, a) is the heuristic factor, s' is the state value of the first node, Goal is the state value of the terminal node, s is the initial intermediate quantity, and a is the first action corresponding to s'.
104223. Determine the first action corresponding to the state value of the first node with the maximum heuristic factor value as the run action of the start node.
10423. Determine the state value of the advance node according to the initial intermediate quantity and the run action of the start node.
10424. Obtain the run action of the advance node according to the worst-selection heuristic search algorithm, based on the state value of the advance node and the state value of the terminal node.
105. Update the initial value of the eligibility trace, the initial value of the construction column vector and the initial value of the construction matrix according to a CMAC-based recursive least squares Q reinforcement learning algorithm, based on the initial intermediate quantity, the initial value of the eligibility trace, the feature vector space, the initial value of the construction column vector, the initial value of the construction matrix, the run action of the start node, the state value of the advance node and the run action of the advance node.
Specifically, the initial value of the eligibility trace, the initial value of the construction column vector and the initial value of the construction matrix are stored in a preset space and are updated again and again as the loop of the entire algorithm proceeds.
Illustratively, referring to Fig. 5, step 105 specifically includes:
1051. Update the initial value of the eligibility trace according to a preset eligibility trace update formula, based on the initial intermediate quantity and the feature vector space, to obtain the updated initial value of the eligibility trace.
Illustratively, the preset eligibility trace update formula is:
where e' is the updated initial value of the eligibility trace, e is the initial value of the eligibility trace, λ is the trace decay factor, γ is the discount factor, s is the initial intermediate quantity, a is the run action of the start node obtained according to s, and φ(s, a) is the feature vector space corresponding to s and a.
1052. Update the initial value of the construction column vector according to a preset construction column vector update formula, based on the initial value of the construction column vector and the updated initial value of the eligibility trace, to obtain the updated initial value of the construction column vector.
Illustratively, the preset construction column vector update formula is:
b' = e'r + b;
where b' is the updated initial value of the construction column vector, r is the reward value, and b is the initial value of the construction column vector.
1053. Update the initial value of the construction matrix according to a preset construction matrix update formula, based on the updated initial value of the eligibility trace, the initial intermediate quantity, the run action of the start node, the state value of the advance node, the run action of the advance node, the feature vector space and the initial value of the construction matrix, to obtain the updated initial value of the construction matrix.
Illustratively, the preset construction matrix update formula is:
where Ã is the updated initial value of the construction matrix, A is the initial value of the construction matrix, s' is the state value of the advance node obtained according to s, a' is the run action of the advance node obtained according to s, φ(s', a') is the feature vector space corresponding to s' and a', and I is an identity matrix whose order is equal to the number of feature vectors in φ(s', a').
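The update formulas of steps 1051 and 1053 appear as figures in the original publication, so the sketch below uses standard LSTD(λ)-style forms that are consistent with the variable definitions given above: e' = γλe + φ(s, a), b' = e'r + b as stated in step 1052, and an assumed rank-one update for Ã. The exact Ã recursion is an assumption, not the patent's published formula.

```python
import numpy as np

def update_trace_and_matrices(e, b, A, phi_sa, phi_spap, r, gamma=0.95, lam=0.5):
    """One pass of steps 1051-1053. phi_sa is phi(s, a) for the start
    node, phi_spap is phi(s', a') for the advance node, r is the reward.
    The construction-matrix update is an assumed LSTD(lambda)-style form.
    """
    e = gamma * lam * e + phi_sa                    # 1051: eligibility trace
    b = b + e * r                                   # 1052: b' = e'r + b
    A = A + np.outer(e, phi_sa - gamma * phi_spap)  # 1053: assumed form
    return e, b, A

n = 4
e, b, A = np.zeros(n), np.zeros(n), np.eye(n) * 1e-4  # small regularized start
phi_sa = np.array([1.0, 0.0, 0.0, 0.0])
phi_spap = np.array([0.0, 1.0, 0.0, 0.0])
e, b, A = update_trace_and_matrices(e, b, A, phi_sa, phi_spap, r=-1.0)
print(e[0], b[0])  # 1.0 -1.0
```

All three quantities live in the preset storage space and are refreshed on every pass of the innermost loop, exactly as the "again and again" wording of step 105 describes.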
106. Assign the state value of the advance node to the initial intermediate quantity.
After step 106, step 104 is executed.
107. When it is determined that the initial intermediate quantity is identical to the state value of the terminal node, assign the state value of the start node to the initial intermediate quantity.
After step 107, step 104 is executed.
Specifically, the loop of steps 104 to 107 is the innermost loop of the algorithm provided by the embodiment of the present invention; each cycle finds one path from a start node to the terminal node.
108. When it is determined that, among all initial intermediate quantities, a preset number of initial intermediate quantities are identical to the state value of the terminal node, calculate the determined value of the weight row vector according to a recursive least squares solution formula, based on the initial value of the construction matrix at the current moment and the initial value of the construction column vector at the current moment.
Specifically, the loop of steps 104 to 107 looks for different paths from start nodes to the terminal node, but this search process has a preset upper limit, i.e., it stops after a preset number of paths have been found; in practice this avoids the waste of computing resources that would be caused by continuing to run the algorithm after the optimal path has already been found. Finding a preset number of paths means that a preset number of initial intermediate quantities will be identical to the state value of the terminal node.
Illustratively, the recursive least squares solution formula is:
θ = Ãb';
where θ is the determined value of the weight row vector, Ã is the initial value of the construction matrix at the current moment, and b' is the initial value of the construction column vector at the current moment.
109. Update the feature vector space according to the determined value of the weight row vector, to obtain a target feature vector space.
Specifically, referring to the expression of the feature vector space in step 102, step 109 may replace ω1 to ωN in the feature vector space with the latest weight row vector obtained, to obtain the target feature vector space.
110. Calculate the final Q value table according to a preset Q value calculation formula, based on the determined value of the weight row vector and the target feature vector space.
Illustratively, the preset Q value calculation formula is:
where Qπ is the final Q value table, φ(s, a) is the target feature vector space, s is any initial intermediate quantity, and a is the run action of the start node obtained according to s.
111. Calculate the final Q value table according to the preset Q value calculation formula, based on the initial value of the weight row vector and the feature vector space.
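Steps 108, 110 and 111 can be sketched together. Here θ = Ãb' is taken literally, with Ã standing in for the inverse maintained by the RLS recursion, and the Q value of each state-action pair is modeled as the inner product θ·φ(s, a) — an assumed form consistent with a linear CMAC architecture, since the patent's Q formula is a figure.

```python
import numpy as np

def final_q_table(A, b, phi):
    """theta = A~ b' (step 108), then Q(s, a) = theta . phi(s, a) for
    every (state, action) pair whose feature vector is in phi (steps
    110/111). The inner-product Q form is an assumption.
    """
    theta = A @ b
    return {sa: float(theta @ f) for sa, f in phi.items()}

A = np.eye(3)                 # stands in for the maintained inverse
b = np.array([2.0, -1.0, 0.5])
phi = {("s0", "up"): np.array([1.0, 0.0, 0.0]),
       ("s0", "right"): np.array([0.0, 1.0, 1.0])}
q = final_q_table(A, b, phi)
print(q[("s0", "up")], q[("s0", "right")])  # 2.0 -0.5
```

Reading the optimal path out of such a table (step 112) then reduces to following, from the start node, the action with the maximum Q value at each node.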
112. Determine the optimal path between the start node and the terminal node in the dynamic random environment according to the final Q value table.
Illustratively, referring to Fig. 6 and taking a simple 5 × 8 grid experimental scene as an example, each grid cell represents a node, the sad face S represents the start node, the smiling face G represents the terminal node, and each state point has four selectable actions: up (↑), down (↓), left (←) and right (→). The objects present in the grid represent obstacles, and the embodiment of the present invention can finally obtain the path shown in the figure.
The game scene is regarded as the grid shown in Fig. 6, similar to a labyrinth: obstacles are treated as walls, the grid cell or position where the intelligent agent (Agent), i.e., the arrow, is located is regarded as the current state point, and the obstacle in each grid cell, set as a black rectangle, appears at random according to the game situation at the time. Before the terminal is reached, the cost of transferring from one state to the next is set to r = −1 and is regarded as the instantaneous reward in reinforcement learning. The problem of finding the optimal path is therefore transformed into finding the policy with the minimum spent cost from the initial state to the terminal state, and the technical solution provided by the embodiment of the present invention can finally obtain, through repeated trials, the final Q value table; the table contains the reward value available at each node of each path that starts from the start node and ends at the terminal node, so that the optimal path shown in Fig. 6 can also be obtained.
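A grid world of the kind described above and used in the examples below can be generated along these lines. The helper name and the uniform per-cell obstacle probability are assumptions for illustration (the patent's examples draw obstacle probabilities from a normal distribution instead); the tiletype convention (1 = obstacle, 0 = free) follows the text.

```python
import random

def make_maze(n, start, goal, p_obstacle=0.2, seed=0):
    """n x n grid in the tiletype convention of the examples: 1 marks an
    obstacle cell, 0 a free, searchable cell. The start and goal cells
    are always forced to be free.
    """
    rng = random.Random(seed)
    grid = [[1 if rng.random() < p_obstacle else 0 for _ in range(n)]
            for _ in range(n)]
    grid[start[0]][start[1]] = 0
    grid[goal[0]][goal[1]] = 0
    return grid

maze = make_maze(40, start=(1, 4), goal=(35, 34))
print(len(maze), maze[1][4], maze[35][34])  # 40 0 0
```

Regenerating the maze with a fresh seed on every run mirrors the random per-run labyrinths used in the simulation comparisons.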
In order to show the advantages of the technical solution provided by the embodiment of the present invention more clearly, two specific examples are described below.
Example one, simulation comparison in a 40 × 40 game gridding environment: referring to Fig. 7, the CMAC-based recursive least squares Q reinforcement learning algorithm with the worst-selection factor introduced in the embodiment of the present invention, CMAC-wRLSQ(λ), is compared with the multistep least squares Q reinforcement learning algorithm based on radial basis functions used in practice, RBF-LSQ(λ) (Radial Basis Function-Least Squares Q), by the optimal paths they search out in a 40 × 40 grid environment. Both algorithms use a reinforcement learning rate α = 0.1, a greedy strategy parameter ε = 0.1 of the greedy algorithm, an eligibility trace parameter λ = 0.5 and a regularization factor g = 10⁻⁴. The learning curve of the two algorithms is the number of steps required by the intelligent agent in each path optimization event, i.e., the cost spent; the program takes the average of 30 runs, and each run randomly generates a 40 × 40 labyrinth with initial point S(1, 4) and target point G(35, 34). The probability that each grid cell of the generated labyrinth produces an obstacle follows the same standard normal distribution, expressed by the following formula, in which tiletype = 1 indicates that the cell is an obstacle and tiletype = 0 indicates that the cell is free and can be searched;
As shown in Fig. 7, the two algorithms each obtain a path optimization result based on the Q value table obtained after 50 learning episodes (the value of the preset number provided in the above embodiment). In the figure, the matrix is the gridded game scene, the black squares indicate obstacles such as other players, scenery and animals, and the light broken lines represent the optimal paths searched out by the two algorithms in the 50th learning episode. It can be seen from the figure that neither algorithm searches out the true optimal path, but the algorithm provided by the technical solution of the embodiment of the present invention performs better: it advances almost directly from the initial point to the target point, makes almost no detours, and is closer to the true optimal path.
Example two, simulation comparison in a 50 × 50 game gridding environment: referring to Fig. 8, in order to increase the complexity and randomness of the experimental environment, not only is the scale increased but the probability of generating obstacles is also raised, with all other parameters consistent with example one, so as to verify the advantage of the algorithm of this patent in a large-scale dynamic random environment such as a massively multiplayer online role-playing game. The initial point is S(1, 4), the target point is G(45, 44), and the probability that each grid cell of the generated labyrinth produces an obstacle is expressed by the following formula:
Referring to Fig. 8, RBF-LSQ(λ) (left) and the CMAC-wRLSQ(λ) proposed in this patent (right) each show one path optimization result in the 50 × 50 game gridding environment when the number of learning episodes is 50. It can be seen from the figure that, as the scale of the game environment grows and its complexity rises, the path-finding effect of the traditional RBF-LSQ(λ) algorithm declines, while the algorithm proposed in this patent still performs excellently; moreover, as the difficulty of the environment increases, its advantage becomes ever more obvious.
Specifically, the curve graph shown in Fig. 9 is a comparison of the average learning curves of the two algorithms in example two over 30 runs in the 50 × 50 environment; the abscissa is the number of learning episodes, and the ordinate is the number of steps required to reach the target point. In a game, being able to find the optimal path quickly within a short time is the key to saving game running cost. It can be seen from the figure that the learning curve of the RBF-LSQ(λ) algorithm converges very slowly, and the algorithm in fact never searches out the optimal policy: the initial number of steps is 2200, it keeps decreasing through learning over the 100 learning episodes, and it is reduced to about 500 when the number of learning episodes reaches 100. The CMAC-wRLSQ(λ) algorithm proposed in this patent is greatly improved in learning rate compared with the traditional Q algorithm: the required number of steps is quickly reduced by learning from the initial 1200 to about 200 steps, and gradually converges when the number of learning episodes is 20, which can to some extent be called a qualitative leap. It can be seen from Fig. 9 and Fig. 10 that, both in terms of learning rate and in terms of the optimal path searched out, the algorithm is greatly improved compared with traditional path optimization algorithms.
In summary, the path planning method for a dynamic random environment provided by the embodiment of the present invention comprises: obtaining the initial value of the eligibility trace, the initial value of the construction column vector, the initial value of the construction matrix, the state value of the start node and the state value of the terminal node, the state value of the start node including the space coordinates of the start node and the state value of the terminal node including the space coordinates of the terminal node; constructing the feature vector space of the dynamic random environment according to the initial value of the weight row vector of the hidden layer of the CMAC neural network and the activation function of the CMAC; assigning the state value of the start node to the initial intermediate quantity; obtaining the run action of the start node, the state value of the advance node and the run action of the advance node according to the initial intermediate quantity; updating the initial value of the eligibility trace, the initial value of the construction column vector and the initial value of the construction matrix according to the CMAC-based recursive least squares Q reinforcement learning algorithm, based on the initial intermediate quantity, the initial value of the eligibility trace, the feature vector space, the initial value of the construction column vector, the initial value of the construction matrix, the run action of the start node, the state value of the advance node and the run action of the advance node; after assigning the state value of the advance node to the initial intermediate quantity, obtaining the run action of the start node, the state value of the advance node and the run action of the advance node according to the initial intermediate quantity, the run action of the start node corresponding one-to-one to the state value of the advance node; when it is determined that the initial intermediate quantity is identical to the state value of the terminal node, assigning the state value of the start node to the initial intermediate quantity and then obtaining the run action of the start node, the state value of the advance node and the run action of the advance node according to the initial intermediate quantity; when it is determined that, among all initial intermediate quantities, a preset number of initial intermediate quantities are identical to the state value of the terminal node, calculating the determined value of the weight row vector according to the recursive least squares solution formula, based on the initial value of the construction matrix at the current moment and the initial value of the construction column vector at the current moment; updating the feature vector space according to the determined value of the weight row vector, to obtain the target feature vector space; calculating the final Q value table according to the preset Q value calculation formula, based on the determined value of the weight row vector and the target feature vector space; and determining the optimal path between the start node and the terminal node in the dynamic random environment according to the final Q value table.
Thus, in the technical solution provided by the embodiment of the present invention, the space of the entire dynamic random environment may first be defined by the initial value of the weight row vector of the CMAC and the activation function, to obtain the feature vector space; the state value of the start node is assigned to an intermediate quantity, i.e., the initial intermediate quantity, and the run action of the start node, the state value of the advance node (the next node after the start node) and the run action of the advance node are obtained according to the initial intermediate quantity, while the eligibility trace, the construction matrix and the construction column vector, which are relevant to the final determination of the value of the weight row vector, are updated according to the CMAC-based recursive least squares Q reinforcement learning algorithm. The state value of the advance node is then assigned to the initial intermediate quantity and the above process following the assignment of the state value of the start node to the initial intermediate quantity is repeated, until the initial intermediate quantity is identical to the state value of the terminal node; the process starting from assigning the state value of the start node to the initial intermediate quantity is then repeated until a preset number of initial intermediate quantities identical to the state value of the terminal node have appeared. The determined value of the weight row vector is then calculated according to the recursive least squares solution formula, the feature vector space is updated to obtain the target feature vector space, and according to the target feature vector space and the determined value of the weight row vector the final Q value table, obtained through multiple rounds of reinforcement learning, can be acquired; according to the final Q value table, the optimal path from the start node to the terminal node can be obtained. Because the technical solution provided by the embodiment of the present invention combines the recursive least squares method, the multistep Q reinforcement learning algorithm and the CMAC into one cyclic algorithm, it has both the advantages of the small computational load and globally optimal stable convergence of the recursive least squares method and the fast approximation speed of the CMAC, and also possesses the optimal-search advantage of the multistep Q reinforcement learning algorithm, so that the algorithm can rapidly obtain the final Q value table in dynamic random environments such as the maps of massively multiplayer online games while saving computing resources, and obtain the optimal path according to the final Q value table.
Referring to Fig. 11, an embodiment of the present invention further provides a path planning apparatus 01 for a dynamic random environment, comprising: an obtaining module 21, an establishing module 22, a judging module 23, a node processing module 24, an updating module 25, a loop module 26, a weight calculation module 27, a feature calculation module 28, a Q value table calculation module 29 and a path selection module 30;
the obtaining module 21 is configured to obtain the initial value of the eligibility trace, the initial value of the construction column vector, the initial value of the construction matrix, the state value of the start node and the state value of the terminal node; the state value of the start node includes the space coordinates of the start node, and the state value of the terminal node includes the space coordinates of the terminal node;
the establishing module 22 is configured to construct the feature vector space of the dynamic random environment according to the initial value of the weight row vector of the hidden layer of the CMAC neural network and the activation function of the CMAC;
the loop module 26 is configured to assign the state value of the start node obtained by the obtaining module 21 to the initial intermediate quantity;
the node processing module 24 is configured to obtain the run action of the start node, the state value of the advance node and the run action of the advance node according to the initial intermediate quantity generated by the loop module 26;
the updating module 25 is configured to update the initial value of the eligibility trace, the initial value of the construction column vector and the initial value of the construction matrix according to the CMAC-based recursive least squares Q reinforcement learning algorithm, based on the initial intermediate quantity generated by the loop module 26, the initial value of the eligibility trace obtained by the obtaining module 21, the feature vector space constructed by the establishing module 22, the initial value of the construction column vector obtained by the obtaining module 21, the initial value of the construction matrix obtained by the obtaining module 21, the run action of the start node obtained by the node processing module 24, the state value of the advance node obtained by the node processing module 24 and the run action of the advance node obtained by the node processing module 24;
the node processing module 24 is further configured to, after the loop module 26 assigns the state value of the advance node obtained by the node processing module 24 to the initial intermediate quantity, obtain the run action of the start node, the state value of the advance node and the run action of the advance node according to the initial intermediate quantity generated by the loop module 26; the run action of the start node corresponds one-to-one to the state value of the advance node;
When judgment module 23 determines the initial intermediate quantity that loop module 26 generates and obtains the terminal node of the acquisition of module 21
State value it is identical when, node processing module 24 be also used to loop module 26 will acquire module 21 acquisition start node shape
After state value assigns initial intermediate quantity, according to the initial intermediate quantity that loop module 26 generates, the run action, preceding of start node is obtained
Into the state value of node and the run action of advance node;
When judgment module 23 determines in all initial intermediate quantities that loop module 26 generates, there are predetermined number it is initial in
When the area of a room is identical as the state value of terminal node for obtaining the acquisition of module 21, weight computing module 27 is used for according to update module 25
Initial and construction column vector the initial value of the structural matrix of update, calculates weight row vector according to recurrence least square solution formula
Determine value;
The feature calculation module 28 is configured to update the characteristic vector space constructed by the establishing module 22 according to the determined value of the weight row vector calculated by the weight computing module 27, to obtain the target characteristic vector space;
The Q value table computing module 29 is configured to calculate the final Q value table according to the preset Q value calculation formula, based on the determined value of the weight row vector calculated by the weight computing module 27 and the target characteristic vector space obtained by the feature calculation module 28;
The path selection module 30 is configured to determine, according to the final Q value table calculated by the Q value table computing module 29, the optimal path between the start node and the terminal node in the dynamic random environment.
Optionally, the node processing module 24 is specifically configured to:
determine the execution actions executable by the node corresponding to the initial intermediate quantity generated by the loop module 26 as the first actions of the start node;
choose the run action of the start node from the first actions according to a greedy algorithm, based on the initial intermediate quantity and the initial Q value table;
determine the state value of the advance node according to the initial intermediate quantity and the run action of the start node;
obtain the run action of the advance node according to the greedy algorithm, based on the state value of the advance node and the initial Q value table;
wherein the execution actions include any of the following: up, down, left, and right.
Optionally, the process by which the node processing module 24 chooses the run action of the start node from the first actions according to the greedy algorithm, based on the initial intermediate quantity and the initial Q value table, specifically includes:
determining the state value of the first node according to the initial intermediate quantity and the first action, the first actions corresponding one-to-one with the state values of the first nodes;
choosing a first Q value from the initial Q value table according to a second action and the state value of the first node corresponding to the second action, the second action being any of the first actions;
determining the second action with the largest first Q value as the run action of the start node.
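For illustration, the greedy choice just described — evaluate each candidate first action via the state of the node it reaches, then keep the action whose first Q value is largest — might be sketched as follows. The grid-coordinate states and the dict-based Q table are assumptions for the sketch:

```python
# Hypothetical sketch of the greedy selection: each candidate action leads
# to a "first node"; the (first-node state, action) entry of the initial
# Q table is the "first Q value"; the action with the largest one wins.

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def successor(state, action):
    dr, dc = ACTIONS[action]
    return (state[0] + dr, state[1] + dc)

def greedy_action(q_table, state, candidates):
    """Return the candidate whose (first-node state, action) Q entry is largest."""
    best, best_q = None, float("-inf")
    for a in candidates:
        s_first = successor(state, a)       # state value of the "first node"
        q = q_table.get((s_first, a), 0.0)  # the "first Q value"
        if q > best_q:
            best, best_q = a, q
    return best

q = {((0, 1), "right"): 0.9, ((1, 0), "down"): 0.2}
print(greedy_action(q, (0, 0), ["right", "down"]))  # -> right
```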
Optionally, the node processing module 24 is specifically configured to:
determine the execution actions executable by the node corresponding to the initial intermediate quantity generated by the loop module 26 as the first actions of the start node;
choose the run action of the start node from the first actions according to a selection-principle heuristic search algorithm, based on the initial intermediate quantity and the state value of the terminal node;
determine the state value of the advance node according to the initial intermediate quantity and the run action of the start node;
obtain the run action of the advance node according to the selection-principle heuristic search algorithm, based on the state value of the advance node and the state value of the terminal node;
wherein the execution actions include any of the following: up, down, left, and right.
Optionally, the process by which the node processing module 24 chooses the run action of the start node from the first actions according to the selection-principle heuristic search algorithm, based on the initial intermediate quantity and the state value of the terminal node, specifically includes:
determining the state value of the first node according to the initial intermediate quantity and the first action, the first actions corresponding one-to-one with the state values of the first nodes;
calculating the heuristic factor value of the first node according to the heuristic factor formula, based on the state value of the first node and the state value of the terminal node;
determining the first action corresponding to the state value of the first node with the largest heuristic factor value as the run action of the start node.
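For illustration, claim 8 gives the heuristic factor as W(s, a) = ‖s′ − Goal‖₂, the Euclidean distance from the candidate first node to the terminal node, and the text selects the action whose factor is largest. A sketch of that selection (grid coordinates and the action encoding are assumptions) could look like:

```python
import math

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def heuristic_factor(s_first, goal):
    """W(s, a) = ||s' - Goal||_2: Euclidean distance from the candidate
    first-node state s' to the goal state."""
    return math.dist(s_first, goal)

def heuristic_action(state, goal, candidates):
    # Per the text, the action with the LARGEST heuristic factor is chosen.
    scored = {}
    for a in candidates:
        dr, dc = ACTIONS[a]
        scored[a] = heuristic_factor((state[0] + dr, state[1] + dc), goal)
    return max(scored, key=scored.get)

print(heuristic_action((0, 0), (0, 2), ["right", "down"]))  # -> down
```

Note that maximizing W favors the candidate farthest from the goal; the sketch follows the text as written rather than the more usual minimize-distance convention.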
Optionally, the update module 25 is specifically configured to:
update the initial value of the eligibility trace obtained by the acquisition module 21 according to a preset eligibility trace update formula, based on the initial intermediate quantity generated by the loop module 26 and the characteristic vector space constructed by the establishing module 22, to obtain the updated initial value of the eligibility trace;
update the initial value of the construction column vector according to a preset construction column vector update formula, based on the initial value of the construction column vector obtained by the acquisition module 21 and the updated value of the eligibility trace, to obtain the updated initial value of the construction column vector;
update the initial value of the structural matrix according to a preset structural matrix update formula, based on the updated value of the eligibility trace, the initial intermediate quantity generated by the loop module 26, the run action of the start node obtained by the node processing module 24, the state value of the advance node obtained by the node processing module 24, the run action of the advance node obtained by the node processing module 24, the characteristic vector space constructed by the establishing module 22, and the initial value of the structural matrix obtained by the acquisition module 21, to obtain the updated initial value of the structural matrix.
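For illustration, the three-step update described above can be sketched with the common recursive-least-squares TD(λ) forms. The concrete expressions below (the γλ trace decay, the Sherman–Morrison-style inverse update) are assumptions drawn from the standard literature, not a verbatim transcription of the patent's formulas:

```python
# Hypothetical stdlib-only sketch of the update module's three steps.
# e: eligibility trace, b: construction column vector, A: structural matrix
# (maintained as an inverse), phi_sa / phi_next: feature vectors of
# (s, a) and (s', a'), r: reward.  All concrete forms are assumptions.

def update(e, b, A, phi_sa, phi_next, r, gamma=0.9, lam=0.8):
    n = len(phi_sa)
    # 1) eligibility trace: e' = gamma * lam * e + phi(s, a)
    e2 = [gamma * lam * ei + pi for ei, pi in zip(e, phi_sa)]
    # 2) construction column vector: b' = e' * r + b
    b2 = [e2i * r + bi for e2i, bi in zip(e2, b)]
    # 3) structural matrix, Sherman-Morrison-style RLS update of the
    #    inverse, with d = phi(s, a) - gamma * phi(s', a'):
    d = [pi - gamma * qi for pi, qi in zip(phi_sa, phi_next)]
    Ae = [sum(A[i][j] * e2[j] for j in range(n)) for i in range(n)]   # A e'
    dA = [sum(d[i] * A[i][j] for i in range(n)) for j in range(n)]    # d^T A
    denom = 1.0 + sum(dA[j] * e2[j] for j in range(n))                # 1 + d^T A e'
    A2 = [[A[i][j] - Ae[i] * dA[j] / denom for j in range(n)] for i in range(n)]
    return e2, b2, A2

e2, b2, A2 = update([0.0, 0.0], [0.0, 0.0],
                    [[1.0, 0.0], [0.0, 1.0]],   # A starts as the identity
                    [1.0, 0.0], [0.0, 1.0], r=1.0)
print(e2, b2)  # -> [1.0, 0.0] [1.0, 0.0]
```

With A maintained as an inverse in this way, the later solution step reduces to a single matrix-vector product, which is where the scheme's low per-step cost comes from.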
The path planning apparatus for a dynamic random environment provided by the embodiment of the present invention includes: an acquisition module, configured to obtain the initial value of the eligibility trace, the initial value of the construction column vector, the initial value of the structural matrix, the state value of the start node, and the state value of the terminal node, where the state value of the start node includes the space coordinates of the start node and the state value of the terminal node includes the space coordinates of the terminal node; an establishing module, configured to construct the characteristic vector space of the dynamic random environment according to the initial value of the weight row vector of the hidden layer of a cerebellar model articulation controller (CMAC) neural network and the activation function of the CMAC; a loop module, configured to assign the state value of the start node obtained by the acquisition module to the initial intermediate quantity; a node processing module, configured to obtain, according to the initial intermediate quantity generated by the loop module, the run action of the start node, the state value of the advance node, and the run action of the advance node; and an update module, configured to update the initial value of the eligibility trace, the initial value of the construction column vector, and the initial value of the structural matrix according to the recursive least squares Q reinforcement learning algorithm based on CMAC, based on the initial intermediate quantity generated by the loop module, the initial value of the eligibility trace obtained by the acquisition module, the characteristic vector space constructed by the establishing module, the initial value of the construction column vector and the initial value of the structural matrix obtained by the acquisition module, and the run action of the start node, the state value of the advance node, and the run action of the advance node obtained by the node processing module. The node processing module is further configured to, after the loop module assigns the state value of the advance node obtained by the node processing module to the initial intermediate quantity, obtain, according to the initial intermediate quantity generated by the loop module, the run action of the start node, the state value of the advance node, and the run action of the advance node, the run action of the start node corresponding one-to-one with the state value of the advance node. When the judgment module determines that the initial intermediate quantity is identical to the state value of the terminal node, the node processing module is further configured to, after the loop module assigns the state value of the start node obtained by the acquisition module to the initial intermediate quantity, obtain, according to the initial intermediate quantity generated by the loop module, the run action of the start node, the state value of the advance node, and the run action of the advance node. When the judgment module determines that, among all the initial intermediate quantities generated by the loop module, a predetermined number of initial intermediate quantities are identical to the state value of the terminal node obtained by the acquisition module, the weight computing module is configured to calculate the determined value of the weight row vector according to the recursive least squares solution formula, based on the updated value of the structural matrix and the updated value of the construction column vector; the feature calculation module is configured to update the characteristic vector space according to the determined value of the weight row vector, to obtain the target characteristic vector space; the Q value table computing module is configured to calculate the final Q value table according to the preset Q value calculation formula, based on the determined value of the weight row vector calculated by the weight computing module, the target characteristic vector space obtained by the feature calculation module, the state value of the advance node obtained by the node processing module, and the run action of the start node obtained by the node processing module; and the path selection module is configured to determine, according to the final Q value table calculated by the Q value table computing module, the optimal path between the start node and the terminal node in the dynamic random environment.
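For illustration, the characteristic vector space constructed by the establishing module from the CMAC activation function could resemble binary tile coding, a common realization of CMAC. The tiling count, tile width, and grid resolution below are assumptions for the sketch, not parameters stated in the patent:

```python
# Hypothetical sketch: build a binary CMAC-style feature vector for a 2-D
# grid state by tile coding. Several slightly offset tilings each activate
# exactly one cell; the concatenated indicator vector plays the role of
# the characteristic vector phi(s).

def cmac_features(state, n_tilings=4, tiles_per_dim=5, tile_width=1.0):
    x, y = state
    features = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings   # each tiling is shifted slightly
        ix = int((x + offset) / tile_width) % tiles_per_dim
        iy = int((y + offset) / tile_width) % tiles_per_dim
        one_hot = [0] * (tiles_per_dim * tiles_per_dim)
        one_hot[ix * tiles_per_dim + iy] = 1  # exactly one active cell per tiling
        features.extend(one_hot)
    return features

phi = cmac_features((2.0, 3.0))
print(sum(phi), len(phi))  # -> 4 100  (one active cell in each of 4 tilings)
```

Because only a handful of entries are ever non-zero, weight updates touch few components per step, which is the source of CMAC's fast approximation noted below.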
Therefore, in the technical solution provided by the embodiment of the present invention, the space of the entire dynamic random environment can first be defined by the initial value of the weight row vector of the CMAC and its activation function, yielding the characteristic vector space. The state value of the start node is assigned to an intermediate quantity, namely the initial intermediate quantity, according to which the run action of the start node, the state value of the node following the start node (the advance node), and the run action of the advance node are obtained; at the same time, according to the recursive least squares Q reinforcement learning algorithm based on CMAC, the eligibility trace related to the final determined value of the weight row vector, the structural matrix, and the construction column vector are updated. The state value of the advance node is then assigned to the initial intermediate quantity and the process following the assignment of the start node's state value to the initial intermediate quantity is repeated; whenever the initial intermediate quantity becomes identical to the state value of the terminal node, the process restarts from assigning the state value of the start node to the initial intermediate quantity, until a predetermined number of initial intermediate quantities identical to the state value of the terminal node have occurred. The determined value of the weight row vector is then calculated according to the recursive least squares solution formula, the characteristic vector space is updated with it to obtain the target characteristic vector space, and the final Q value table learned through repeated reinforcement can be obtained from the target characteristic vector space and the determined value of the weight row vector; from the final Q value table, the optimal path from the start node to the terminal node can be obtained. Because the technical solution provided by the embodiment of the present invention combines the recursive least squares method and the multi-step Q reinforcement learning algorithm with CMAC, forming three interleaved algorithms, it has both the small computation and globally stable convergence of the recursive least squares method and the fast approximation speed of CMAC, as well as the optimum-search advantage of the multi-step Q reinforcement learning algorithm, so that the algorithm can rapidly obtain the final Q value table, and the optimal path derived from it, in dynamic random environments such as the maps of multiplayer online games, while saving computing resources.
Referring to Figure 12, the embodiment of the present invention further provides another path planning apparatus for a dynamic random environment, including a memory 41, a processor 42, a bus 43, and a communication interface 44. The memory 41 is configured to store computer-executable instructions, and the processor 42 is connected to the memory 41 through the bus 43. When the path planning apparatus of the dynamic random environment operates, the processor 42 executes the computer-executable instructions stored in the memory 41, so that the path planning apparatus of the dynamic random environment performs the path planning method for a dynamic random environment provided by the above embodiments.
In a specific implementation, as one embodiment, the processor 42 (42-1 and 42-2) may include one or more CPUs, such as CPU0 and CPU1 shown in Figure 12. As one embodiment, the path planning apparatus of the dynamic random environment may include multiple processors 42, such as the processor 42-1 and the processor 42-2 shown in Figure 12. Each of these processors 42 may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). A processor 42 here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
The memory 41 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 41 may exist independently and be connected to the processor 42 through the communication bus 43, or the memory 41 may be integrated with the processor 42.
In a specific implementation, the memory 41 is configured to store the data of the present application and the computer-executable instructions corresponding to the software programs for executing the present application. The processor 42 may perform the various functions of the path planning apparatus of the dynamic random environment by running or executing the software programs stored in the memory 41 and calling the data stored in the memory 41.
The communication interface 44 uses any transceiver-like device for communicating with other devices or communication networks, such as a control system, a radio access network (RAN), or a wireless local area network (WLAN). The communication interface 44 may include a receiving unit implementing a receiving function and a transmitting unit implementing a transmitting function.
The bus 43 may be an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus 43 may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in Figure 12, but this does not mean that there is only one bus or only one type of bus.
The embodiment of the present invention further provides a computer storage medium, the computer storage medium including computer-executable instructions which, when run on a computer, cause the computer to perform the path planning method for a dynamic random environment provided by the above embodiments.
The embodiment of the present invention further provides a computer program that can be loaded directly into a memory and contains software code; after the computer program is loaded into and executed by a computer, the path planning method for a dynamic random environment provided by the above embodiments can be realized.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates transferring a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
Through the description of the above embodiments, it will be clear to those skilled in the art that, for convenience and brevity of description, only the division of the above functional modules is illustrated as an example. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the modules or units is only a logical functional division, and there may be other division manners in actual implementation. For example, multiple units or components may be combined or integrated into another apparatus, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms. Units described as separate components may or may not be physically separate, and a component shown as a unit may be one physical unit or multiple physical units; it may be located in one place or distributed across multiple different places. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit. If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions readily conceivable by those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (17)
1. A path planning method for a dynamic random environment, characterized by comprising:
obtaining an initial value of an eligibility trace, an initial value of a construction column vector, an initial value of a structural matrix, a state value of a start node, and a state value of a terminal node, the state value of the start node comprising the space coordinates of the start node and the state value of the terminal node comprising the space coordinates of the terminal node;
constructing a characteristic vector space of the dynamic random environment according to an initial value of a weight row vector of a hidden layer of a cerebellar model articulation controller (CMAC) neural network and an activation function of the CMAC;
assigning the state value of the start node to an initial intermediate quantity;
obtaining, according to the initial intermediate quantity, a run action of the start node, a state value of an advance node, and a run action of the advance node;
updating the initial value of the eligibility trace, the initial value of the construction column vector, and the initial value of the structural matrix according to a recursive least squares Q reinforcement learning algorithm based on CMAC, based on the initial intermediate quantity, the initial value of the eligibility trace, the characteristic vector space, the initial value of the construction column vector, the initial value of the structural matrix, the run action of the start node, the state value of the advance node, and the run action of the advance node;
after assigning the state value of the advance node to the initial intermediate quantity, obtaining, according to the initial intermediate quantity, the run action of the start node, the state value of the advance node, and the run action of the advance node, the run action of the start node corresponding one-to-one with the state value of the advance node;
when it is determined that the initial intermediate quantity is identical to the state value of the terminal node, after assigning the state value of the start node to the initial intermediate quantity, obtaining, according to the initial intermediate quantity, the run action of the start node, the state value of the advance node, and the run action of the advance node;
when it is determined that, among all the initial intermediate quantities, a predetermined number of initial intermediate quantities are identical to the state value of the terminal node, calculating a determined value of the weight row vector according to a recursive least squares solution formula, based on the initial value of the structural matrix at the current time and the initial value of the construction column vector at the current time;
updating the characteristic vector space according to the determined value of the weight row vector, to obtain a target characteristic vector space;
calculating a final Q value table according to a preset Q value calculation formula, based on the determined value of the weight row vector and the target characteristic vector space; and
determining, according to the final Q value table, an optimal path between the start node and the terminal node in the dynamic random environment.
2. The path planning method for a dynamic random environment according to claim 1, characterized in that obtaining, according to the initial intermediate quantity, the run action of the start node, the state value of the advance node, and the run action of the advance node comprises:
determining the execution actions executable by the node corresponding to the initial intermediate quantity as first actions of the start node;
choosing the run action of the start node from the first actions according to a greedy algorithm, based on the initial intermediate quantity and an initial Q value table;
determining the state value of the advance node according to the initial intermediate quantity and the run action of the start node; and
obtaining the run action of the advance node according to the greedy algorithm, based on the state value of the advance node and the initial Q value table;
wherein the execution actions comprise any of the following: up, down, left, and right.
3. The path planning method for a dynamic random environment according to claim 2, characterized in that choosing the run action of the start node from the first actions according to the greedy algorithm, based on the initial intermediate quantity and the initial Q value table, comprises:
determining a state value of a first node according to the initial intermediate quantity and the first action, the first actions corresponding one-to-one with the state values of the first nodes;
choosing a first Q value from the initial Q value table according to a second action and the state value of the first node corresponding to the second action, the second action being any of the first actions; and
determining the second action with the largest first Q value as the run action of the start node.
4. The path planning method for a dynamic random environment according to claim 1, characterized in that obtaining, according to the initial intermediate quantity, the run action of the start node, the state value of the advance node, and the run action of the advance node comprises:
determining the execution actions executable by the node corresponding to the initial intermediate quantity as first actions of the start node;
choosing the run action of the start node from the first actions according to a selection-principle heuristic search algorithm, based on the initial intermediate quantity and the state value of the terminal node;
determining the state value of the advance node according to the initial intermediate quantity and the run action of the start node; and
obtaining the run action of the advance node according to the selection-principle heuristic search algorithm, based on the state value of the advance node and the state value of the terminal node;
wherein the execution actions comprise any of the following: up, down, left, and right.
5. The path planning method for a dynamic random environment according to claim 4, characterized in that choosing the run action of the start node from the first actions according to the selection-principle heuristic search algorithm, based on the initial intermediate quantity and the state value of the terminal node, comprises:
determining a state value of a first node according to the initial intermediate quantity and the first action, the first actions corresponding one-to-one with the state values of the first nodes;
calculating a heuristic factor value of the first node according to a heuristic factor formula, based on the state value of the first node and the state value of the terminal node; and
determining the first action corresponding to the state value of the first node with the largest heuristic factor value as the run action of the start node.
6. The path planning method for a dynamic random environment according to claim 1, characterized in that updating the initial value of the eligibility trace, the initial value of the construction column vector, and the initial value of the structural matrix according to the recursive least squares Q reinforcement learning algorithm based on CMAC, based on the initial intermediate quantity, the initial value of the eligibility trace, the characteristic vector space, the initial value of the construction column vector, the initial value of the structural matrix, the run action of the start node, the state value of the advance node, and the run action of the advance node, comprises:
updating the initial value of the eligibility trace according to a preset eligibility trace update formula, based on the initial intermediate quantity and the characteristic vector space, to obtain an updated initial value of the eligibility trace;
updating the initial value of the construction column vector according to a preset construction column vector update formula, based on the initial value of the construction column vector and the updated initial value of the eligibility trace, to obtain an updated initial value of the construction column vector; and
updating the initial value of the structural matrix according to a preset structural matrix update formula, based on the updated initial value of the eligibility trace, the initial intermediate quantity, the run action of the start node, the state value of the advance node, the run action of the advance node, the characteristic vector space, and the initial value of the structural matrix, to obtain an updated initial value of the structural matrix.
7. The path planning method for a dynamic random environment according to claim 1, characterized in that the recursive least squares solution formula is:
θ = Ãb′;
wherein θ is the determined value of the weight row vector, Ã is the initial value of the structural matrix at the current time, and b′ is the initial value of the construction column vector at the current time;
and the preset Q value calculation formula is:
Qπ(s, a) = φ̃(s, a)θ;
wherein Qπ is the final Q value table, φ̃ is the target characteristic vector space, s is any initial intermediate quantity, and a is the run action of the start node obtained according to s.
8. The path planning method for a dynamic random environment according to claim 5, characterized in that the heuristic factor formula is:
W(s, a) = ‖s′ − Goal‖₂;
wherein W(s, a) is the heuristic factor, s′ is the state value of the first node, Goal is the state value of the terminal node, s is the initial intermediate quantity, and a is the first action corresponding to s′.
9. The path planning method for a dynamic random environment according to claim 6, characterized in that the preset eligibility trace update formula is:
e′ = γλe + φ̃(s, a);
wherein e′ is the updated initial value of the eligibility trace, e is the initial value of the eligibility trace, λ is the trace decay factor, γ is the discount factor, s is the initial intermediate quantity, a is the run action of the start node obtained according to s, and φ̃(s, a) is the characteristic vector space corresponding to s and a;
the preset construction column vector update formula is:
b′ = e′r + b;
wherein b′ is the updated initial value of the construction column vector, r is the reward value, and b is the initial value of the construction column vector;
and the preset structural matrix update formula is:
Ã = (I − Ae′(φ̃(s, a) − γφ̃(s′, a′))ᵀ / (1 + (φ̃(s, a) − γφ̃(s′, a′))ᵀAe′))A;
wherein Ã is the updated initial value of the structural matrix, A is the initial value of the structural matrix, s′ is the state value of the advance node obtained according to s, a′ is the run action of the advance node obtained according to s, φ̃(s′, a′) is the characteristic vector space corresponding to s′ and a′, and I is an identity matrix whose order equals the number of feature vectors in φ̃.
10. A path planning apparatus for a dynamic random environment, characterized by comprising: an acquisition module, an establishing module, a judgment module, a node processing module, an update module, a loop module, a weight computing module, a feature calculation module, a Q value table computing module, and a path selection module;
the acquisition module being configured to obtain an initial value of an eligibility trace, an initial value of a construction column vector, an initial value of a structural matrix, a state value of a start node, and a state value of a terminal node, the state value of the start node comprising the space coordinates of the start node and the state value of the terminal node comprising the space coordinates of the terminal node;
the establishing module being configured to construct a characteristic vector space of the dynamic random environment according to an initial value of a weight row vector of a hidden layer of a cerebellar model articulation controller (CMAC) neural network and an activation function of the CMAC;
the loop module being configured to assign the state value of the start node obtained by the acquisition module to an initial intermediate quantity;
the node processing module being configured to obtain, according to the initial intermediate quantity generated by the loop module, a run action of the start node, a state value of an advance node, and a run action of the advance node;
the update module being configured to update the initial value of the eligibility trace, the initial value of the construction column vector, and the initial value of the structural matrix according to a recursive least squares Q reinforcement learning algorithm based on CMAC, based on the initial intermediate quantity generated by the loop module, the initial value of the eligibility trace obtained by the acquisition module, the characteristic vector space constructed by the establishing module, the initial value of the construction column vector obtained by the acquisition module, the initial value of the structural matrix obtained by the acquisition module, the run action of the start node obtained by the node processing module, the state value of the advance node obtained by the node processing module, and the run action of the advance node obtained by the node processing module;
the node processing module being further configured to, after the loop module assigns the state value of the advance node obtained by the node processing module to the initial intermediate quantity, obtain, according to the initial intermediate quantity generated by the loop module, the run action of the start node, the state value of the advance node, and the run action of the advance node, the run action of the start node corresponding one-to-one with the state value of the advance node;
When the judgment module determines the initial intermediate quantity that the loop module generates and the institute for obtaining module and obtaining
State terminal node state value it is identical when, the node processing module is also used to obtain the acquisition module in the loop module
After the state value of the start node taken assigns the initial intermediate quantity, according to the loop module generate it is described it is initial in
The area of a room obtains the run action of the run action of the start node, the state value of advance node and advance node;
In all initial intermediate quantities that the judgment module determines the loop module generation, there are at the beginning of predetermined number
When beginning intermediate quantity is identical as the state value of the terminal node that the acquisition module obtains, the weight computing module is used for root
According to the update module update current time the structural matrix initial value and it is described construction column vector initial value, according to
The determining value of the weight row vector is calculated according to recurrence least square solution formula;
The feature calculation module, the determining for the weight row vector for being calculated according to the weight computing module are worth to institute
State establish module building described eigenvector space be updated, to obtain target feature vector space;
The Q value table computing module, the weight row vector for being calculated according to the weight computing module determine value and
The target feature vector space that the feature calculation module obtains calculates final Q value table according to default value of Q calculation formula;
The path selection module, the final Q value table for being calculated according to the Q value table computing module determine the dynamic
Optimal path between start node described in random environment and the terminal node.
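The weight-calculation and Q-value-table steps of claim 10 reduce to solving a linear system and evaluating a linear Q function. The sketch below is an assumption-laden illustration: the recursive least squares solution formula is taken to be A·w = b (a small ridge term `reg` is added here purely for numerical stability and is not from the patent), and the preset Q-value formula is taken to be the linear form Q(s, a) = w · φ(s, a).

```python
import numpy as np

def solve_weights(A, b, reg=1e-9):
    """Sketch of the weight-calculation module: solve A w = b for the
    weight row vector w. The ridge term `reg` is an added assumption."""
    n = A.shape[0]
    return np.linalg.solve(A + reg * np.eye(n), b)

def q_table(w, features):
    """Sketch of the Q-value-table calculation module, assuming the preset
    Q-value formula is linear: Q(s, a) = w . phi(s, a).

    `features` maps each (state, action) pair to its CMAC feature vector.
    """
    return {sa: float(w @ phi) for sa, phi in features.items()}
```

The path selection module would then follow the largest Q value from the start node toward the terminal node over this table.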
11. The path planning apparatus for a dynamic random environment according to claim 10, wherein the node processing module is specifically configured to:

determine the executable actions of the node corresponding to the initial intermediate quantity generated by the loop module as the first movements of the start node;

choose the run action of the start node from the first movements according to a greedy algorithm, based on the initial intermediate quantity and the initial Q-value table;

determine the state value of the advance node according to the initial intermediate quantity and the run action of the start node;

obtain the run action of the advance node according to the greedy algorithm, based on the state value of the advance node and the initial Q-value table;

wherein the executable actions comprise any one of the following: up, down, left, and right.
12. The path planning apparatus for a dynamic random environment according to claim 11, wherein the process by which the node processing module chooses the run action of the start node from the first movements according to the greedy algorithm, based on the initial intermediate quantity and the initial Q-value table, specifically comprises:

determining the state value of a first node according to the initial intermediate quantity and each first movement, the first movements corresponding one-to-one with the state values of the first nodes;

choosing a first Q value from the initial Q-value table according to a second movement and the state value of the first node corresponding to the second movement, the second movement being any one of the first movements;

determining the second movement with the largest first Q value as the run action of the start node.
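The greedy selection of claim 12 can be sketched as follows. The helper names (`step`, `q_table`) are assumptions: `step` yields the state value of the first node reached by a candidate movement, and `q_table` is the initial Q-value table keyed by (first-node state, movement).

```python
def choose_greedy_action(s, first_movements, step, q_table):
    """Sketch of claim 12: for each candidate first movement, compute the
    state value of the resulting first node, look up its Q value in the
    initial Q-value table, and return the movement with the largest Q value."""
    best_movement, best_q = None, float("-inf")
    for a in first_movements:
        s_first = step(s, a)                  # state value of the first node
        q = q_table.get((s_first, a), 0.0)    # first Q value (0.0 if unseen)
        if q > best_q:
            best_movement, best_q = a, q
    return best_movement
```

On a 2-D grid, `step` would simply add the movement's coordinate offset to the current state value.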
13. The path planning apparatus for a dynamic random environment according to claim 10, wherein the node processing module is specifically configured to:

determine the executable actions of the node corresponding to the initial intermediate quantity generated by the loop module as the first movements of the start node;

choose the run action of the start node from the first movements according to the merit-based heuristic search algorithm, based on the initial intermediate quantity and the state value of the terminal node;

determine the state value of the advance node according to the initial intermediate quantity and the run action of the start node;

obtain the run action of the advance node according to the merit-based heuristic search algorithm, based on the state value of the advance node and the state value of the terminal node;

wherein the executable actions comprise any one of the following: up, down, left, and right.
14. The path planning apparatus for a dynamic random environment according to claim 13, wherein the process by which the node processing module chooses the run action of the start node from the first movements according to the merit-based heuristic search algorithm, based on the initial intermediate quantity and the state value of the terminal node, specifically comprises:

determining the state value of a first node according to the initial intermediate quantity and each first movement, the first movements corresponding one-to-one with the state values of the first nodes;

calculating the heuristic factor value of each first node according to the heuristic factor formula, based on the state value of the first node and the state value of the terminal node;

determining the first movement corresponding to the state value of the first node with the largest heuristic factor value as the run action of the start node.
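The merit-based heuristic selection of claim 14 can be sketched as below. The heuristic factor formula is not given in this excerpt, so a negative Euclidean distance to the terminal node is assumed here (larger value means closer to the goal); the `step` helper is likewise an assumed name for the state transition.

```python
def choose_heuristic_action(s, goal, first_movements, step):
    """Sketch of claim 14: compute the state value of the first node reached
    by each candidate movement, score it with an assumed heuristic factor
    (negative Euclidean distance to the terminal node), and return the
    movement whose first node has the largest heuristic factor value."""
    def heuristic(node):
        # Assumed heuristic factor formula: closer to the goal scores higher.
        return -((node[0] - goal[0]) ** 2 + (node[1] - goal[1]) ** 2) ** 0.5
    return max(first_movements, key=lambda a: heuristic(step(s, a)))
```

Because the scoring depends only on distance to the terminal node, this biases the search toward the goal without consulting a Q table, matching the role claim 13 gives this branch.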
15. The path planning apparatus for a dynamic random environment according to claim 10, wherein the update module is specifically configured to:

update the initial value of the eligibility trace obtained by the acquisition module according to the preset eligibility-trace update formula, based on the initial intermediate quantity generated by the loop module and the feature vector space constructed by the establishing module, to obtain the updated initial value of the eligibility trace;

update the initial value of the construction column vector according to the preset construction-column-vector update formula, based on the initial value of the construction column vector obtained by the acquisition module and the updated value of the eligibility trace, to obtain the updated initial value of the construction column vector;

update the initial value of the structural matrix according to the preset structural-matrix update formula, based on the updated value of the eligibility trace, the initial intermediate quantity generated by the loop module, the run action of the start node obtained by the node processing module, the state value of the advance node obtained by the node processing module, the run action of the advance node obtained by the node processing module, the feature vector space constructed by the establishing module, and the initial value of the structural matrix obtained by the acquisition module, to obtain the updated initial value of the structural matrix.
16. A path planning apparatus for a dynamic random environment, comprising a memory, a processor, a bus, and a communication interface; the memory is configured to store computer-executable instructions, and the processor is connected with the memory via the bus; when the path planning apparatus of the dynamic random environment runs, the processor executes the computer-executable instructions stored in the memory, so that the path planning apparatus of the dynamic random environment performs the path planning method for a dynamic random environment according to any one of claims 1-9.
17. A computer storage medium, wherein the computer storage medium comprises computer-executable instructions, and when the computer-executable instructions run on a computer, the computer is caused to perform the path planning method for a dynamic random environment according to any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811329446.3A CN109343532A (en) | 2018-11-09 | 2018-11-09 | A kind of paths planning method and device of dynamic random environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109343532A true CN109343532A (en) | 2019-02-15 |
Family
ID=65314304
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811329446.3A Pending CN109343532A (en) | 2018-11-09 | 2018-11-09 | A kind of paths planning method and device of dynamic random environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109343532A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004013328A (en) * | 2002-06-04 | 2004-01-15 | Yamaha Motor Co Ltd | Evaluation value calculating method, evaluation value calculating device, control device of control object and evaluation value calculating program |
CN1857981A (en) * | 2006-05-24 | 2006-11-08 | 南京大学 | Group control lift dispatching method based on CMAC network |
CN102525795A (en) * | 2012-01-16 | 2012-07-04 | 沈阳理工大学 | Fast automatic positioning method of foot massaging robot |
CN104932267A (en) * | 2015-06-04 | 2015-09-23 | 曲阜师范大学 | Neural network learning control method adopting eligibility trace |
CN107038477A (en) * | 2016-08-10 | 2017-08-11 | 哈尔滨工业大学深圳研究生院 | A kind of neutral net under non-complete information learns the estimation method of combination with Q |
Non-Patent Citations (5)
Title |
---|
CHOONG S S, ET AL.: "Automatic design of hyper-heuristic based on reinforcement learning", Information Sciences * |
WANG ZHONGMIN (王仲民): "Research on Path Planning and Trajectory Tracking for Mobile Robots", China Doctoral Dissertations Full-text Database, Information Science & Technology * |
CHENG YUHU (程玉虎): "Research on Reinforcement Learning Methods in Continuous State-Action Spaces", Wanfang Dissertation Database * |
TONG XIAOLONG (童小龙): "Research on Learning Control Methods for Mobile Robots in Unknown Environments", China Master's Theses Full-text Database, Information Science & Technology * |
HUANG BINGMING (黄兵明): "Research on a Recursive Least Squares Reinforcement Learning Algorithm Based on Improved ELM", China Master's Theses Full-text Database, Information Science & Technology * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109741117A (en) * | 2019-02-19 | 2019-05-10 | 贵州大学 | A kind of discount coupon distribution method based on intensified learning |
CN109978243A (en) * | 2019-03-12 | 2019-07-05 | 北京百度网讯科技有限公司 | Track of vehicle planing method, device, computer equipment, computer storage medium |
CN111546347A (en) * | 2020-06-03 | 2020-08-18 | 中国人民解放军海军工程大学 | Mechanical arm path planning method suitable for dynamic environment |
CN111546347B (en) * | 2020-06-03 | 2021-09-03 | 中国人民解放军海军工程大学 | Mechanical arm path planning method suitable for dynamic environment |
CN112712193A (en) * | 2020-12-02 | 2021-04-27 | 南京航空航天大学 | Multi-unmanned aerial vehicle local route planning method and device based on improved Q-Learning |
CN113867639A (en) * | 2021-09-28 | 2021-12-31 | 北京大学 | Qualification trace calculator based on phase change memory |
CN113867639B (en) * | 2021-09-28 | 2024-03-19 | 北京大学 | Qualification trace calculator based on phase change memory |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109343532A (en) | A kind of paths planning method and device of dynamic random environment | |
CN111625361B (en) | Joint learning framework based on cooperation of cloud server and IoT (Internet of things) equipment | |
CN111582469A (en) | Multi-agent cooperation information processing method and system, storage medium and intelligent terminal | |
CN112016704B (en) | AI model training method, model using method, computer device and storage medium | |
CN105469143B (en) | Network-on-chip method for mapping resource based on neural network dynamic feature | |
CN111282267B (en) | Information processing method, information processing apparatus, information processing medium, and electronic device | |
WO2020155994A1 (en) | Hybrid expert reinforcement learning method and system | |
Green | AF: A framework for real-time distributed cooperative problem solving | |
Earle | Using fractal neural networks to play simcity 1 and conway's game of life at variable scales | |
CN112990987B (en) | Information popularization method and device, electronic equipment and storage medium | |
CN115300910B (en) | Confusion-removing game strategy model generation method based on multi-agent reinforcement learning | |
CN113599798A (en) | Chinese chess game learning method and system based on deep reinforcement learning method | |
Wang et al. | Learning to traverse over graphs with a Monte Carlo tree search-based self-play framework | |
Hölldobler et al. | Lessons Learned from AlphaGo. | |
CN112274935B (en) | AI model training method, application method computer device and storage medium | |
Kim et al. | Solving pbqp-based register allocation using deep reinforcement learning | |
CN109731338A (en) | Artificial intelligence training method and device, storage medium and electronic device in game | |
CN108874377A (en) | A kind of data processing method, device and storage medium | |
CN114037049A (en) | Multi-agent reinforcement learning method based on value function reliability and related device | |
CN111443806A (en) | Interactive task control method and device, electronic equipment and storage medium | |
Cao et al. | Intrinsic motivation for deep deterministic policy gradient in multi-agent environments | |
Ring et al. | Replicating deepmind starcraft ii reinforcement learning benchmark with actor-critic methods | |
Kang et al. | Self-organizing agents for reinforcement learning in virtual worlds | |
Ciantar et al. | Implementation of a Sudoku Puzzle Solver on a FPGA | |
Hasan et al. | Implementing artificially intelligent ghosts to play Ms. Pac-Man game by using neural network at social media platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190215 |