CN114770497B - Search and rescue method and device of search and rescue robot and storage medium - Google Patents
- Publication number
- CN114770497B (application CN202210328204.2A)
- Authority
- CN
- China
- Prior art keywords
- search
- rescue
- robot
- strategy
- action
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1679—Programme controls characterised by the tasks executed
Abstract
The invention provides a search and rescue method, a device, and a storage medium for a search and rescue robot, wherein the method comprises the following steps: when a search and rescue instruction is acquired, initializing the state of the search and rescue robot; generating an automatic search and rescue strategy from that state based on a trained search and rescue strategy model; and executing search and rescue actions according to the automatic search and rescue strategy. Training the automatic search and rescue strategy model comprises: constructing a search and rescue simulation environment and formulating a training task according to the simulation environment; constructing an automatic search and rescue strategy model based on the VDN algorithm; and initializing the automatic search and rescue strategy model and training it on the training task. The invention improves the learning efficiency of reinforcement learning for search and rescue robots and meets the real-time requirements of search and rescue robots in real tasks.
Description
Technical Field
The invention relates to a search and rescue method, a device, and a storage medium for a search and rescue robot, and belongs to the technical field of unmanned operation.
Background
A search and rescue robot is an intelligent robot that can replace human rescuers in dangerous tasks such as personnel rescue and information detection deep inside a disaster site during emergencies such as urban natural disasters, chemical explosions, and fires. When disasters such as earthquakes, fires, chemical explosions, or nuclear explosions occur, the building structures at the rescue site are extremely unstable and secondary disasters can occur at any time, posing great risks to the life and health of rescue workers. An intelligent search and rescue robot based on deep reinforcement learning can search for vital signs in crevices according to expert instructions, detect on-site information, adjust its search and rescue strategy according to real-time observation of the on-site environment, and avoid damage to itself. Such robots are an important branch and development direction of current intelligent robot applications and are of great significance for the intelligent development of rescue work.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a search and rescue method, a device, and a storage medium for a search and rescue robot, which improve the learning efficiency of reinforcement learning for search and rescue robots and meet the real-time requirements of search and rescue robots in real tasks.
To achieve the above purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a search and rescue method of a search and rescue robot, including:
when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot;
generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
and initializing an automatic search and rescue strategy model and training based on a training task.
Optionally, the training task is to set a search and rescue range in the simulation environment, configure p search and rescue robots, m survivors, and n obstacles within the search and rescue range, and generate a new obstacle at a random position within the search and rescue range at preset time intervals; the search and rescue robots search for survivors within the search and rescue range while avoiding collisions with obstacles and with one another.
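The training task above can be sketched as a minimal grid world. This is an illustrative simplification, not the patent's simulation environment: all class and parameter names (`SearchRescueEnv`, `spawn_every`, the grid size) are assumptions, and the obstacle-spawning rule is reduced to "one new obstacle at a random free cell every `spawn_every` steps".

```python
import random

class SearchRescueEnv:
    """Minimal grid-world sketch of the training task (names illustrative):
    p robots, m survivors, and n obstacles inside a square search and rescue
    range; a new obstacle appears every `spawn_every` steps."""

    def __init__(self, size=10, p=2, m=3, n=4, spawn_every=5, seed=0):
        self.rng = random.Random(seed)
        self.size, self.spawn_every, self.step_count = size, spawn_every, 0
        cells = [(x, y) for x in range(size) for y in range(size)]
        self.rng.shuffle(cells)
        self.robots = cells[:p]                        # p search and rescue robots
        self.survivors = set(cells[p:p + m])           # m survivors
        self.obstacles = set(cells[p + m:p + m + n])   # n initial obstacles

    def step(self):
        # Every `spawn_every` steps, a new obstacle appears at a random free cell.
        self.step_count += 1
        if self.step_count % self.spawn_every == 0:
            free = [(x, y) for x in range(self.size) for y in range(self.size)
                    if (x, y) not in self.obstacles and (x, y) not in self.robots]
            if free:
                self.obstacles.add(self.rng.choice(free))

env = SearchRescueEnv()
for _ in range(10):
    env.step()
```

After 10 steps with `spawn_every=5`, two additional obstacles have been spawned on top of the initial four.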
Optionally, initializing the automatic search and rescue strategy model and training based on the training task includes:
initializing an automatic search and rescue strategy model, wherein the automatic search and rescue strategy model comprises initializing a real action network, a target action network, a real evaluation network and a target evaluation network of the automatic search and rescue strategy model;
executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
training and updating the automatic search and rescue strategy model with the model training sample set, substituting the updated model for the initialized model, and repeating the above steps;
if the preset maximum iteration number is reached, training is completed.
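The four training steps above reduce to a collect-then-update loop. The sketch below is a structural outline only; `env_rollout` and `update` are hypothetical stand-ins for the patent's sample collection and VDN update, and the model is a placeholder dictionary rather than the four networks.

```python
def train(env_rollout, update, max_iters=3):
    """Skeleton of steps (1)-(4): roll out the current model to collect a
    training sample set, update the model on it, and iterate until the
    preset maximum number of iterations is reached."""
    model = {"iteration": 0}            # placeholder for the four networks
    for _ in range(max_iters):
        samples = env_rollout(model)    # step (2): collect the sample set D_T
        model = update(model, samples)  # step (3): train, then swap in the update
    return model                        # step (4): training complete

# Toy rollout/update functions, purely to exercise the loop.
model = train(lambda m: [m["iteration"]],
              lambda m, s: {"iteration": m["iteration"] + 1})
```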
Optionally, the obtaining the model training sample set includes:
acquiring the current state S and the environment observation value O of the search and rescue robot;
selecting an action strategy a from the action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
driving the search and rescue robot to search and rescue automatically in the simulation environment according to the action strategy a, and acquiring the next state S* and environmental observation O* of the search and rescue robot;
according to the next state S* and environmental observation O* of the search and rescue robot, obtaining the score R and the termination state E of the action strategy a;
saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R, and the termination state E as a cache playback array, denoted {φ(S), φ(S*), a, R, E};
storing the cache playback array in a pre-built cache playback experience pool D, and repeating the above steps until the number of cache playback arrays in the pool D reaches a preset number;
randomly selecting T cache playback arrays from the cache playback experience pool D to generate a model training sample set D_T.
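The cache playback experience pool D is a standard replay buffer. A minimal sketch, with illustrative names and a bounded capacity that the patent does not specify:

```python
import random
from collections import deque

class ReplayPool:
    """Sketch of the cache playback experience pool D: store
    {phi(S), phi(S*), a, R, E} arrays, then sample T of them uniformly."""

    def __init__(self, capacity=1000, seed=0):
        self.pool = deque(maxlen=capacity)  # oldest arrays are evicted when full
        self.rng = random.Random(seed)

    def store(self, phi_s, phi_s_next, a, r, done):
        self.pool.append((phi_s, phi_s_next, a, r, done))

    def sample(self, T):
        # D_T: T cache playback arrays drawn at random without replacement
        return self.rng.sample(list(self.pool), T)

pool = ReplayPool()
for k in range(50):
    pool.store((k,), (k + 1,), a=0, r=1.0, done=False)
batch = pool.sample(8)
```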
Optionally, the state includes search and rescue robot coordinates, and the environmental observations include obstacles, survivors, and other search and rescue robots within a preset range around the search and rescue robot coordinates.
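A sketch of extracting that environmental observation: entities within a preset range of the robot's coordinates. The function name, the Chebyshev-distance neighborhood, and the radius value are all illustrative assumptions, not from the patent.

```python
def observe(robot_xy, obstacles, survivors, others, radius=2):
    """Sketch of the environmental observation O: obstacles, survivors, and
    other robots within a preset (Chebyshev) radius of the robot's coordinates."""
    def near(p):
        return max(abs(p[0] - robot_xy[0]), abs(p[1] - robot_xy[1])) <= radius
    return {
        "obstacles": sorted(p for p in obstacles if near(p)),
        "survivors": sorted(p for p in survivors if near(p)),
        "robots": sorted(p for p in others if near(p)),
    }

obs = observe((5, 5), obstacles={(6, 6), (0, 0)},
              survivors={(4, 5)}, others={(9, 9)})
```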
Optionally, according to the next state S* and environmental observation O* of the search and rescue robot, obtaining the score R and the termination state E of the action strategy a includes:
if the search and rescue robot coordinates in the next state S* are within the search and rescue range and a survivor is present in the next environmental observation O*, acquiring preset bonus points;
if the search and rescue robot coordinates in the next state S* are within the search and rescue range, judging whether a collision occurs according to the search and rescue robot coordinates and the coordinates of obstacles or of other search and rescue robots; if a collision occurs, deducting a preset collision score from the bonus points to obtain the score R, and setting the termination state to terminated;
if the search and rescue robot coordinates in the next state S* are outside the search and rescue range, setting the termination state to terminated.
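The scoring rule above can be sketched as a single function. The numeric bonus and penalty values are illustrative placeholders; the patent only says "preset".

```python
def score_and_done(observed_survivor, collided, in_range,
                   bonus=10.0, collision_penalty=5.0):
    """Sketch of obtaining score R and termination state E: bonus points for
    observing a survivor inside the range, a collision penalty plus
    termination on impact, and termination on leaving the range.
    Numeric values are illustrative, not from the patent."""
    r, done = 0.0, False
    if in_range:
        if observed_survivor:
            r += bonus                # preset bonus points
        if collided:
            r -= collision_penalty    # deduct preset collision score
            done = True               # termination state set to terminated
    else:
        done = True                   # out of range: episode terminates
    return r, done

r, done = score_and_done(observed_survivor=True, collided=False, in_range=True)
```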
Optionally, the training and updating the automatic search and rescue strategy model through the model training sample set includes:
according to the model training sample set D_T^i of the i-th search and rescue robot, calculating the target reward value y_t^i of the t-th cache playback array of the i-th search and rescue robot:

y_t^i = R_t + γ·Q_i′(S_{t+1}, π′(O_{t+1}); ω′)

wherein i = 1, 2, 3 … p, t = 1, 2, 3 … T, R_t is the score of the t-th cache playback array, γ is the discount factor, π′(·) is the policy action generated by the target action network, ω′ is the weight parameter of the target evaluation network, Q_i′(·) is the evaluation value generated by the target evaluation network, and S_{t+1} is the next state of the current state S_t of the t-th cache playback array;
the target reward values y_t^i of the t-th cache playback array of all search and rescue robots are linearly added to obtain the joint target reward value y_t of all search and rescue robots:

y_t = Σ_{i=1}^{p} y_t^i
based on the target reward value y_t, constructing a first loss function and updating the weight parameter ω of the real evaluation network through gradient back propagation of the neural network; the first loss function is:

L(ω) = (1/T)·Σ_{t=1}^{T} ( y_t − Σ_{i=1}^{p} Q_i(S_t, π(O_t); ω) )²

wherein π(·) is the policy action generated by the real action network and Q_i(·) is the evaluation value generated by the real evaluation network;
based on the target reward value y_t, constructing a second loss function and updating the weight parameter θ of the real action network through gradient back propagation of the neural network; the second loss function is:

J(θ) = −(1/T)·Σ_{t=1}^{T} Σ_{i=1}^{p} Q_i(S_t, π(O_t; θ); ω)
if t % C = 0, updating the weight parameter ω′ of the target evaluation network and the weight parameter θ′ of the target action network according to the weight parameter ω of the real evaluation network and the weight parameter θ of the real action network:
ω′ ← τω + (1−τ)ω′
θ′ ← τθ + (1−τ)θ′
wherein C is the update period of the weight parameters of the target action network and the target evaluation network, and τ is the update coefficient;
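The soft update ω′ ← τω + (1−τ)ω′ applied every C steps can be sketched directly; parameters are plain lists of floats here rather than network weights, purely for illustration.

```python
def soft_update(target, online, tau=0.01):
    """Sketch of the target-network update w' <- tau*w + (1-tau)*w',
    performed every C training steps. `target` and `online` are flat
    lists of parameters standing in for the network weights."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]

target = [0.0, 0.0]
online = [1.0, 2.0]
target = soft_update(target, online, tau=0.5)
```

A small τ (e.g. 0.01) makes the target networks track the real networks slowly, which stabilizes the bootstrapped target reward values.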
for the model training sample set D_T of each search and rescue robot, ranking the obtained target reward values y_t^i from high to low, and, based on the ranking result, reassigning the selection probabilities ε of the corresponding action strategies a_t^i from low to high;
updating the strategy model based on the updated weight parameters ω, θ, ω′, and θ′ of the real action network, the real evaluation network, the target action network, and the target evaluation network, and the updated selection probability ε.
In a second aspect, the present invention provides a search and rescue device of a search and rescue robot, the device comprising:
the information acquisition module is used for initializing the self state of the search and rescue robot when the search and rescue instruction is acquired;
the strategy generation module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
and initializing an automatic search and rescue strategy model and training based on a training task.
In a third aspect, the invention provides a search and rescue device of a search and rescue robot, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate according to the instructions to perform the steps of the method described above.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of the above method.
Compared with the prior art, the invention has the beneficial effects that:
according to the search and rescue method, device, and storage medium for search and rescue robots of the invention, during reinforcement learning a scoring priority rule orders the obtained Q values of all search and rescue robots from high to low and assigns selection probabilities from low to high, which effectively solves the inertia problem in the learning process of multiple search and rescue robots. In addition, the VDN algorithm structure linearly adds the individual reward values of the search and rescue robots, and the sum serves as the basis for each robot to execute its own actions, which effectively solves the spurious reward problem in the learning process of multiple search and rescue robots. Together, these measures improve the learning efficiency of reinforcement learning for search and rescue robots and meet the real-time requirements of search and rescue robots in real tasks.
Drawings
Fig. 1 is a flowchart of a search and rescue method of a search and rescue robot according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Embodiment one:
as shown in fig. 1, the embodiment of the invention provides a search and rescue method of a search and rescue robot, which comprises the following steps:
1. when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot;
2. generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
3. executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
s1, constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
specifically, the training task is to set a search and rescue range in the simulation environment, configure p search and rescue robots, m survivors, and n obstacles within the search and rescue range, and generate a new obstacle at a random position within the search and rescue range at preset time intervals; the search and rescue robots search for survivors within the search and rescue range while avoiding collisions with obstacles and with one another.
S2, constructing an automatic search and rescue strategy model based on a VDN algorithm.
S3, initializing an automatic search and rescue strategy model and training based on a training task;
the training comprises the following steps:
(1) Initializing an automatic search and rescue strategy model, wherein the automatic search and rescue strategy model comprises initializing a real action network, a target action network, a real evaluation network and a target evaluation network of the automatic search and rescue strategy model;
(2) Executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
specifically, obtaining the model training sample set includes:
1. acquiring the current state S and the environment observation value O of the search and rescue robot;
the states include search and rescue robot coordinates, and the environmental observations include obstacles, survivors, and other search and rescue robots within a preset range around the search and rescue robot coordinates.
2. Selecting an action strategy a from the action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
the action set A includes, but is not limited to, a forward action, a backward action, a left-turn action, and a right-turn action.
3. Driving the search and rescue robot to search and rescue automatically in the simulation environment according to the action strategy a, and acquiring the next state S* and environmental observation O* of the search and rescue robot;
4. According to the next state S* and environmental observation O* of the search and rescue robot, obtaining the score R and the termination state E of the action strategy a;
if the search and rescue robot coordinates in the next state S* are within the search and rescue range and a survivor is present in the next environmental observation O*, acquiring preset bonus points;
if the search and rescue robot coordinates in the next state S* are within the search and rescue range, judging whether a collision occurs according to the search and rescue robot coordinates and the coordinates of obstacles or of other search and rescue robots; if a collision occurs, deducting a preset collision score from the bonus points to obtain the score R, and setting the termination state to terminated;
if the search and rescue robot coordinates in the next state S* are outside the search and rescue range, setting the termination state to terminated.
5. Saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R, and the termination state E as a cache playback array, denoted {φ(S), φ(S*), a, R, E};
6. Storing the cache playback array in a pre-built cache playback experience pool D, and repeating the above steps until the number of cache playback arrays in the pool D reaches a preset number;
7. Randomly selecting T cache playback arrays from the cache playback experience pool D to generate a model training sample set D_T.
(3) Training and updating the automatic search and rescue strategy model with the model training sample set, substituting the updated model for the initialized model, and repeating the above steps;
(4) If the preset maximum iteration number is reached, training is completed.
Specifically, training and updating the automatic search and rescue strategy model through the model training sample set comprises the following steps:
1. According to the model training sample set D_T^i of the i-th search and rescue robot, calculating the target reward value y_t^i of the t-th cache playback array of the i-th search and rescue robot:

y_t^i = R_t + γ·Q_i′(S_{t+1}, π′(O_{t+1}); ω′)

wherein i = 1, 2, 3 … p, t = 1, 2, 3 … T, R_t is the score of the t-th cache playback array, γ is the discount factor, π′(·) is the policy action generated by the target action network, ω′ is the weight parameter of the target evaluation network, Q_i′(·) is the evaluation value generated by the target evaluation network, and S_{t+1} is the next state of the current state S_t of the t-th cache playback array;
2. The target reward values y_t^i of the t-th cache playback array of all search and rescue robots are linearly added to obtain the joint target reward value y_t of all search and rescue robots:

y_t = Σ_{i=1}^{p} y_t^i

3. Based on the target reward value y_t, constructing a first loss function and updating the weight parameter ω of the real evaluation network through gradient back propagation of the neural network; the first loss function is:

L(ω) = (1/T)·Σ_{t=1}^{T} ( y_t − Σ_{i=1}^{p} Q_i(S_t, π(O_t); ω) )²

wherein π(·) is the policy action generated by the real action network and Q_i(·) is the evaluation value generated by the real evaluation network;
4. Based on the target reward value y_t, constructing a second loss function and updating the weight parameter θ of the real action network through gradient back propagation of the neural network; the second loss function is:

J(θ) = −(1/T)·Σ_{t=1}^{T} Σ_{i=1}^{p} Q_i(S_t, π(O_t; θ); ω)

5. If t % C = 0, updating the weight parameter ω′ of the target evaluation network and the weight parameter θ′ of the target action network according to the weight parameter ω of the real evaluation network and the weight parameter θ of the real action network:
ω′ ← τω + (1−τ)ω′
θ′ ← τθ + (1−τ)θ′
wherein C is the update period of the weight parameters of the target action network and the target evaluation network, and τ is the update coefficient;
6. For the model training sample set D_T of each search and rescue robot, ranking the obtained target reward values y_t^i from high to low, and, based on the ranking result, reassigning the selection probabilities ε of the corresponding action strategies a_t^i from low to high;
7. Updating the strategy model based on the updated weight parameters ω, θ, ω′, and θ′ of the real action network, the real evaluation network, the target action network, and the target evaluation network, and the updated selection probability ε.
In this embodiment:
(1) During the reinforcement learning of multiple search and rescue robots, a scoring priority rule orders the obtained Q values of all search and rescue robots from high to low and assigns selection probabilities from low to high, which effectively solves the inertia problem in the learning process of multiple search and rescue robots.
(2) During the reinforcement learning of multiple search and rescue robots, the VDN (Value Decomposition Network) algorithm structure linearly adds the individual reward values of the search and rescue robots, and the sum serves as the basis for each robot to execute its own actions, which effectively solves the spurious reward problem in the learning process of multiple search and rescue robots.
Embodiment two:
the embodiment of the invention provides a search and rescue device of a search and rescue robot, which comprises:
the information acquisition module is used for initializing the self state of the search and rescue robot when the search and rescue instruction is acquired;
the strategy generation module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
and initializing an automatic search and rescue strategy model and training based on a training task.
Embodiment III:
based on the first embodiment, the embodiment of the invention provides a search and rescue device of a search and rescue robot, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate according to the instructions to perform the steps of the method described above.
Embodiment four:
based on the first embodiment, the present invention provides a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above method.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.
Claims (9)
1. The search and rescue method of the search and rescue robot is characterized by comprising the following steps of:
when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot;
generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
initializing an automatic search and rescue strategy model and training based on a training task;
wherein the obtaining the model training sample set includes:
acquiring the current state S and the environment observation value O of the search and rescue robot;
selecting an action strategy a from the action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
driving the search and rescue robot to search and rescue automatically in the simulation environment according to the action strategy a, and acquiring the next state S* and environmental observation O* of the search and rescue robot;
according to the next state S* and environmental observation O* of the search and rescue robot, obtaining the score R and the termination state E of the action strategy a;
saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R, and the termination state E as a cache playback array, denoted {φ(S), φ(S*), a, R, E};
storing the cache playback array in a pre-built cache playback experience pool D, and repeating the above steps until the number of cache playback arrays in the pool D reaches a preset number;
randomly selecting T cache playback arrays from the cache playback experience pool D to generate a model training sample set D_T.
2. The search and rescue method of a search and rescue robot according to claim 1, wherein the training task is to set a search and rescue range in the simulation environment, configure p search and rescue robots, m survivors, and n obstacles within the search and rescue range, and generate a new obstacle at a random position within the search and rescue range at preset time intervals; the search and rescue robots search for survivors within the search and rescue range while avoiding collisions with obstacles and with one another.
3. A search and rescue method for a search and rescue robot as defined in claim 1, wherein initializing an automatic search and rescue strategy model and training based on a training task comprises:
initializing an automatic search and rescue strategy model, wherein the automatic search and rescue strategy model comprises initializing a real action network, a target action network, a real evaluation network and a target evaluation network of the automatic search and rescue strategy model;
executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
training and updating the automatic search and rescue strategy model with the model training sample set, and substituting the updated automatic search and rescue strategy model for the initialized automatic search and rescue strategy model in the above steps for the next iteration;
if the preset maximum iteration number is reached, training is completed.
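The initialize–collect–update iteration of claim 3 reduces to a simple loop skeleton; `collect` and `update` are hypothetical stand-ins for the sample-gathering and training steps, injected here only to show the control flow:

```python
def train(model, collect, update, max_iterations):
    """Repeat: gather a sample set with the current model, update the model,
    and substitute the updated model back in, until max_iterations is reached."""
    for _ in range(max_iterations):
        batch = collect(model)       # execute the training task, build D_T
        model = update(model, batch) # train on D_T, yielding the updated model
    return model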
4. A search and rescue method for a search and rescue robot according to claim 1, wherein the state includes search and rescue robot coordinates, and the environmental observations include obstacles, survivors, and other search and rescue robots within a preset range around the search and rescue robot coordinates.
5. The search and rescue method of a search and rescue robot according to claim 4, wherein obtaining the score R and the termination state E of the action strategy a according to the next state S* and environmental observation O* of the search and rescue robot comprises:
if the search and rescue robot coordinate in the next state S* is within the search and rescue range and a survivor is present in the next environmental observation O*, obtaining a preset bonus score;
if the search and rescue robot coordinate in the next state S* is within the search and rescue range, judging whether a collision has occurred from the search and rescue robot coordinate and the coordinates of obstacles or of other search and rescue robots; if a collision has occurred, deducting a preset collision score from the bonus score to obtain the score R, and setting the termination state to terminated;
if the search and rescue robot coordinate in the next state S* is outside the search and rescue range, setting the termination state to terminated.
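Claim 5's three cases can be expressed as one scoring function. The numeric bonus and penalty values below are assumptions, since the patent only calls them "preset":

```python
def score_and_termination(next_pos, search_range, survivor_seen, collided,
                          bonus=10.0, collision_penalty=5.0):
    """Return (score R, termination state E) for an action strategy.
    search_range is (xmin, ymin, xmax, ymax)."""
    x, y = next_pos
    xmin, ymin, xmax, ymax = search_range
    if not (xmin <= x <= xmax and ymin <= y <= ymax):
        return 0.0, True                     # out of the search and rescue range: terminate
    r = bonus if survivor_seen else 0.0      # preset bonus for observing a survivor
    if collided:                             # hit an obstacle or another robot
        return r - collision_penalty, True   # deduct the collision score and terminate
    return r, False
```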
6. A search and rescue method for a search and rescue robot as defined in claim 4, wherein training and updating the automatic search and rescue strategy model through the model training sample set comprises:
calculating, from the model training sample set D_T of the i-th search and rescue robot, the target reward value y_t^i of the t-th cache playback array:

y_t^i = R_t + γ · Q'_i(φ(S*_t), π'(φ(S*_t)); ω')

wherein i = 1, 2, 3 … p, t = 1, 2, 3 … T, R_t is the score of the t-th cache playback array, γ is a discount factor, π'(·) is the policy action generated by the target action network, ω' is the weight parameter of the target evaluation network, Q'_i(·) is the evaluation value generated by the target evaluation network, and φ(S*_t) is the feature vector of the next state following the current state S_t of the t-th cache playback array;
linearly adding the target reward values y_t^i of the t-th cache playback array over all search and rescue robots to obtain the total target reward value of all search and rescue robots, y_t = Σ_{i=1}^{p} y_t^i;
constructing a first loss function based on the target reward value y_t, and updating the weight parameter ω of the reality evaluation network through gradient back-propagation of the neural network; the first loss function is:

L(ω) = (1/T) Σ_{t=1}^{T} ( y_t − Σ_{i=1}^{p} Q_i(φ(S_t), π(φ(S_t); θ); ω) )²

wherein π(·) is the policy action generated by the reality action network and Q_i(·) is the evaluation value generated by the reality evaluation network;
constructing a second loss function based on the target reward value y_t, and updating the weight parameter θ of the reality action network through gradient back-propagation of the neural network; the second loss function is:

J(θ) = −(1/T) Σ_{t=1}^{T} Σ_{i=1}^{p} Q_i(φ(S_t), π(φ(S_t); θ); ω)
if t % C = 0, updating the weight parameters ω' and θ' of the target evaluation network and the target action network according to the weight parameters ω and θ of the reality evaluation network and the reality action network:
ω′←τω+(1-τ)ω′
θ′←τθ+(1-τ)θ′
wherein C is the update frequency of the weight parameters of the target action network and the target evaluation network, and tau is the update coefficient;
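The soft update ω′ ← τω + (1−τ)ω′ is applied elementwise over the network weights every C steps. A plain-list sketch (standing in for real network parameters):

```python
def soft_update(target_weights, source_weights, tau=0.01):
    """Blend reality-network weights into the target network: w' <- tau*w + (1-tau)*w'."""
    return [tau * w + (1.0 - tau) * wp
            for w, wp in zip(source_weights, target_weights)]
```

With small τ the target network trails the reality network slowly, which stabilizes the bootstrapped target reward values.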
for the model training sample set D_T of each search and rescue robot, ranking the obtained target reward values y_t^i from high to low, and reassigning, from low to high according to the ranking result, the selection probabilities ε of the action strategies a_t corresponding to the target reward values y_t^i;
updating the strategy model based on the updated weight parameters ω, θ, ω′, θ′ of the reality action network, the reality evaluation network, the target action network and the target evaluation network, and the selection probability ε.
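The per-agent targets and their linear (VDN-style) addition can be sketched as follows; zeroing the bootstrap term on termination is a standard convention assumed here, not stated in the claim:

```python
def vdn_target(scores, next_target_qs, terminated, gamma=0.95):
    """Total target reward value y_t = sum_i [ R_t + gamma * Q'_i(phi(S*_t), pi'(.)) ],
    linearly added over the p search and rescue robots."""
    per_agent = [r + (0.0 if terminated else gamma * q)
                 for r, q in zip(scores, next_target_qs)]
    return sum(per_agent)
```

This summation is what makes the algorithm a value-decomposition network: each robot's evaluation network contributes additively to one joint target.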
7. A search and rescue device of a search and rescue robot, the device comprising:
the information acquisition module is used for initializing the self state of the search and rescue robot when the search and rescue instruction is acquired;
the strategy generation module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
initializing an automatic search and rescue strategy model and training based on a training task;
wherein the obtaining the model training sample set includes:
acquiring the current state S and the environment observation value O of the search and rescue robot;
selecting an action strategy a from the action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
driving the search and rescue robot to perform automatic search and rescue in the simulation environment according to the action strategy a, and acquiring the next state S* and environmental observation O* of the search and rescue robot;
obtaining a score R and a termination state E of the action strategy a according to the next state S* and environmental observation O* of the search and rescue robot;
saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R, and the termination state E as a cache playback array, denoted {φ(S), φ(S*), a, R, E};
Storing the cache playback array into a pre-built cache playback experience pool D, and repeating the steps until the cache playback array in the cache playback experience pool D reaches the preset number;
randomly selecting T cache playback arrays from the cache playback experience pool D to generate a model training sample set D_T.
8. A search and rescue device of a search and rescue robot, characterized by comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor being operative according to the instructions to perform the steps of the method according to any one of claims 1-6.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210328204.2A CN114770497B (en) | 2022-03-31 | 2022-03-31 | Search and rescue method and device of search and rescue robot and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114770497A CN114770497A (en) | 2022-07-22 |
CN114770497B true CN114770497B (en) | 2024-02-02 |
Family
ID=82427854
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210328204.2A Active CN114770497B (en) | 2022-03-31 | 2022-03-31 | Search and rescue method and device of search and rescue robot and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114770497B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9811074B1 (en) * | 2016-06-21 | 2017-11-07 | TruPhysics GmbH | Optimization of robot control programs in physics-based simulated environment |
CN109733415A (en) * | 2019-01-08 | 2019-05-10 | 同济大学 | A kind of automatic Pilot following-speed model that personalizes based on deeply study |
US10792810B1 (en) * | 2017-12-14 | 2020-10-06 | Amazon Technologies, Inc. | Artificial intelligence system for learning robotic control policies |
CN111984018A (en) * | 2020-09-25 | 2020-11-24 | 斑马网络技术有限公司 | Automatic driving method and device |
CN113031528A (en) * | 2021-02-25 | 2021-06-25 | 电子科技大学 | Multi-legged robot motion control method based on depth certainty strategy gradient |
CN113276883A (en) * | 2021-04-28 | 2021-08-20 | 南京大学 | Unmanned vehicle driving strategy planning method based on dynamic generation environment and implementation device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11779837B2 (en) | Method, apparatus, and device for scheduling virtual objects in virtual environment | |
CN112015174B (en) | Multi-AGV motion planning method, device and system | |
CN109690576A (en) | The training machine learning model in multiple machine learning tasks | |
Earl et al. | A decomposition approach to multi-vehicle cooperative control | |
CN109523029A (en) | For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body | |
CN109740741B (en) | Reinforced learning method combined with knowledge transfer and learning method applied to autonomous skills of unmanned vehicles | |
CN107563653A (en) | Multi-robot full-coverage task allocation method | |
CN115562357A (en) | Intelligent path planning method for unmanned aerial vehicle cluster | |
CN114770497B (en) | Search and rescue method and device of search and rescue robot and storage medium | |
CN115906673B (en) | Combat entity behavior model integrated modeling method and system | |
KR101139259B1 (en) | Heap-based multi-agent system for the theater level, mission level or the engagement level simulation | |
Gao et al. | An adaptive framework to select the coordinate systems for evolutionary algorithms | |
Ponsini et al. | Analysis of soccer robot behaviors using time petri nets | |
Leonard et al. | Bootstrapped Neuro-Simulation as a method of concurrent neuro-evolution and damage recovery | |
Yuan et al. | Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks | |
CN114118441A (en) | Online planning method based on efficient search strategy under uncertain environment | |
Tang et al. | Reinforcement learning for robots path planning with rule-based shallow-trial | |
Eker et al. | A finite horizon dec-pomdp approach to multi-robot task learning | |
Shiltagh et al. | A comparative study: Modified particle swarm optimization and modified genetic algorithm for global mobile robot navigation | |
Schubert et al. | Decision support for crowd control: Using genetic algorithms with simulation to learn control strategies | |
Hart et al. | Dante agent architecture for force-on-force wargame simulation and training | |
de Carvalho Santos et al. | A hybrid ga-ann approach for autonomous robots topological navigation | |
CN116227361B (en) | Intelligent body decision method and device | |
Sinclair et al. | A generic cognitive architecture framework with personality and emotions for crowd simulation | |
Cruz-Álvarez et al. | Robotic behavior implementation using two different differential evolution variants |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||