CN114770497B - Search and rescue method and device of search and rescue robot and storage medium - Google Patents

Search and rescue method and device of search and rescue robot and storage medium

Info

Publication number
CN114770497B
CN114770497B
Authority
CN
China
Prior art keywords
search
rescue
robot
strategy
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210328204.2A
Other languages
Chinese (zh)
Other versions
CN114770497A (en)
Inventor
林泽阳
赖俊
陈希亮
王军
刘志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202210328204.2A priority Critical patent/CN114770497B/en
Publication of CN114770497A publication Critical patent/CN114770497A/en
Application granted granted Critical
Publication of CN114770497B publication Critical patent/CN114770497B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1679Programme controls characterised by the tasks executed

Abstract

The invention provides a search and rescue method and device of a search and rescue robot, and a storage medium. The method comprises the following steps: when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot; generating an automatic search and rescue strategy according to the self state based on a trained search and rescue strategy model; and executing search and rescue actions according to the automatic search and rescue strategy. The training of the automatic search and rescue strategy model comprises: constructing a search and rescue simulation environment and formulating a training task according to the simulation environment; constructing an automatic search and rescue strategy model based on the VDN algorithm; and initializing the automatic search and rescue strategy model and training it based on the training task. The invention can improve the learning efficiency of reinforcement learning of the search and rescue robot and meet the real-time requirements of the search and rescue robot in real tasks.

Description

Search and rescue method and device of search and rescue robot and storage medium
Technical Field
The invention relates to a search and rescue method and device of a search and rescue robot and a storage medium, and belongs to the technical field of unmanned operation.
Background
A search and rescue robot is an intelligent robot that can take the place of rescue personnel deep inside a disaster site and carry out dangerous tasks such as personnel rescue and information detection in emergencies such as urban natural disasters, chemical explosions and fires. When disasters such as earthquakes, fires, chemical explosions and nuclear explosions occur, the building structures at the rescue site are extremely unstable and secondary disasters can occur at any time, posing great risks to the lives and health of rescue workers. An intelligent search and rescue robot based on deep reinforcement learning can search for vital signs in crevices according to expert instructions, detect on-site information, adjust its search and rescue strategy according to real-time observation of the on-site environment, and avoid damage to itself. It is an important branch and development direction of current intelligent robot applications and is of great significance for the intelligent development of rescue work.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a search and rescue method, a search and rescue device and a storage medium of a search and rescue robot, which can improve the learning efficiency of reinforcement learning of the search and rescue robot and meet the real-time requirement of the search and rescue robot in real tasks.
In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:
in a first aspect, the present invention provides a search and rescue method of a search and rescue robot, including:
when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot;
generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
and initializing an automatic search and rescue strategy model and training based on a training task.
Optionally, the training task is to set a search and rescue range in the simulation environment, configure p search and rescue robots, m survivors and n obstacles within the search and rescue range, and generate a new obstacle at an arbitrary position in the search and rescue range at preset time intervals; the search and rescue robots search for survivors within the search and rescue range while avoiding collisions with obstacles and with other search and rescue robots.
Optionally, initializing the automatic search and rescue strategy model and training based on the training task includes:
initializing the automatic search and rescue strategy model, including initializing the real action network, the target action network, the real evaluation network and the target evaluation network of the automatic search and rescue strategy model;
executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
training and updating the automatic search and rescue strategy model through the model training sample set, substituting the updated automatic search and rescue strategy model for the initialized automatic search and rescue strategy model, and iterating the above steps;
if the preset maximum iteration number is reached, training is completed.
Optionally, the obtaining the model training sample set includes:
acquiring the current state S and the environment observation value O of the search and rescue robot;
selecting an action strategy a from the action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
driving the search and rescue robot to perform automatic search and rescue in the simulation environment according to the action strategy a, and acquiring the next state S* of the search and rescue robot and the next environment observation value O*;
obtaining a score R and a termination state E of the action strategy a according to the next state S* and the next environment observation value O* of the search and rescue robot;
saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R and the termination state E as a cache playback array, denoted {φ(S), φ(S*), a, R, E};
storing the cache playback array into a pre-built cache playback experience pool D, and repeating the above steps until the number of cache playback arrays in the cache playback experience pool D reaches a preset number;
randomly selecting T cache playback arrays from the cache playback experience pool D to generate a model training sample set D_T.
Optionally, the state includes search and rescue robot coordinates, and the environmental observations include obstacles, survivors, and other search and rescue robots within a preset range around the search and rescue robot coordinates.
Optionally, obtaining the score R and the termination state E of the action strategy a according to the next state S* and the next environment observation value O* of the search and rescue robot includes:
if the search and rescue robot coordinates in the next state S* are within the search and rescue range and a survivor exists in the next environment observation value O*, acquiring a preset bonus score;
if the search and rescue robot coordinates in the next state S* are within the search and rescue range, judging whether a collision occurs according to the search and rescue robot coordinates and the obstacle coordinates or the coordinates of other search and rescue robots; if a collision occurs, deducting a preset collision score from the bonus score to obtain the score R, and setting the termination state to termination;
if the search and rescue robot coordinates in the next state S* are outside the search and rescue range, setting the termination state to termination.
Optionally, training and updating the automatic search and rescue strategy model through the model training sample set includes:
according to the model training sample set D_T of the i-th search and rescue robot, calculating the target reward value y_i^t of the t-th cache playback array of the i-th search and rescue robot:
y_i^t = R_t + γQ_i′(φ(S_t*), π′(φ(S_t*)), ω′)
wherein i = 1, 2, 3 … p, t = 1, 2, 3 … T, R_t is the score of the t-th cache playback array, γ is the discount factor, π′(·) is the policy action generated by the target action network, ω′ is the weight parameter of the target evaluation network, Q_i′(·) is the evaluation value generated by the target evaluation network, and φ(S_t*) is the feature vector of the next state following the current state S_t of the t-th cache playback array;
linearly adding the target reward values y_i^t of the t-th cache playback array of all search and rescue robots to obtain the joint target reward value y^t of all search and rescue robots:
y^t = Σ_{i=1…p} y_i^t
constructing a first loss function based on the target reward value y^t, and updating the weight parameter ω of the real evaluation network through gradient back propagation of the neural network; the first loss function is:
L(ω) = (1/T)·Σ_{t=1…T} ( y^t − Σ_{i=1…p} Q_i(φ(S_t), π(φ(S_t)), ω) )²
wherein π(·) is the policy action generated by the real action network, and Q_i(·) is the evaluation value generated by the real evaluation network;
constructing a second loss function based on the target reward value y^t, and updating the weight parameter θ of the real action network through gradient back propagation of the neural network; the second loss function is:
J(θ) = −(1/T)·Σ_{t=1…T} Σ_{i=1…p} Q_i(φ(S_t), π(φ(S_t), θ), ω)
if t % C = 0, updating the weight parameters ω′ and θ′ of the target evaluation network and the target action network according to the weight parameters ω and θ of the real evaluation network and the real action network:
ω′←τω+(1-τ)ω′
θ′←τθ+(1-τ)θ′
wherein C is the update frequency of the weight parameters of the target action network and the target evaluation network, and τ is the update coefficient;
for each search and rescue robot, ranking the target reward values y_i^t obtained from the model training sample set D_T from high to low, and reassigning the selection probabilities ε of the action strategies a_i^t corresponding to the target reward values y_i^t from low to high according to the ranking result;
updating the strategy model based on the updated weight parameters ω, θ, ω′, θ′ of the real action network, the real evaluation network, the target action network and the target evaluation network, and the updated selection probability ε.
In a second aspect, the present invention provides a search and rescue device of a search and rescue robot, the device comprising:
the information acquisition module is used for initializing the self state of the search and rescue robot when the search and rescue instruction is acquired;
the strategy generation module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
and initializing an automatic search and rescue strategy model and training based on a training task.
In a third aspect, the invention provides a search and rescue device of a search and rescue robot, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is operative according to the instructions to perform steps according to the method described above.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of the above method.
Compared with the prior art, the invention has the beneficial effects that:
according to the search and rescue method, device and storage medium for the search and rescue robots, in the reinforcement learning process, the obtained Q values of all the search and rescue robots are sequentially ordered from high to low by setting the scoring priority rule, and the selection probability assignment is carried out from low to high, so that the inertia problem in the learning process of a plurality of search and rescue robots is effectively solved. And the VDN algorithm structure is applied to carry out linear addition on the individual search and rescue robot rewarding values, so that the individual search and rescue robot rewarding values are used as the basis for each robot to execute own actions, and the false rewarding problem in the learning process of a plurality of search and rescue robots is effectively solved. In conclusion, the learning efficiency of reinforcement learning of the search and rescue robot can be improved, and the real-time requirement of the search and rescue robot in a real task is met.
Drawings
Fig. 1 is a flowchart of a search and rescue method of a search and rescue robot according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Embodiment one:
as shown in fig. 1, the embodiment of the invention provides a search and rescue method of a search and rescue robot, which comprises the following steps:
1. when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot;
2. generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
3. executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
s1, constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
Specifically, the training task is to set a search and rescue range in the simulation environment, configure p search and rescue robots, m survivors and n obstacles within the search and rescue range, and generate a new obstacle at an arbitrary position in the search and rescue range at preset time intervals; the search and rescue robots search for survivors within the search and rescue range while avoiding collisions with obstacles and with other search and rescue robots. A minimal configuration sketch of such a task is given below.
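As a concrete illustration (not part of the claimed method), the following Python sketch configures such a training task: a bounded square search and rescue range containing p search and rescue robots, m survivors and n obstacles, with a new obstacle generated at a random position every fixed number of time steps. The class name, field names and numeric defaults are assumptions of this sketch.

import random
from dataclasses import dataclass, field

@dataclass
class SearchRescueEnv:
    """Toy training task: p robots, m survivors and n obstacles inside a square
    search and rescue range; a new obstacle appears at a random cell every
    `obstacle_period` time steps."""
    size: int = 20             # side length of the square search and rescue range
    p: int = 3                 # number of search and rescue robots
    m: int = 5                 # number of survivors
    n: int = 8                 # initial number of obstacles
    obstacle_period: int = 50  # steps between newly generated obstacles
    robots: list = field(default_factory=list)
    survivors: list = field(default_factory=list)
    obstacles: list = field(default_factory=list)
    t: int = 0

    def _random_cell(self):
        return (random.randrange(self.size), random.randrange(self.size))

    def reset(self):
        """Place robots, survivors and obstacles at random, non-overlapping cells."""
        occupied = set()
        def place():
            while True:
                c = self._random_cell()
                if c not in occupied:
                    occupied.add(c)
                    return c
        self.robots = [place() for _ in range(self.p)]
        self.survivors = [place() for _ in range(self.m)]
        self.obstacles = [place() for _ in range(self.n)]
        self.t = 0
        return self.robots

    def maybe_spawn_obstacle(self):
        """Generate a new obstacle at an arbitrary position every `obstacle_period` steps."""
        self.t += 1
        if self.t % self.obstacle_period == 0:
            self.obstacles.append(self._random_cell())

env = SearchRescueEnv(p=3, m=5, n=8)
env.reset()
for _ in range(120):
    env.maybe_spawn_obstacle()   # with obstacle_period = 50, two new obstacles appear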
S2, constructing an automatic search and rescue strategy model based on a VDN algorithm.
S3, initializing an automatic search and rescue strategy model and training based on a training task;
the training comprises the following steps:
(1) Initializing the automatic search and rescue strategy model, including initializing the real action network, the target action network, the real evaluation network and the target evaluation network of the automatic search and rescue strategy model;
(2) Executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
specifically, obtaining the model training sample set includes:
1. acquiring the current state S and the environment observation value O of the search and rescue robot;
the states include search and rescue robot coordinates, and the environmental observations include obstacles, survivors, and other search and rescue robots within a preset range around the search and rescue robot coordinates.
2. Selecting an action strategy a from the action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
the action set a includes, but is not limited to, forward action, backward action, left turn action, right turn action.
3. Driving the search and rescue robot to perform automatic search and rescue in the simulation environment according to the action strategy a, and acquiring the next state S* of the search and rescue robot and the next environment observation value O*;
4. Obtaining a score R and a termination state E of the action strategy a according to the next state S* and the next environment observation value O* of the search and rescue robot (a sketch of this scoring rule and of the cache playback experience pool is given after this procedure):
if the search and rescue robot coordinates in the next state S* are within the search and rescue range and a survivor exists in the next environment observation value O*, acquiring a preset bonus score;
if the search and rescue robot coordinates in the next state S* are within the search and rescue range, judging whether a collision occurs according to the search and rescue robot coordinates and the obstacle coordinates or the coordinates of other search and rescue robots; if a collision occurs, deducting a preset collision score from the bonus score to obtain the score R, and setting the termination state to termination;
if the search and rescue robot coordinates in the next state S* are outside the search and rescue range, setting the termination state to termination.
5. Saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R and the termination state E as a cache playback array, denoted {φ(S), φ(S*), a, R, E};
6. Storing the cache playback array into a pre-built cache playback experience pool D, and repeating the above steps until the number of cache playback arrays in the cache playback experience pool D reaches a preset number;
7. Randomly selecting T cache playback arrays from the cache playback experience pool D to generate a model training sample set D_T.
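The following Python sketch illustrates one possible realisation of steps 4–7 above: the scoring and termination rules and the cache playback experience pool D from which the training set D_T is sampled. The bonus and collision constants, the search-range size, the collision distance and the use of raw coordinates as the feature vector φ are assumptions of this sketch rather than values fixed by the method.

import random
from collections import deque

# Assumed constants; the method fixes only the rules, not the numbers.
BONUS = 10.0        # preset bonus score for observing a survivor
COLLISION = 5.0     # preset collision score deducted on a crash
RANGE = 20          # side length of the square search and rescue range
TOUCH_DIST = 1.0    # distance below which two coordinates count as touching

def in_range(pos):
    x, y = pos
    return 0 <= x < RANGE and 0 <= y < RANGE

def collides(pos, others):
    return any(abs(pos[0] - ox) + abs(pos[1] - oy) < TOUCH_DIST for ox, oy in others)

def score_and_terminate(next_pos, survivor_seen, obstacles, other_robots):
    """Step 4: score R and termination state E from the next state S* and observation O*."""
    R, E = 0.0, False
    if in_range(next_pos):
        if survivor_seen:
            R += BONUS                                   # survivor in O*: add the bonus score
        if collides(next_pos, obstacles) or collides(next_pos, other_robots):
            R -= COLLISION                               # collision: deduct and terminate
            E = True
    else:
        E = True                                         # left the search and rescue range
    return R, E

# Steps 5-7: cache playback experience pool D and the sampled training set D_T.
pool_D = deque(maxlen=100_000)

def store(phi_S, phi_S_next, a, R, E):
    pool_D.append((phi_S, phi_S_next, a, R, E))          # one cache playback array {φ(S), φ(S*), a, R, E}

def sample_D_T(T):
    return random.sample(list(pool_D), T)                # T randomly selected arrays

# Example: a robot moves to (3, 4), observes a survivor and touches nothing.
R, E = score_and_terminate((3, 4), True, obstacles=[(10, 10)], other_robots=[(0, 0)])
store(phi_S=(3, 3), phi_S_next=(3, 4), a=0, R=R, E=E)    # φ taken as the raw coordinates here

Steps 1–3 (reading the state, choosing an action strategy from the set A and driving the robot) would be supplied by the simulation environment and the action network; only the bookkeeping around them is shown.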
(3) Training and updating the automatic search and rescue strategy model through the model training sample set, substituting the updated automatic search and rescue strategy model for the initialized automatic search and rescue strategy model, and iterating the above steps;
(4) If the preset maximum iteration number is reached, training is completed.
Specifically, training and updating the automatic search and rescue strategy model through the model training sample set includes the following steps (a code sketch of one possible realisation of these steps is given after this list):
1. According to the model training sample set D_T of the i-th search and rescue robot, calculate the target reward value y_i^t of the t-th cache playback array of the i-th search and rescue robot:
y_i^t = R_t + γQ_i′(φ(S_t*), π′(φ(S_t*)), ω′)
wherein i = 1, 2, 3 … p, t = 1, 2, 3 … T, R_t is the score of the t-th cache playback array, γ is the discount factor, π′(·) is the policy action generated by the target action network, ω′ is the weight parameter of the target evaluation network, Q_i′(·) is the evaluation value generated by the target evaluation network, and φ(S_t*) is the feature vector of the next state following the current state S_t of the t-th cache playback array;
2. Linearly add the target reward values y_i^t of the t-th cache playback array of all search and rescue robots to obtain the joint target reward value y^t of all search and rescue robots:
y^t = Σ_{i=1…p} y_i^t
3. Construct a first loss function based on the target reward value y^t, and update the weight parameter ω of the real evaluation network through gradient back propagation of the neural network; the first loss function is:
L(ω) = (1/T)·Σ_{t=1…T} ( y^t − Σ_{i=1…p} Q_i(φ(S_t), π(φ(S_t)), ω) )²
wherein π(·) is the policy action generated by the real action network, and Q_i(·) is the evaluation value generated by the real evaluation network;
4. Construct a second loss function based on the target reward value y^t, and update the weight parameter θ of the real action network through gradient back propagation of the neural network; the second loss function is:
J(θ) = −(1/T)·Σ_{t=1…T} Σ_{i=1…p} Q_i(φ(S_t), π(φ(S_t), θ), ω)
5. If t % C = 0, update the weight parameters ω′ and θ′ of the target evaluation network and the target action network according to the weight parameters ω and θ of the real evaluation network and the real action network:
ω′←τω+(1-τ)ω′
θ′←τθ+(1-τ)θ′
wherein C is the update frequency of the weight parameters of the target action network and the target evaluation network, and τ is the update coefficient;
6. For each search and rescue robot, rank the target reward values y_i^t obtained from the model training sample set D_T from high to low, and reassign the selection probabilities ε of the action strategies a_i^t corresponding to the target reward values y_i^t from low to high according to the ranking result;
7. Update the strategy model based on the updated weight parameters ω, θ, ω′, θ′ of the real action network, the real evaluation network, the target action network and the target evaluation network, and the updated selection probability ε.
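A compact PyTorch sketch of one possible realisation of steps 1–7 follows. It assumes one small multilayer-perceptron action network ("actor") and one evaluation network ("critic") per robot, discrete actions represented as softmax probabilities, terminal transitions masked with (1 − E), and the soft target update applied on every call (i.e. C = 1); these choices, together with the network sizes and learning rates, are assumptions of this sketch. Only the overall structure — per-robot target reward values, their linear VDN addition, a joint first loss for ω, a second loss for θ, and the soft updates of ω′ and θ′ — follows the steps above.

import torch
import torch.nn as nn
import torch.nn.functional as F

P, STATE_DIM, ACT_DIM, T_BATCH = 3, 8, 4, 32   # illustrative sizes
GAMMA, TAU = 0.95, 0.01                        # discount factor γ and update coefficient τ

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(), nn.Linear(64, out))

# One real/target action network and one real/target evaluation network per robot i.
actors      = [mlp(STATE_DIM, ACT_DIM) for _ in range(P)]
actors_tgt  = [mlp(STATE_DIM, ACT_DIM) for _ in range(P)]
critics     = [mlp(STATE_DIM + ACT_DIM, 1) for _ in range(P)]
critics_tgt = [mlp(STATE_DIM + ACT_DIM, 1) for _ in range(P)]
for net, tgt in zip(actors + critics, actors_tgt + critics_tgt):
    tgt.load_state_dict(net.state_dict())

actor_opt  = torch.optim.Adam([w for n in actors  for w in n.parameters()], lr=1e-3)
critic_opt = torch.optim.Adam([w for n in critics for w in n.parameters()], lr=1e-3)

def q(net, s, a):
    return net(torch.cat([s, a], dim=-1)).squeeze(-1)

def update(phi_S, phi_S_next, acts, R, E):
    """One update from a batch of T cache playback arrays.
    phi_S, phi_S_next: [P, T, STATE_DIM]; acts: [P, T, ACT_DIM]; R, E: [T]."""
    # Steps 1-2: per-robot target reward value y_i^t, then linear VDN addition over robots.
    with torch.no_grad():
        y_i = torch.stack([
            R + GAMMA * (1.0 - E) * q(critics_tgt[i], phi_S_next[i],
                                      torch.softmax(actors_tgt[i](phi_S_next[i]), -1))
            for i in range(P)])
        y_joint = y_i.sum(dim=0)                                   # y^t = Σ_i y_i^t

    # Step 3: first loss, back-propagated into the real evaluation networks (ω).
    q_joint = torch.stack([q(critics[i], phi_S[i], acts[i]) for i in range(P)]).sum(0)
    critic_loss = F.mse_loss(q_joint, y_joint)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Step 4: second loss, maximising the summed evaluation of the real action networks (θ).
    actor_loss = -torch.stack([
        q(critics[i], phi_S[i], torch.softmax(actors[i](phi_S[i]), -1))
        for i in range(P)]).sum(0).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Step 5: soft update ω' ← τω + (1-τ)ω', θ' ← τθ + (1-τ)θ' (done every call here, i.e. C = 1).
    for net, tgt in zip(actors + critics, actors_tgt + critics_tgt):
        for w, w_tgt in zip(net.parameters(), tgt.parameters()):
            w_tgt.data.mul_(1.0 - TAU).add_(TAU * w.data)

    return y_i                                                     # per-robot targets, ranked in step 6

# Example call on randomly generated data shaped like a sampled batch D_T.
y = update(torch.randn(P, T_BATCH, STATE_DIM), torch.randn(P, T_BATCH, STATE_DIM),
           torch.softmax(torch.randn(P, T_BATCH, ACT_DIM), -1),
           torch.randn(T_BATCH), torch.zeros(T_BATCH))

The per-robot targets returned here are what step 6 ranks when reassigning the selection probabilities ε; a sketch of that reassignment is given after this embodiment's summary.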
In this embodiment:
(1) In the reinforcement learning process of multiple search and rescue robots, a scoring priority rule is set: the Q values obtained by all the search and rescue robots are ranked from high to low, and selection probabilities are assigned from low to high, which effectively solves the inertia problem in the learning process of multiple search and rescue robots (a small sketch of this reassignment follows).
(2) In the reinforcement learning process of multiple search and rescue robots, the VDN (Value Decomposition Network) algorithm structure is applied to linearly add the reward values of the individual search and rescue robots, and the sum is used as the basis for each robot to execute its own actions, which effectively solves the spurious reward problem in the learning process of multiple search and rescue robots.
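The scoring priority rule of point (1) can be sketched as follows: one robot's target reward values are ranked from high to low and the corresponding action strategies receive selection probabilities from low to high. The rank-proportional probability schedule used here is an assumption of this sketch; the method fixes only the ordering.

import numpy as np

def reassign_selection_probabilities(y_i):
    """y_i: target reward values of one robot's T cache playback arrays.
    Returns one selection probability per corresponding action strategy:
    the highest-valued entries get the lowest probabilities and vice versa."""
    order = np.argsort(-y_i)               # indices from highest to lowest target reward value
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(y_i))     # rank 0 = highest target reward value
    weights = ranks + 1.0                  # low rank (high value) -> small weight
    return weights / weights.sum()         # selection probabilities ε, assigned low to high

probs = reassign_selection_probabilities(np.array([2.0, 5.0, 1.0, 3.0]))
print(probs)   # [0.3 0.1 0.4 0.2]: the strategy with value 5.0 gets the smallest probability

Giving lower-valued strategies more chances to be selected keeps every robot exploring instead of coasting on the joint reward earned by its teammates, which is the inertia problem described above.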
Embodiment two:
the embodiment of the invention provides a search and rescue device of a search and rescue robot, which comprises:
the information acquisition module is used for initializing the self state of the search and rescue robot when the search and rescue instruction is acquired;
the strategy generation module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
and initializing an automatic search and rescue strategy model and training based on a training task.
Embodiment III:
based on the first embodiment, the embodiment of the invention provides a search and rescue device of a search and rescue robot, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is operative according to the instructions to perform steps according to the method described above.
Embodiment four:
based on an embodiment, the present invention provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the above-mentioned method.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (9)

1. The search and rescue method of the search and rescue robot is characterized by comprising the following steps of:
when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot;
generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
initializing an automatic search and rescue strategy model and training based on a training task;
wherein the obtaining the model training sample set includes:
acquiring the current state S and the environment observation value O of the search and rescue robot;
selecting an action strategy a from the action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
driving the search and rescue robot to perform automatic search and rescue in the simulation environment according to the action strategy a, and acquiring the next state S* of the search and rescue robot and the next environment observation value O*;
obtaining a score R and a termination state E of the action strategy a according to the next state S* and the next environment observation value O* of the search and rescue robot;
saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R and the termination state E as a cache playback array, denoted {φ(S), φ(S*), a, R, E};
storing the cache playback array into a pre-built cache playback experience pool D, and repeating the above steps until the number of cache playback arrays in the cache playback experience pool D reaches a preset number;
randomly selecting T cache playback arrays from the cache playback experience pool D to generate a model training sample set D_T.
2. The search and rescue method of a search and rescue robot according to claim 1, wherein the training task is to set a search and rescue range in the simulation environment, configure p search and rescue robots, m survivors and n obstacles within the search and rescue range, and generate a new obstacle at an arbitrary position in the search and rescue range at preset time intervals; the search and rescue robots search for survivors within the search and rescue range while avoiding collisions with obstacles and with other search and rescue robots.
3. A search and rescue method for a search and rescue robot as defined in claim 1, wherein initializing an automatic search and rescue strategy model and training based on a training task comprises:
initializing the automatic search and rescue strategy model, including initializing the real action network, the target action network, the real evaluation network and the target evaluation network of the automatic search and rescue strategy model;
executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
training and updating the automatic search and rescue strategy model through the model training sample set, substituting the updated automatic search and rescue strategy model for the initialized automatic search and rescue strategy model, and iterating the above steps;
if the preset maximum iteration number is reached, training is completed.
4. A search and rescue method for a search and rescue robot according to claim 1, wherein the state includes search and rescue robot coordinates, and the environmental observations include obstacles, survivors, and other search and rescue robots within a preset range around the search and rescue robot coordinates.
5. The search and rescue method of a search and rescue robot according to claim 4, wherein obtaining the score R and the termination state E of the action strategy a according to the next state S* and the next environment observation value O* of the search and rescue robot includes:
if the search and rescue robot coordinates in the next state S* are within the search and rescue range and a survivor exists in the next environment observation value O*, acquiring a preset bonus score;
if the search and rescue robot coordinates in the next state S* are within the search and rescue range, judging whether a collision occurs according to the search and rescue robot coordinates and the obstacle coordinates or the coordinates of other search and rescue robots; if a collision occurs, deducting a preset collision score from the bonus score to obtain the score R, and setting the termination state to termination;
if the search and rescue robot coordinates in the next state S* are outside the search and rescue range, setting the termination state to termination.
6. A search and rescue method for a search and rescue robot as defined in claim 4, wherein training and updating the automatic search and rescue strategy model through the model training sample set comprises:
according to the model training sample set D_T of the i-th search and rescue robot, calculating the target reward value y_i^t of the t-th cache playback array of the i-th search and rescue robot:
y_i^t = R_t + γQ_i′(φ(S_t*), π′(φ(S_t*)), ω′)
wherein i = 1, 2, 3 … p, t = 1, 2, 3 … T, R_t is the score of the t-th cache playback array, γ is the discount factor, π′(·) is the policy action generated by the target action network, ω′ is the weight parameter of the target evaluation network, Q_i′(·) is the evaluation value generated by the target evaluation network, and φ(S_t*) is the feature vector of the next state following the current state S_t of the t-th cache playback array;
linearly adding the target reward values y_i^t of the t-th cache playback array of all search and rescue robots to obtain the joint target reward value y^t of all search and rescue robots:
y^t = Σ_{i=1…p} y_i^t
constructing a first loss function based on the target reward value y^t, and updating the weight parameter ω of the real evaluation network through gradient back propagation of the neural network; the first loss function is:
L(ω) = (1/T)·Σ_{t=1…T} ( y^t − Σ_{i=1…p} Q_i(φ(S_t), π(φ(S_t)), ω) )²
wherein π(·) is the policy action generated by the real action network, and Q_i(·) is the evaluation value generated by the real evaluation network;
constructing a second loss function based on the target reward value y^t, and updating the weight parameter θ of the real action network through gradient back propagation of the neural network; the second loss function is:
J(θ) = −(1/T)·Σ_{t=1…T} Σ_{i=1…p} Q_i(φ(S_t), π(φ(S_t), θ), ω)
if t % C = 0, updating the weight parameters ω′ and θ′ of the target evaluation network and the target action network according to the weight parameters ω and θ of the real evaluation network and the real action network:
ω′←τω+(1-τ)ω′
θ′←τθ+(1-τ)θ′
wherein C is the update frequency of the weight parameters of the target action network and the target evaluation network, and τ is the update coefficient;
for each search and rescue robot, ranking the target reward values y_i^t obtained from the model training sample set D_T from high to low, and reassigning the selection probabilities ε of the action strategies a_i^t corresponding to the target reward values y_i^t from low to high according to the ranking result;
updating the strategy model based on the updated weight parameters ω, θ, ω′, θ′ of the real action network, the real evaluation network, the target action network and the target evaluation network, and the updated selection probability ε.
7. A search and rescue device of a search and rescue robot, the device comprising:
the information acquisition module is used for initializing the self state of the search and rescue robot when the search and rescue instruction is acquired;
the strategy generation module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
initializing an automatic search and rescue strategy model and training based on a training task;
wherein the obtaining the model training sample set includes:
acquiring the current state S and the environment observation value O of the search and rescue robot;
selecting an action strategy a from the action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
driving the search and rescue robot to perform automatic search and rescue in the simulation environment according to the action strategy a, and acquiring the next state S* of the search and rescue robot and the next environment observation value O*;
obtaining a score R and a termination state E of the action strategy a according to the next state S* and the next environment observation value O* of the search and rescue robot;
saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R and the termination state E as a cache playback array, denoted {φ(S), φ(S*), a, R, E};
storing the cache playback array into a pre-built cache playback experience pool D, and repeating the above steps until the number of cache playback arrays in the cache playback experience pool D reaches a preset number;
randomly selecting T cache playback arrays from the cache playback experience pool D to generate a model training sample set D_T.
8. The search and rescue device of the search and rescue robot is characterized by comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor being operative according to the instructions to perform the steps of the method according to any one of claims 1-6.
9. Computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any of claims 1-6.
CN202210328204.2A 2022-03-31 2022-03-31 Search and rescue method and device of search and rescue robot and storage medium Active CN114770497B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210328204.2A CN114770497B (en) 2022-03-31 2022-03-31 Search and rescue method and device of search and rescue robot and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210328204.2A CN114770497B (en) 2022-03-31 2022-03-31 Search and rescue method and device of search and rescue robot and storage medium

Publications (2)

Publication Number Publication Date
CN114770497A CN114770497A (en) 2022-07-22
CN114770497B (en) 2024-02-02

Family

ID=82427854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210328204.2A Active CN114770497B (en) 2022-03-31 2022-03-31 Search and rescue method and device of search and rescue robot and storage medium

Country Status (1)

Country Link
CN (1) CN114770497B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9811074B1 (en) * 2016-06-21 2017-11-07 TruPhysics GmbH Optimization of robot control programs in physics-based simulated environment
CN109733415A (en) * 2019-01-08 2019-05-10 同济大学 A kind of automatic Pilot following-speed model that personalizes based on deeply study
US10792810B1 (en) * 2017-12-14 2020-10-06 Amazon Technologies, Inc. Artificial intelligence system for learning robotic control policies
CN111984018A (en) * 2020-09-25 2020-11-24 斑马网络技术有限公司 Automatic driving method and device
CN113031528A (en) * 2021-02-25 2021-06-25 电子科技大学 Multi-legged robot motion control method based on depth certainty strategy gradient
CN113276883A (en) * 2021-04-28 2021-08-20 南京大学 Unmanned vehicle driving strategy planning method based on dynamic generation environment and implementation device


Also Published As

Publication number Publication date
CN114770497A (en) 2022-07-22

Similar Documents

Publication Publication Date Title
US11779837B2 (en) Method, apparatus, and device for scheduling virtual objects in virtual environment
CN112015174B (en) Multi-AGV motion planning method, device and system
CN109690576A (en) The training machine learning model in multiple machine learning tasks
Earl et al. A decomposition approach to multi-vehicle cooperative control
CN109523029A (en) For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN109740741B (en) Reinforced learning method combined with knowledge transfer and learning method applied to autonomous skills of unmanned vehicles
CN107563653A (en) Multi-robot full-coverage task allocation method
CN115562357A (en) Intelligent path planning method for unmanned aerial vehicle cluster
CN114770497B (en) Search and rescue method and device of search and rescue robot and storage medium
CN115906673B (en) Combat entity behavior model integrated modeling method and system
KR101139259B1 (en) Heap-based multi-agent system for the theater level, mission level or the engagement level simulation
Gao et al. An adaptive framework to select the coordinate systems for evolutionary algorithms
Ponsini et al. Analysis of soccer robot behaviors using time petri nets
Leonard et al. Bootstrapped Neuro-Simulation as a method of concurrent neuro-evolution and damage recovery
Yuan et al. Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks
CN114118441A (en) Online planning method based on efficient search strategy under uncertain environment
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial
Eker et al. A finite horizon dec-pomdp approach to multi-robot task learning
Shiltagh et al. A comparative study: Modified particle swarm optimization and modified genetic algorithm for global mobile robot navigation
Schubert et al. Decision support for crowd control: Using genetic algorithms with simulation to learn control strategies
Hart et al. Dante agent architecture for force-on-force wargame simulation and training
de Carvalho Santos et al. A hybrid ga-ann approach for autonomous robots topological navigation
CN116227361B (en) Intelligent body decision method and device
Sinclair et al. A generic cognitive architecture framework with personality and emotions for crowd simulation
Cruz-Álvarez et al. Robotic behavior implementation using two different differential evolution variants

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant