CN114770497A - Search and rescue method and device of search and rescue robot and storage medium - Google Patents

Search and rescue method and device of search and rescue robot and storage medium

Info

Publication number
CN114770497A
CN114770497A CN202210328204.2A CN202210328204A
Authority
CN
China
Prior art keywords
search
rescue
robot
strategy
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210328204.2A
Other languages
Chinese (zh)
Other versions
CN114770497B (en)
Inventor
林泽阳
赖俊
陈希亮
王军
刘志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202210328204.2A priority Critical patent/CN114770497B/en
Publication of CN114770497A publication Critical patent/CN114770497A/en
Application granted granted Critical
Publication of CN114770497B publication Critical patent/CN114770497B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 - Programme-controlled manipulators
    • B25J 9/16 - Programme controls
    • B25J 9/1679 - Programme controls characterised by the tasks executed

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)

Abstract

The invention provides a search and rescue method and device of a search and rescue robot and a storage medium, wherein the method comprises the following steps: when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot; generating an automatic search and rescue strategy from the self state based on the trained search and rescue strategy model; and executing search and rescue actions according to the automatic search and rescue strategy. The training of the automatic search and rescue strategy model comprises: constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment; constructing an automatic search and rescue strategy model based on a VDN algorithm; and initializing the automatic search and rescue strategy model and training it based on the training task. The invention can improve the learning efficiency of the reinforcement learning of the search and rescue robot and meet the real-time requirement of the search and rescue robot in actual tasks.

Description

Search and rescue method and device of search and rescue robot and storage medium
Technical Field
The invention relates to a search and rescue method and device of a search and rescue robot and a storage medium, and belongs to the technical field of unmanned driving.
Background
The search and rescue robot is an intelligent robot that can take the place of search and rescue personnel on the front line and perform dangerous tasks such as personnel rescue and information detection when emergencies such as urban natural disasters, chemical explosions and fires occur. When disasters such as earthquakes, fires, chemical explosions and nuclear explosions occur, the building structures at the rescue site are extremely unstable and secondary disasters may occur at any time, posing great risks to the lives and health of rescuers. An intelligent search and rescue robot based on deep reinforcement learning can enter narrow gaps to search for signs of life and detect field information according to expert instructions, adjust its search and rescue strategy according to real-time observation of the field environment, and avoid damage to itself. It is an important branch and development direction of current intelligent robot applications and is of great significance to the intelligent development of rescue work.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a search and rescue method and device of a search and rescue robot and a storage medium, which can improve the learning efficiency of the reinforcement learning of the search and rescue robot and meet the real-time requirement of the search and rescue robot in actual tasks.
To achieve the above purpose, the invention adopts the following technical solutions:
in a first aspect, the present invention provides a search and rescue method for a search and rescue robot, including:
when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot;
generating an automatic search and rescue strategy from the self state based on the trained search and rescue strategy model;
executing search and rescue actions according to the automatic search and rescue strategy;
wherein the training of the automatic search and rescue strategy model comprises:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
initializing the automatic search and rescue strategy model and training based on the training task.
Optionally, the training task is to set a search and rescue range in the simulation environment, configure p search and rescue robots, m survivors, and n obstacles in the search and rescue range, and generate a new obstacle at any position every preset time period in the search and rescue range; the search and rescue robot searches for survivors in the search and rescue range, and simultaneously avoids touching obstacles and other search and rescue robots.
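For illustration only, the training task described above could be parameterised along the following lines in Python; the grid size, the counts p, m, n and the spawn interval shown here are placeholder values, not values specified by the invention.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SearchRescueTask:
    """Hypothetical parameterisation of the training task described above."""
    width: int = 50                  # assumed extent of the search and rescue range
    height: int = 50
    num_robots: int = 3              # p search and rescue robots
    num_survivors: int = 5           # m survivors
    num_obstacles: int = 10          # n obstacles
    spawn_interval: int = 20         # preset time period (in steps) between new obstacles
    obstacles: List[Tuple[int, int]] = field(default_factory=list)
    survivors: List[Tuple[int, int]] = field(default_factory=list)

    def reset(self) -> None:
        """Place the initial obstacles and survivors at random positions in the range."""
        self.obstacles = [self._random_cell() for _ in range(self.num_obstacles)]
        self.survivors = [self._random_cell() for _ in range(self.num_survivors)]

    def maybe_spawn_obstacle(self, t: int) -> None:
        """Generate a new obstacle at an arbitrary position every spawn_interval steps."""
        if t > 0 and t % self.spawn_interval == 0:
            self.obstacles.append(self._random_cell())

    def _random_cell(self) -> Tuple[int, int]:
        return (random.randrange(self.width), random.randrange(self.height))
```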
Optionally, initializing the automatic search and rescue strategy model and training based on the training task includes:
initializing an automatic search and rescue strategy model, including initializing a real action network, a target action network, a real evaluation network and a target evaluation network of the automatic search and rescue strategy model;
executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
training and updating the automatic search and rescue strategy model with the model training sample set, substituting the updated automatic search and rescue strategy model for the initialized automatic search and rescue strategy model, and returning to the above steps for iteration;
and finishing the training when the preset maximum number of iterations is reached.
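For illustration, the overall iteration described above (initialize the networks, collect a sample set, update the model, repeat until the preset maximum number of iterations) might be organised as in the following sketch; `model` and `env` and their methods are assumed interfaces rather than components defined by the invention.

```python
def train_strategy_model(model, env, max_iterations, samples_per_iteration):
    """Outer training loop sketched from the steps above (assumed interfaces).

    Assumed interface:
      model.initialize()             builds the real/target action and evaluation networks
      model.collect_samples(env, n)  runs the training task in simulation, returns n replay arrays
      model.update(samples)          trains on the sample set and returns the updated model
    """
    model.initialize()
    for _ in range(max_iterations):                 # stop at the preset maximum number of iterations
        samples = model.collect_samples(env, samples_per_iteration)
        model = model.update(samples)               # the updated model replaces the previous one
    return model
```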
Optionally, the obtaining of the model training sample set includes:
acquiring a current state S and an environment observation value O of the search and rescue robot;
selecting an action strategy a from an action set A based on a current state S and an environment observation value O according to an initialized automatic search and rescue strategy model;
according to the action strategy a, driving the search and rescue robot to search and rescue automatically in the simulation environment, and obtaining the next state S* and the next environment observation value O* of the search and rescue robot;
according to the next state S* and environment observation value O* of the search and rescue robot, acquiring the score R and the termination state E of the action strategy a;
saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R and the termination state E as a cache replay array, denoted {φ(S), φ(S*), a, R, E};
storing the cache replay array into a pre-constructed cache replay experience pool D, and repeating the above steps until the number of cache replay arrays in the cache replay experience pool D reaches a preset number;
randomly selecting T cache replay arrays from the cache replay experience pool D to generate a model training sample set D_T.
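For illustration only, the sample-collection procedure above might be sketched as follows; `env`, `policy` and the feature map `phi` are assumed interfaces, and the tuple layout simply mirrors the cache replay array {φ(S), φ(S*), a, R, E}.

```python
import random
from collections import deque

def collect_training_samples(env, policy, phi, pool_size, sample_count_T):
    """Fill a cache replay experience pool D and draw T arrays as the training set D_T.

    Assumed interfaces:
      env.observe()                -> (S, O)  current state and environment observation value
      env.step(a)                  -> (S_next, O_next, R, E)  next state/observation, score, termination flag
      policy.select_action(S, O)   -> a       action strategy chosen from the action set A
      phi(S)                       -> feature vector of a state
    """
    D = deque(maxlen=pool_size)                       # cache replay experience pool D
    while len(D) < pool_size:
        S, O = env.observe()
        a = policy.select_action(S, O)
        S_next, O_next, R, E = env.step(a)
        D.append((phi(S), phi(S_next), a, R, E))      # one cache replay array
        if E:                                         # restart an episode on a termination state
            env.reset()
    return random.sample(list(D), sample_count_T)     # model training sample set D_T
```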
Optionally, the state includes search and rescue robot coordinates, and the environment observation value includes obstacles, survivors, and other search and rescue robots within a preset range around the search and rescue robot coordinates.
Optionally, acquiring the score R and the termination state E of the action strategy a according to the next state S* and environment observation value O* of the search and rescue robot includes:
if the coordinates of the search and rescue robot in the next state S* lie within the search and rescue range and a survivor exists in the next environment observation value O*, acquiring a preset reward point;
if the coordinates of the search and rescue robot in the next state S* lie within the search and rescue range, judging whether a collision occurs according to the coordinates of the search and rescue robot and the coordinates of obstacles or of other search and rescue robots; if a collision occurs, deducting a preset collision score from the reward points to obtain the score R, and setting the termination state to terminated;
if the coordinates of the search and rescue robot in the next state S* are outside the search and rescue range, setting the termination state to terminated.
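A minimal sketch of these scoring rules, under assumed helper predicates and placeholder point values (none of which are specified by the invention), might look like this:

```python
def score_and_terminate(next_state, next_obs, in_range, collided,
                        reward_point=10.0, collision_penalty=5.0):
    """Return (score R, termination state E) for an action, following the rules above.

    Assumed helpers:
      in_range(next_state)  -> True if the robot coordinates lie within the search and rescue range
      collided(next_state)  -> True if the robot coordinates coincide with an obstacle or another robot
      next_obs.get("survivor_found") -> True if a survivor appears in the next observation O*
    """
    R, E = 0.0, False
    if not in_range(next_state):
        return R, True                       # outside the search and rescue range: terminate
    if next_obs.get("survivor_found", False):
        R += reward_point                    # preset reward point for finding a survivor
    if collided(next_state):
        R -= collision_penalty               # deduct the preset collision score
        E = True                             # a collision terminates the episode
    return R, E
```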
Optionally, the training and updating the automatic search and rescue strategy model through the model training sample set includes:
according to the model training sample set D_T^i of the i-th search and rescue robot, calculating the target reward value y_t^i of the t-th cache replay array of the i-th search and rescue robot:

y_t^i = R_t + γ·Q_i′(S_{t+1}, π′(S_{t+1}), ω′)

wherein i = 1, 2, 3, …, p, t = 1, 2, 3, …, T, R_t is the score of the t-th cache replay array, γ is a discount factor, π′(·) is the policy action generated by the target action network, ω′ is the weight parameter of the target evaluation network, Q_i′(·) is the evaluation value generated by the target evaluation network, and S_{t+1} is the next state following the current state S_t of the t-th cache replay array;

linearly adding the target reward values y_t^i of the t-th cache replay array of all the search and rescue robots to obtain the target reward value y_t^tot of all the search and rescue robots:

y_t^tot = Σ_{i=1}^{p} y_t^i;

constructing a first loss function based on the target reward value y_t^tot, and updating the weight parameter ω of the real evaluation network through gradient back propagation of the neural network; the first loss function is:

L(ω) = (1/T)·Σ_{t=1}^{T} ( y_t^tot − Σ_{i=1}^{p} Q_i(S_t, π(S_t), ω) )²,

wherein π(·) is the policy action generated by the real action network and Q_i(·) is the evaluation value generated by the real evaluation network;

constructing a second loss function based on the target reward value y_t^tot, and updating the weight parameter θ of the real action network through gradient back propagation of the neural network; the second loss function is:

J(θ) = −(1/T)·Σ_{t=1}^{T} Σ_{i=1}^{p} Q_i(S_t, π(S_t, θ), ω);

if t % C = 0, updating the weight parameters ω′ and θ′ of the target action network and the target evaluation network according to the weight parameters ω and θ of the real action network and the real evaluation network:

ω′ ← τω + (1 − τ)ω′
θ′ ← τθ + (1 − τ)θ′

wherein C is the update frequency of the weight parameters of the target action network and the target evaluation network, and τ is an update coefficient;

sorting the target reward values y_t^i obtained by each search and rescue robot over the model training sample set D_T from high to low, and reassigning, according to the sorting result, the selection probabilities ε of the corresponding action strategies a_t^i from low to high;

and updating the strategy model based on the updated weight parameters ω, θ, ω′, θ′ of the real action network, the real evaluation network, the target action network and the target evaluation network and on the selection probability ε.
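For concreteness, one possible PyTorch sketch of the update described above (per-robot target reward values, VDN-style linear addition, the two losses and the soft update) is given below; the tensor layout, the network interfaces, the optimizer grouping and the continuous action parameterisation are assumptions for illustration, not the invention's implementation, and the priority-based reassignment of ε is omitted here.

```python
import torch

def vdn_update(batch, actors, critics, target_actors, target_critics,
               actor_opt, critic_opt, gamma=0.99, tau=0.01):
    """One update in the spirit of the procedure above (assumed interfaces).

    batch: dict of tensors
      "phi_S"      [T, p, state_dim]  feature vectors of current states
      "phi_S_next" [T, p, state_dim]  feature vectors of next states
      "a"          [T, p, act_dim]    action strategies taken (continuous parameterisation assumed)
      "R"          [T]                scores
    actors, critics, target_actors, target_critics: lists of p torch.nn.Module networks,
      where critics[i](state, action) returns a [T, 1] evaluation value.
    actor_opt / critic_opt: optimizers over all actors' / all critics' parameters.
    """
    p = len(actors)
    R = batch["R"]

    # Per-robot target reward values y_t^i, then their linear addition y_t^tot (VDN).
    with torch.no_grad():
        y_list = []
        for i in range(p):
            s_next = batch["phi_S_next"][:, i]
            a_next = target_actors[i](s_next)                                   # π'(S_{t+1})
            y_list.append(R + gamma * target_critics[i](s_next, a_next).squeeze(-1))
        y_tot = torch.stack(y_list, dim=0).sum(dim=0)

    # First loss: pull the summed real evaluation values towards y_tot (updates ω).
    q_tot = sum(critics[i](batch["phi_S"][:, i], batch["a"][:, i]).squeeze(-1)
                for i in range(p))
    critic_loss = torch.mean((y_tot - q_tot) ** 2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Second loss: raise the evaluation of the actions proposed by the real action networks (updates θ).
    q_pi = sum(critics[i](batch["phi_S"][:, i], actors[i](batch["phi_S"][:, i])).squeeze(-1)
               for i in range(p))
    actor_loss = -q_pi.mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks: ω' ← τω + (1-τ)ω', θ' ← τθ + (1-τ)θ'.
    for net, tgt in list(zip(actors, target_actors)) + list(zip(critics, target_critics)):
        for param, tgt_param in zip(net.parameters(), tgt.parameters()):
            tgt_param.data.copy_(tau * param.data + (1.0 - tau) * tgt_param.data)
```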
In a second aspect, the present invention provides a search and rescue apparatus for a search and rescue robot, the apparatus including:
the information acquisition module is used for initializing the self state of the search and rescue robot when the search and rescue instruction is acquired;
the strategy generation module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to the automatic search and rescue strategy;
wherein the training of the automatic search and rescue strategy model comprises:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
initializing the automatic search and rescue strategy model and training based on the training task.
In a third aspect, the invention provides a search and rescue device of a search and rescue robot, comprising a processor and a storage medium;
the storage medium is to store instructions;
the processor is configured to operate in accordance with the instructions to perform the steps according to the above-described method.
In a fourth aspect, the invention provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above-mentioned method.
Compared with the prior art, the invention has the following beneficial effects:
According to the search and rescue method and device of the search and rescue robot and the storage medium of the invention, a scoring priority rule is set in the reinforcement learning process: all the search and rescue robots are ranked from high to low by the Q values they obtain and are assigned selection probabilities from low to high, which effectively solves the inertia problem in the learning of multiple search and rescue robots. In addition, a VDN algorithm structure is applied to linearly add the reward values of the individual search and rescue robots, and the sum is used as the basis for each robot to execute its own action, which effectively solves the false reward problem in the learning of multiple search and rescue robots. In conclusion, the invention can improve the learning efficiency of the reinforcement learning of the search and rescue robot and meet the real-time requirement of the search and rescue robot in actual tasks.
Drawings
Fig. 1 is a flowchart of a search and rescue method of a search and rescue robot according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Embodiment 1:
as shown in fig. 1, an embodiment of the present invention provides a search and rescue method for a search and rescue robot, including the following steps:
1. when a search and rescue instruction is obtained, initializing the self state of the search and rescue robot;
2. generating an automatic search and rescue strategy from the self state based on the trained search and rescue strategy model;
3. executing search and rescue actions according to the automatic search and rescue strategy;
wherein, the training of the automatic search and rescue strategy model comprises the following steps:
s1, constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
specifically, the training task is to set a search and rescue range in the simulation environment, configure p search and rescue robots, m survivors and n obstacles in the search and rescue range, and generate a new obstacle at any position every preset time period in the search and rescue range; the search and rescue robot searches for survivors in the search and rescue range, and simultaneously avoids touching obstacles and other search and rescue robots.
And S2, constructing an automatic search and rescue strategy model based on a VDN algorithm.
S3, initializing the automatic search and rescue strategy model and training based on the training task;
the training comprises the following steps:
(1) initializing an automatic search and rescue strategy model, including initializing a real action network, a target action network, a real evaluation network and a target evaluation network of the automatic search and rescue strategy model;
(2) executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
specifically, obtaining the model training sample set includes:
1. acquiring a current state S and an environmental observation value O of the search and rescue robot;
the state includes search and rescue robot coordinate, and the environment observation value includes barrier, survivor and other search and rescue robots of predetermineeing the within range around the search and rescue robot coordinate.
2. Selecting an action strategy a from an action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
the action set a includes, but is not limited to, forward action, reverse action, left turn action, right turn action.
3. According to the action strategy a, the search and rescue robot is driven to search and rescue automatically in the simulation environment, and the next state S* and the next environment observation value O* of the search and rescue robot are obtained;
4. According to the next state S* and environment observation value O* of the search and rescue robot, the score R and the termination state E of the action strategy a are acquired;
if the coordinates of the search and rescue robot in the next state S* lie within the search and rescue range and a survivor exists in the next environment observation value O*, a preset reward point is acquired;
if the coordinates of the search and rescue robot in the next state S* lie within the search and rescue range, whether a collision occurs is judged according to the coordinates of the search and rescue robot and the coordinates of obstacles or of other search and rescue robots; if a collision occurs, a preset collision score is deducted from the reward points to obtain the score R, and the termination state is set to terminated;
if the coordinates of the search and rescue robot in the next state S* are outside the search and rescue range, the termination state is set to terminated.
5. The feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R and the termination state E are saved as a cache replay array, denoted {φ(S), φ(S*), a, R, E};
6. The cache replay array is stored into a pre-constructed cache replay experience pool D, and the above steps are repeated until the number of cache replay arrays in the cache replay experience pool D reaches a preset number;
7. T cache replay arrays are randomly selected from the cache replay experience pool D to generate a model training sample set D_T.
(3) Training and updating the automatic search and rescue strategy model with the model training sample set, substituting the updated automatic search and rescue strategy model for the initialized automatic search and rescue strategy model, and returning to the above steps for iteration;
(4) Finishing the training when the preset maximum number of iterations is reached.
Specifically, the training and updating of the automatic search and rescue strategy model through the model training sample set comprises the following steps:
1. According to the model training sample set D_T^i of the i-th search and rescue robot, calculate the target reward value y_t^i of the t-th cache replay array of the i-th search and rescue robot:

y_t^i = R_t + γ·Q_i′(S_{t+1}, π′(S_{t+1}), ω′)

wherein i = 1, 2, 3, …, p, t = 1, 2, 3, …, T, R_t is the score of the t-th cache replay array, γ is a discount factor, π′(·) is the policy action generated by the target action network, ω′ is the weight parameter of the target evaluation network, Q_i′(·) is the evaluation value generated by the target evaluation network, and S_{t+1} is the next state following the current state S_t of the t-th cache replay array.

2. Linearly add the target reward values y_t^i of the t-th cache replay array of all the search and rescue robots to obtain the target reward value y_t^tot of all the search and rescue robots:

y_t^tot = Σ_{i=1}^{p} y_t^i

3. Based on the target reward value y_t^tot, construct a first loss function and update the weight parameter ω of the real evaluation network through gradient back propagation of the neural network; the first loss function is:

L(ω) = (1/T)·Σ_{t=1}^{T} ( y_t^tot − Σ_{i=1}^{p} Q_i(S_t, π(S_t), ω) )²,

wherein π(·) is the policy action generated by the real action network and Q_i(·) is the evaluation value generated by the real evaluation network.

4. Based on the target reward value y_t^tot, construct a second loss function and update the weight parameter θ of the real action network through gradient back propagation of the neural network; the second loss function is:

J(θ) = −(1/T)·Σ_{t=1}^{T} Σ_{i=1}^{p} Q_i(S_t, π(S_t, θ), ω)

5. If t % C = 0, update the weight parameters ω′ and θ′ of the target action network and the target evaluation network according to the weight parameters ω and θ of the real action network and the real evaluation network:

ω′ ← τω + (1 − τ)ω′
θ′ ← τθ + (1 − τ)θ′

wherein C is the update frequency of the weight parameters of the target action network and the target evaluation network, and τ is an update coefficient.

6. Sort the target reward values y_t^i obtained by each search and rescue robot over the model training sample set D_T from high to low, and reassign, according to the sorting result, the selection probabilities ε of the corresponding action strategies a_t^i from low to high.

7. Update the strategy model based on the updated weight parameters ω, θ, ω′, θ′ of the real action network, the real evaluation network, the target action network and the target evaluation network and on the selection probability ε.
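As an illustrative sketch of step 6 (the scoring priority rule), the robots can be ranked by the target reward values they obtained and handed selection probabilities ε in the opposite order; the dictionary interface and the example probability values below are assumptions, not part of the invention.

```python
from typing import Dict, List

def reassign_selection_probabilities(target_values: Dict[int, float],
                                     probabilities: List[float]) -> Dict[int, float]:
    """Rank robots by their obtained target reward value (high to low) and hand out
    selection probabilities epsilon from low to high, so that robots that already
    score well are reselected less often and lagging robots more often.

    target_values: robot index -> target reward value obtained over D_T
    probabilities: candidate epsilon values, assumed sorted from low to high,
                   one per robot (e.g. [0.05, 0.10, 0.20] for p = 3 robots).
    """
    ranked = sorted(target_values, key=target_values.get, reverse=True)  # high -> low
    return {robot: eps for robot, eps in zip(ranked, probabilities)}

# Example usage (illustrative numbers only):
# reassign_selection_probabilities({0: 12.4, 1: 3.1, 2: 7.8}, [0.05, 0.10, 0.20])
# -> {0: 0.05, 2: 0.10, 1: 0.20}
```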
In this embodiment,
(1) in the process of reinforcement learning by multiple search and rescue robots, a scoring priority rule is set so that all the search and rescue robots are ranked from high to low by the Q values they obtain and are assigned selection probabilities from low to high, which effectively solves the inertia problem in the learning of multiple search and rescue robots;
(2) in the process of reinforcement learning by multiple search and rescue robots, a VDN (Value Decomposition Network) structure is applied to linearly add the reward values of the individual search and rescue robots, and the sum is used as the basis for each robot to execute its own action, which effectively solves the false reward problem in the learning of multiple search and rescue robots.
Embodiment 2:
the embodiment of the invention provides a search and rescue device of a search and rescue robot, which comprises:
the information acquisition module is used for initializing the self state of the search and rescue robot when the search and rescue instruction is acquired;
the strategy generation module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to the automatic search and rescue strategy;
wherein, the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
and initializing the automatic search and rescue strategy model and training based on the training task.
Embodiment 3:
based on the first embodiment, the embodiment of the invention provides a search and rescue device of a search and rescue robot, which comprises a processor and a storage medium, wherein the processor is used for processing a search and rescue signal;
a storage medium to store instructions;
the processor is configured to operate in accordance with instructions to perform steps in accordance with the above-described method.
Embodiment 4:
based on the first embodiment, the present invention provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program is configured to implement the steps of the method when executed by a processor.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, it is possible to make various improvements and modifications without departing from the technical principle of the present invention, and those improvements and modifications should be considered as the protection scope of the present invention.

Claims (10)

1. A search and rescue method of a search and rescue robot is characterized by comprising the following steps:
when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot;
generating an automatic search and rescue strategy from the self state based on the trained search and rescue strategy model;
executing search and rescue actions according to the automatic search and rescue strategy;
wherein the training of the automatic search and rescue strategy model comprises:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
and initializing the automatic search and rescue strategy model and training based on the training task.
2. The search and rescue method for search and rescue robots according to claim 1, characterized in that the training task is to set a search and rescue range in a simulation environment, to configure p search and rescue robots, m survivors, n obstacles in the search and rescue range, and to generate a new obstacle at any position every preset time period in the search and rescue range; the search and rescue robot searches for survivors in the search and rescue range, and simultaneously avoids touching obstacles and other search and rescue robots.
3. The search and rescue method of a search and rescue robot as claimed in claim 1, wherein initializing the automatic search and rescue strategy model and training based on the training task includes:
initializing an automatic search and rescue strategy model, including initializing a real action network, a target action network, a real evaluation network and a target evaluation network of the automatic search and rescue strategy model;
executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
training and updating the automatic search and rescue strategy model with the model training sample set, substituting the updated automatic search and rescue strategy model for the initialized automatic search and rescue strategy model, and returning to the above steps for iteration;
and finishing the training when the preset maximum number of iterations is reached.
4. The method as claimed in claim 1, wherein the obtaining of the set of model training samples comprises:
acquiring a current state S and an environmental observation value O of the search and rescue robot;
selecting an action strategy a from an action set A based on a current state S and an environment observation value O according to an initialized automatic search and rescue strategy model;
according to the action strategy a, driving the search and rescue robot to search and rescue automatically in the simulation environment, and obtaining the next state S* and the next environment observation value O* of the search and rescue robot;
according to the next state S* and environment observation value O* of the search and rescue robot, acquiring the score R and the termination state E of the action strategy a;
saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R and the termination state E as a cache replay array, denoted {φ(S), φ(S*), a, R, E};
storing the cache replay array into a pre-constructed cache replay experience pool D, and repeating the above steps until the number of cache replay arrays in the cache replay experience pool D reaches a preset number;
randomly selecting T cache replay arrays from the cache replay experience pool D to generate a model training sample set D_T.
5. The method as claimed in claim 4, wherein the state includes search and rescue robot coordinates, and the environmental observation value includes obstacles, survivors and other search and rescue robots within a preset range around the search and rescue robot coordinates.
6. A search and rescue method for a search and rescue robot as claimed in claim 5, characterized in that acquiring the score R and the termination state E of the action strategy a according to the next state S* and environment observation value O* of the search and rescue robot comprises the following steps:
if the coordinates of the search and rescue robot in the next state S* lie within the search and rescue range and a survivor exists in the next environment observation value O*, acquiring a preset reward point;
if the coordinates of the search and rescue robot in the next state S* lie within the search and rescue range, judging whether a collision occurs according to the coordinates of the search and rescue robot and the coordinates of obstacles or of other search and rescue robots, and if a collision occurs, deducting a preset collision score from the reward points to obtain the score R and setting the termination state to terminated;
if the coordinates of the search and rescue robot in the next state S* are outside the search and rescue range, setting the termination state to terminated.
7. The method as claimed in claim 5, wherein the training and updating of the automatic search and rescue strategy model by the model training sample set comprises:
calculating, according to the model training sample set D_T^i of the i-th search and rescue robot, the target reward value y_t^i of the t-th cache replay array of the i-th search and rescue robot:

y_t^i = R_t + γ·Q_i′(S_{t+1}, π′(S_{t+1}), ω′)

wherein i = 1, 2, 3, …, p, t = 1, 2, 3, …, T, R_t is the score of the t-th cache replay array, γ is a discount factor, π′(·) is the policy action generated by the target action network, ω′ is the weight parameter of the target evaluation network, Q_i′(·) is the evaluation value generated by the target evaluation network, and S_{t+1} is the next state following the current state S_t of the t-th cache replay array;

linearly adding the target reward values y_t^i of the t-th cache replay array of all the search and rescue robots to obtain the target reward value y_t^tot of all the search and rescue robots:

y_t^tot = Σ_{i=1}^{p} y_t^i;

constructing a first loss function based on the target reward value y_t^tot, and updating the weight parameter ω of the real evaluation network through gradient back propagation of the neural network, the first loss function being:

L(ω) = (1/T)·Σ_{t=1}^{T} ( y_t^tot − Σ_{i=1}^{p} Q_i(S_t, π(S_t), ω) )²,

wherein π(·) is the policy action generated by the real action network and Q_i(·) is the evaluation value generated by the real evaluation network;

constructing a second loss function based on the target reward value y_t^tot, and updating the weight parameter θ of the real action network through gradient back propagation of the neural network, the second loss function being:

J(θ) = −(1/T)·Σ_{t=1}^{T} Σ_{i=1}^{p} Q_i(S_t, π(S_t, θ), ω);

if t % C = 0, updating the weight parameters ω′ and θ′ of the target action network and the target evaluation network according to the weight parameters ω and θ of the real action network and the real evaluation network:

ω′ ← τω + (1 − τ)ω′
θ′ ← τθ + (1 − τ)θ′

wherein C is the update frequency of the weight parameters of the target action network and the target evaluation network, and τ is an update coefficient;

sorting the target reward values y_t^i obtained by each search and rescue robot over the model training sample set D_T from high to low, and reassigning, according to the sorting result, the selection probabilities ε of the corresponding action strategies a_t^i from low to high;

and updating the strategy model based on the updated weight parameters ω, θ, ω′, θ′ of the real action network, the real evaluation network, the target action network and the target evaluation network and on the selection probability ε.
8. A search and rescue apparatus of a search and rescue robot, the apparatus comprising:
the information acquisition module is used for initializing the self state of the search and rescue robot when acquiring the search and rescue instruction;
the strategy generating module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to the automatic search and rescue strategy;
wherein the training of the automatic search and rescue strategy model comprises:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
initializing the automatic search and rescue strategy model and training based on the training task.
9. A search and rescue device of a search and rescue robot is characterized by comprising a processor and a storage medium;
the storage medium is to store instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 1 to 7.
10. Computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210328204.2A 2022-03-31 2022-03-31 Search and rescue method and device of search and rescue robot and storage medium Active CN114770497B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210328204.2A CN114770497B (en) 2022-03-31 2022-03-31 Search and rescue method and device of search and rescue robot and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210328204.2A CN114770497B (en) 2022-03-31 2022-03-31 Search and rescue method and device of search and rescue robot and storage medium

Publications (2)

Publication Number Publication Date
CN114770497A true CN114770497A (en) 2022-07-22
CN114770497B CN114770497B (en) 2024-02-02

Family

ID=82427854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210328204.2A Active CN114770497B (en) 2022-03-31 2022-03-31 Search and rescue method and device of search and rescue robot and storage medium

Country Status (1)

Country Link
CN (1) CN114770497B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9811074B1 (en) * 2016-06-21 2017-11-07 TruPhysics GmbH Optimization of robot control programs in physics-based simulated environment
US10792810B1 (en) * 2017-12-14 2020-10-06 Amazon Technologies, Inc. Artificial intelligence system for learning robotic control policies
CN109733415A (en) * 2019-01-08 2019-05-10 同济大学 A kind of automatic Pilot following-speed model that personalizes based on deeply study
CN111984018A (en) * 2020-09-25 2020-11-24 斑马网络技术有限公司 Automatic driving method and device
CN113031528A (en) * 2021-02-25 2021-06-25 电子科技大学 Multi-legged robot motion control method based on depth certainty strategy gradient
CN113276883A (en) * 2021-04-28 2021-08-20 南京大学 Unmanned vehicle driving strategy planning method based on dynamic generation environment and implementation device

Also Published As

Publication number Publication date
CN114770497B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
Orozco-Rosas et al. Mobile robot path planning using membrane evolutionary artificial potential field
Yuan et al. Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks
CN112362066A (en) Path planning method based on improved deep reinforcement learning
Niroui et al. Robot exploration in unknown cluttered environments when dealing with uncertainty
Xiao et al. Multigoal visual navigation with collision avoidance via deep reinforcement learning
Liu et al. Episodic memory-based robotic planning under uncertainty
Sheh et al. Behavioural cloning for driving robots over rough terrain
CN114770497A (en) Search and rescue method and device of search and rescue robot and storage medium
Zhang et al. Auto-conditioned recurrent mixture density networks for learning generalizable robot skills
Ponsini et al. Analysis of soccer robot behaviors using time petri nets
Bar et al. Deep Reinforcement Learning Approach with adaptive reward system for robot navigation in Dynamic Environments
Cabreira et al. An evolutionary learning approach for robot path planning with fuzzy obstacle detection and avoidance in a multi-agent environment
CN114118441A (en) Online planning method based on efficient search strategy under uncertain environment
Nguyen et al. A broad-persistent advising approach for deep interactive reinforcement learning in robotic environments
Dudarenko et al. Reinforcement Learning Approach for Navigation of Ground Robotic Platform in Statically and Dynamically Generated Environments
Shiltagh et al. A comparative study: Modified particle swarm optimization and modified genetic algorithm for global mobile robot navigation
Cruz-Álvarez et al. Robotic behavior implementation using two different differential evolution variants
de Carvalho Santos et al. A hybrid ga-ann approach for autonomous robots topological navigation
Das et al. Improved real time A*-fuzzy controller for improving multi-robot navigation and its performance analysis
Yonemoto GA-based action learning
CN115933734A (en) Multi-machine exploration method and system under energy constraint based on deep reinforcement learning
Lamini et al. Q-Free Walk Ant Hybrid Architecture for Mobile Robot Path Planning in Dynamic Environment
Via et al. Autonomous Robot Path Planning Using Ant Colony Optimization and Evolutionary Programming
Parker et al. Cyclic genetic algorithms for evolving multi-loop control programs

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant