CN114770497B - Search and rescue method and device of search and rescue robot and storage medium - Google Patents
- Publication number
- CN114770497B (application CN202210328204.2A)
- Authority
- CN
- China
- Prior art keywords
- search
- rescue
- robot
- strategy
- action
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1679—Programme controls characterised by the tasks executed
Abstract
The invention provides a search and rescue method, a device, and a storage medium for a search and rescue robot, wherein the method comprises the following steps: when a search and rescue instruction is acquired, initializing the state of the search and rescue robot; generating an automatic search and rescue strategy from that state based on a trained search and rescue strategy model; and executing search and rescue actions according to the automatic search and rescue strategy. Training the automatic search and rescue strategy model comprises: constructing a search and rescue simulation environment and formulating a training task according to the simulation environment; constructing an automatic search and rescue strategy model based on the VDN algorithm; and initializing the automatic search and rescue strategy model and training it on the training task. The invention improves the learning efficiency of reinforcement learning for search and rescue robots and meets the real-time requirements of search and rescue robots in real tasks.
Description
Technical Field
The invention relates to a search and rescue method, a device, and a storage medium for a search and rescue robot, and belongs to the technical field of unmanned operation.
Background
A search and rescue robot is an intelligent robot that can replace human rescuers in dangerous tasks such as personnel rescue and information detection deep inside a disaster site during emergencies such as urban natural disasters, chemical explosions, and fires. When disasters such as earthquakes, fires, chemical explosions, or nuclear explosions occur, the building structures at the rescue site are extremely unstable and secondary disasters can occur at any time, posing great risks to the life and health of rescue workers. An intelligent search and rescue robot based on deep reinforcement learning can search for vital signs in crevices according to expert instructions, detect on-site information, adjust its search and rescue strategy according to real-time observation of the on-site environment, and avoid damage to itself. Such robots are an important branch and development direction of current intelligent robot applications and are of great significance for the intelligent development of rescue work.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a search and rescue method, a device, and a storage medium for a search and rescue robot, which improve the learning efficiency of reinforcement learning for search and rescue robots and meet the real-time requirements of search and rescue robots in real tasks.
To achieve the above purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a search and rescue method of a search and rescue robot, including:
when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot;
generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
and initializing an automatic search and rescue strategy model and training based on a training task.
Optionally, the training task is to set a search and rescue range in the simulation environment, configure p search and rescue robots, m survivors, and n obstacles within the search and rescue range, and generate a new obstacle at a random position within the search and rescue range at preset time intervals; the search and rescue robots search for survivors within the search and rescue range while avoiding collisions with obstacles and with one another.
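The training task above can be sketched as a minimal grid world. This is an illustrative simplification, not the patent's simulation environment: all class and parameter names (`SearchRescueEnv`, `spawn_every`, the grid size) are assumptions, and the obstacle-spawning rule is reduced to "one new obstacle at a random free cell every `spawn_every` steps".

```python
import random

class SearchRescueEnv:
    """Minimal grid-world sketch of the training task (names illustrative):
    p robots, m survivors, and n obstacles inside a square search and rescue
    range; a new obstacle appears every `spawn_every` steps."""

    def __init__(self, size=10, p=2, m=3, n=4, spawn_every=5, seed=0):
        self.rng = random.Random(seed)
        self.size, self.spawn_every, self.step_count = size, spawn_every, 0
        cells = [(x, y) for x in range(size) for y in range(size)]
        self.rng.shuffle(cells)
        self.robots = cells[:p]                        # p search and rescue robots
        self.survivors = set(cells[p:p + m])           # m survivors
        self.obstacles = set(cells[p + m:p + m + n])   # n initial obstacles

    def step(self):
        # Every `spawn_every` steps, a new obstacle appears at a random free cell.
        self.step_count += 1
        if self.step_count % self.spawn_every == 0:
            free = [(x, y) for x in range(self.size) for y in range(self.size)
                    if (x, y) not in self.obstacles and (x, y) not in self.robots]
            if free:
                self.obstacles.add(self.rng.choice(free))

env = SearchRescueEnv()
for _ in range(10):
    env.step()
```

After 10 steps with `spawn_every=5`, two additional obstacles have been spawned on top of the initial four.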
Optionally, initializing the automatic search and rescue strategy model and training based on the training task includes:
initializing an automatic search and rescue strategy model, wherein the automatic search and rescue strategy model comprises initializing a real action network, a target action network, a real evaluation network and a target evaluation network of the automatic search and rescue strategy model;
executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
training and updating the automatic search and rescue strategy model with the model training sample set, substituting the updated model for the initialized model, and repeating the above steps;
if the preset maximum iteration number is reached, training is completed.
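The four training steps above reduce to a collect-then-update loop. The sketch below is a structural outline only; `env_rollout` and `update` are hypothetical stand-ins for the patent's sample collection and VDN update, and the model is a placeholder dictionary rather than the four networks.

```python
def train(env_rollout, update, max_iters=3):
    """Skeleton of steps (1)-(4): roll out the current model to collect a
    training sample set, update the model on it, and iterate until the
    preset maximum number of iterations is reached."""
    model = {"iteration": 0}            # placeholder for the four networks
    for _ in range(max_iters):
        samples = env_rollout(model)    # step (2): collect the sample set D_T
        model = update(model, samples)  # step (3): train, then swap in the update
    return model                        # step (4): training complete

# Toy rollout/update functions, purely to exercise the loop.
model = train(lambda m: [m["iteration"]],
              lambda m, s: {"iteration": m["iteration"] + 1})
```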
Optionally, the obtaining the model training sample set includes:
acquiring the current state S and the environment observation value O of the search and rescue robot;
selecting an action strategy a from the action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
driving the search and rescue robot to search and rescue automatically in the simulation environment according to the action strategy a, and acquiring the next state S* and environmental observation O* of the search and rescue robot;
according to the next state S* and environmental observation O* of the search and rescue robot, obtaining the score R and the termination state E of the action strategy a;
saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R, and the termination state E as a cache playback array, denoted {φ(S), φ(S*), a, R, E};
storing the cache playback array in a pre-built cache playback experience pool D, and repeating the above steps until the number of cache playback arrays in the pool D reaches a preset number;
randomly selecting T cache playback arrays from the cache playback experience pool D to generate a model training sample set D_T.
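The cache playback experience pool D is a standard replay buffer. A minimal sketch, with illustrative names and a bounded capacity that the patent does not specify:

```python
import random
from collections import deque

class ReplayPool:
    """Sketch of the cache playback experience pool D: store
    {phi(S), phi(S*), a, R, E} arrays, then sample T of them uniformly."""

    def __init__(self, capacity=1000, seed=0):
        self.pool = deque(maxlen=capacity)  # oldest arrays are evicted when full
        self.rng = random.Random(seed)

    def store(self, phi_s, phi_s_next, a, r, done):
        self.pool.append((phi_s, phi_s_next, a, r, done))

    def sample(self, T):
        # D_T: T cache playback arrays drawn at random without replacement
        return self.rng.sample(list(self.pool), T)

pool = ReplayPool()
for k in range(50):
    pool.store((k,), (k + 1,), a=0, r=1.0, done=False)
batch = pool.sample(8)
```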
Optionally, the state includes search and rescue robot coordinates, and the environmental observations include obstacles, survivors, and other search and rescue robots within a preset range around the search and rescue robot coordinates.
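A sketch of extracting that environmental observation: entities within a preset range of the robot's coordinates. The function name, the Chebyshev-distance neighborhood, and the radius value are all illustrative assumptions, not from the patent.

```python
def observe(robot_xy, obstacles, survivors, others, radius=2):
    """Sketch of the environmental observation O: obstacles, survivors, and
    other robots within a preset (Chebyshev) radius of the robot's coordinates."""
    def near(p):
        return max(abs(p[0] - robot_xy[0]), abs(p[1] - robot_xy[1])) <= radius
    return {
        "obstacles": sorted(p for p in obstacles if near(p)),
        "survivors": sorted(p for p in survivors if near(p)),
        "robots": sorted(p for p in others if near(p)),
    }

obs = observe((5, 5), obstacles={(6, 6), (0, 0)},
              survivors={(4, 5)}, others={(9, 9)})
```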
Optionally, according to the next state S* and environmental observation O* of the search and rescue robot, obtaining the score R and the termination state E of the action strategy a includes:
if the search and rescue robot coordinates in the next state S* are within the search and rescue range and a survivor is present in the next environmental observation O*, acquiring preset bonus points;
if the search and rescue robot coordinates in the next state S* are within the search and rescue range, judging whether a collision occurs according to the search and rescue robot coordinates and the coordinates of obstacles or of other search and rescue robots; if a collision occurs, deducting a preset collision score from the bonus points to obtain the score R, and setting the termination state to terminated;
if the search and rescue robot coordinates in the next state S* are outside the search and rescue range, setting the termination state to terminated.
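The scoring rule above can be sketched as a single function. The numeric bonus and penalty values are illustrative placeholders; the patent only says "preset".

```python
def score_and_done(observed_survivor, collided, in_range,
                   bonus=10.0, collision_penalty=5.0):
    """Sketch of obtaining score R and termination state E: bonus points for
    observing a survivor inside the range, a collision penalty plus
    termination on impact, and termination on leaving the range.
    Numeric values are illustrative, not from the patent."""
    r, done = 0.0, False
    if in_range:
        if observed_survivor:
            r += bonus                # preset bonus points
        if collided:
            r -= collision_penalty    # deduct preset collision score
            done = True               # termination state set to terminated
    else:
        done = True                   # out of range: episode terminates
    return r, done

r, done = score_and_done(observed_survivor=True, collided=False, in_range=True)
```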
Optionally, the training and updating the automatic search and rescue strategy model through the model training sample set includes:
according to the model training sample set D_T^i of the i-th search and rescue robot, calculating the target reward value y_t^i of the t-th cache playback array of the i-th search and rescue robot:

y_t^i = R_t + γ·Q_i′(S_{t+1}, π′(O_{t+1}); ω′)

wherein i = 1, 2, 3 … p, t = 1, 2, 3 … T, R_t is the score of the t-th cache playback array, γ is the discount factor, π′(·) is the policy action generated by the target action network, ω′ is the weight parameter of the target evaluation network, Q_i′(·) is the evaluation value generated by the target evaluation network, and S_{t+1} is the next state of the current state S_t of the t-th cache playback array;
the target reward values y_t^i of the t-th cache playback array of all search and rescue robots are linearly added to obtain the joint target reward value y_t of all search and rescue robots:

y_t = Σ_{i=1}^{p} y_t^i
based on the target reward value y_t, constructing a first loss function and updating the weight parameter ω of the real evaluation network through gradient back propagation of the neural network; the first loss function is:

L(ω) = (1/T)·Σ_{t=1}^{T} ( y_t − Σ_{i=1}^{p} Q_i(S_t, π(O_t); ω) )²

wherein π(·) is the policy action generated by the real action network and Q_i(·) is the evaluation value generated by the real evaluation network;
based on the target reward value y_t, constructing a second loss function and updating the weight parameter θ of the real action network through gradient back propagation of the neural network; the second loss function is:

J(θ) = −(1/T)·Σ_{t=1}^{T} Σ_{i=1}^{p} Q_i(S_t, π(O_t; θ); ω)
if t % C = 0, updating the weight parameter ω′ of the target evaluation network and the weight parameter θ′ of the target action network according to the weight parameter ω of the real evaluation network and the weight parameter θ of the real action network:
ω′ ← τω + (1−τ)ω′
θ′ ← τθ + (1−τ)θ′
wherein C is the update period of the weight parameters of the target action network and the target evaluation network, and τ is the update coefficient;
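The soft update ω′ ← τω + (1−τ)ω′ applied every C steps can be sketched directly; parameters are plain lists of floats here rather than network weights, purely for illustration.

```python
def soft_update(target, online, tau=0.01):
    """Sketch of the target-network update w' <- tau*w + (1-tau)*w',
    performed every C training steps. `target` and `online` are flat
    lists of parameters standing in for the network weights."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]

target = [0.0, 0.0]
online = [1.0, 2.0]
target = soft_update(target, online, tau=0.5)
```

A small τ (e.g. 0.01) makes the target networks track the real networks slowly, which stabilizes the bootstrapped target reward values.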
for the model training sample set D_T of each search and rescue robot, ranking the obtained target reward values y_t^i from high to low, and, based on the ranking result, reassigning the selection probabilities ε of the corresponding action strategies a_t^i from low to high;
updating the strategy model based on the updated weight parameters ω, θ, ω′, and θ′ of the real action network, the real evaluation network, the target action network, and the target evaluation network, and the updated selection probability ε.
In a second aspect, the present invention provides a search and rescue device of a search and rescue robot, the device comprising:
the information acquisition module is used for initializing the self state of the search and rescue robot when the search and rescue instruction is acquired;
the strategy generation module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
and initializing an automatic search and rescue strategy model and training based on a training task.
In a third aspect, the invention provides a search and rescue device of a search and rescue robot, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate according to the instructions to perform the steps of the method described above.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor implements the steps of the above method.
Compared with the prior art, the invention has the beneficial effects that:
according to the search and rescue method, device, and storage medium for search and rescue robots of the invention, during reinforcement learning a scoring priority rule orders the obtained Q values of all search and rescue robots from high to low and assigns selection probabilities from low to high, which effectively solves the inertia problem in the learning process of multiple search and rescue robots. In addition, the VDN algorithm structure linearly adds the individual reward values of the search and rescue robots, and the sum serves as the basis for each robot to execute its own actions, which effectively solves the spurious reward problem in the learning process of multiple search and rescue robots. Together, these measures improve the learning efficiency of reinforcement learning for search and rescue robots and meet the real-time requirements of search and rescue robots in real tasks.
Drawings
Fig. 1 is a flowchart of a search and rescue method of a search and rescue robot according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
Embodiment one:
as shown in fig. 1, the embodiment of the invention provides a search and rescue method of a search and rescue robot, which comprises the following steps:
1. when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot;
2. generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
3. executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
s1, constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
specifically, the training task is to set a search and rescue range in the simulation environment, configure p search and rescue robots, m survivors, and n obstacles within the search and rescue range, and generate a new obstacle at a random position within the search and rescue range at preset time intervals; the search and rescue robots search for survivors within the search and rescue range while avoiding collisions with obstacles and with one another.
S2, constructing an automatic search and rescue strategy model based on a VDN algorithm.
S3, initializing an automatic search and rescue strategy model and training based on a training task;
the training comprises the following steps:
(1) Initializing an automatic search and rescue strategy model, wherein the automatic search and rescue strategy model comprises initializing a real action network, a target action network, a real evaluation network and a target evaluation network of the automatic search and rescue strategy model;
(2) Executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
specifically, obtaining the model training sample set includes:
1. acquiring the current state S and the environment observation value O of the search and rescue robot;
the states include search and rescue robot coordinates, and the environmental observations include obstacles, survivors, and other search and rescue robots within a preset range around the search and rescue robot coordinates.
2. Selecting an action strategy a from the action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
the action set A includes, but is not limited to, a forward action, a backward action, a left-turn action, and a right-turn action.
3. Driving the search and rescue robot to search and rescue automatically in the simulation environment according to the action strategy a, and acquiring the next state S* and environmental observation O* of the search and rescue robot;
4. According to the next state S* and environmental observation O* of the search and rescue robot, obtaining the score R and the termination state E of the action strategy a;
if the search and rescue robot coordinates in the next state S* are within the search and rescue range and a survivor is present in the next environmental observation O*, acquiring preset bonus points;
if the search and rescue robot coordinates in the next state S* are within the search and rescue range, judging whether a collision occurs according to the search and rescue robot coordinates and the coordinates of obstacles or of other search and rescue robots; if a collision occurs, deducting a preset collision score from the bonus points to obtain the score R, and setting the termination state to terminated;
if the search and rescue robot coordinates in the next state S* are outside the search and rescue range, setting the termination state to terminated.
5. Saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R, and the termination state E as a cache playback array, denoted {φ(S), φ(S*), a, R, E};
6. Storing the cache playback array in a pre-built cache playback experience pool D, and repeating the above steps until the number of cache playback arrays in the pool D reaches a preset number;
7. Randomly selecting T cache playback arrays from the cache playback experience pool D to generate a model training sample set D_T.
(3) Training and updating the automatic search and rescue strategy model with the model training sample set, substituting the updated model for the initialized model, and repeating the above steps;
(4) If the preset maximum iteration number is reached, training is completed.
Specifically, training and updating the automatic search and rescue strategy model through the model training sample set comprises the following steps:
1. According to the model training sample set D_T^i of the i-th search and rescue robot, calculating the target reward value y_t^i of the t-th cache playback array of the i-th search and rescue robot:

y_t^i = R_t + γ·Q_i′(S_{t+1}, π′(O_{t+1}); ω′)

wherein i = 1, 2, 3 … p, t = 1, 2, 3 … T, R_t is the score of the t-th cache playback array, γ is the discount factor, π′(·) is the policy action generated by the target action network, ω′ is the weight parameter of the target evaluation network, Q_i′(·) is the evaluation value generated by the target evaluation network, and S_{t+1} is the next state of the current state S_t of the t-th cache playback array;
2. The target reward values y_t^i of the t-th cache playback array of all search and rescue robots are linearly added to obtain the joint target reward value y_t of all search and rescue robots:

y_t = Σ_{i=1}^{p} y_t^i

3. Based on the target reward value y_t, constructing a first loss function and updating the weight parameter ω of the real evaluation network through gradient back propagation of the neural network; the first loss function is:

L(ω) = (1/T)·Σ_{t=1}^{T} ( y_t − Σ_{i=1}^{p} Q_i(S_t, π(O_t); ω) )²

wherein π(·) is the policy action generated by the real action network and Q_i(·) is the evaluation value generated by the real evaluation network;
4. Based on the target reward value y_t, constructing a second loss function and updating the weight parameter θ of the real action network through gradient back propagation of the neural network; the second loss function is:

J(θ) = −(1/T)·Σ_{t=1}^{T} Σ_{i=1}^{p} Q_i(S_t, π(O_t; θ); ω)

5. If t % C = 0, updating the weight parameter ω′ of the target evaluation network and the weight parameter θ′ of the target action network according to the weight parameter ω of the real evaluation network and the weight parameter θ of the real action network:
ω′ ← τω + (1−τ)ω′
θ′ ← τθ + (1−τ)θ′
wherein C is the update period of the weight parameters of the target action network and the target evaluation network, and τ is the update coefficient;
6. For the model training sample set D_T of each search and rescue robot, ranking the obtained target reward values y_t^i from high to low, and, based on the ranking result, reassigning the selection probabilities ε of the corresponding action strategies a_t^i from low to high;
7. Updating the strategy model based on the updated weight parameters ω, θ, ω′, and θ′ of the real action network, the real evaluation network, the target action network, and the target evaluation network, and the updated selection probability ε.
In this embodiment:
(1) During the reinforcement learning of multiple search and rescue robots, a scoring priority rule orders the obtained Q values of all search and rescue robots from high to low and assigns selection probabilities from low to high, which effectively solves the inertia problem in the learning process of multiple search and rescue robots.
(2) During the reinforcement learning of multiple search and rescue robots, the VDN (Value Decomposition Network) algorithm structure linearly adds the individual reward values of the search and rescue robots, and the sum serves as the basis for each robot to execute its own actions, which effectively solves the spurious reward problem in the learning process of multiple search and rescue robots.
Embodiment two:
the embodiment of the invention provides a search and rescue device of a search and rescue robot, which comprises:
the information acquisition module is used for initializing the self state of the search and rescue robot when the search and rescue instruction is acquired;
the strategy generation module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
and initializing an automatic search and rescue strategy model and training based on a training task.
Embodiment III:
based on the first embodiment, the embodiment of the invention provides a search and rescue device of a search and rescue robot, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate according to the instructions to perform the steps of the method described above.
Embodiment four:
based on the first embodiment, the present invention provides a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above method.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.
Claims (9)
1. The search and rescue method of the search and rescue robot is characterized by comprising the following steps of:
when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot;
generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
initializing an automatic search and rescue strategy model and training based on a training task;
wherein the obtaining the model training sample set includes:
acquiring the current state S and the environment observation value O of the search and rescue robot;
selecting an action strategy a from the action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
driving the search and rescue robot to search and rescue automatically in the simulation environment according to the action strategy a, and acquiring the next state S* and environmental observation O* of the search and rescue robot;
according to the next state S* and environmental observation O* of the search and rescue robot, obtaining the score R and the termination state E of the action strategy a;
saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R, and the termination state E as a cache playback array, denoted {φ(S), φ(S*), a, R, E};
storing the cache playback array in a pre-built cache playback experience pool D, and repeating the above steps until the number of cache playback arrays in the pool D reaches a preset number;
randomly selecting T cache playback arrays from the cache playback experience pool D to generate a model training sample set D_T.
2. The search and rescue method of a search and rescue robot according to claim 1, wherein the training task is to set a search and rescue range in the simulation environment, configure p search and rescue robots, m survivors, and n obstacles within the search and rescue range, and generate a new obstacle at a random position within the search and rescue range at preset time intervals; the search and rescue robots search for survivors within the search and rescue range while avoiding collisions with obstacles and with one another.
3. A search and rescue method for a search and rescue robot as defined in claim 1, wherein initializing an automatic search and rescue strategy model and training based on a training task comprises:
initializing an automatic search and rescue strategy model, wherein the automatic search and rescue strategy model comprises initializing a real action network, a target action network, a real evaluation network and a target evaluation network of the automatic search and rescue strategy model;
executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
training and updating the automatic search and rescue strategy model with the model training sample set, and substituting the updated automatic search and rescue strategy model for the initialized automatic search and rescue strategy model in the above steps for the next iteration;
if the preset maximum iteration number is reached, training is completed.
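The initialize–collect–update iteration of claim 3 reduces to a simple loop skeleton; `collect` and `update` are hypothetical stand-ins for the sample-gathering and training steps, injected here only to show the control flow:

```python
def train(model, collect, update, max_iterations):
    """Repeat: gather a sample set with the current model, update the model,
    and substitute the updated model back in, until max_iterations is reached."""
    for _ in range(max_iterations):
        batch = collect(model)       # execute the training task, build D_T
        model = update(model, batch) # train on D_T, yielding the updated model
    return model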
4. A search and rescue method for a search and rescue robot according to claim 1, wherein the state includes search and rescue robot coordinates, and the environmental observations include obstacles, survivors, and other search and rescue robots within a preset range around the search and rescue robot coordinates.
5. The search and rescue method of a search and rescue robot according to claim 4, wherein obtaining the score R and the termination state E of the action strategy a according to the next state S* and environmental observation O* of the search and rescue robot comprises:
if the search and rescue robot coordinate in the next state S* is within the search and rescue range and a survivor is present in the next environmental observation O*, obtaining a preset bonus score;
if the search and rescue robot coordinate in the next state S* is within the search and rescue range, judging whether a collision has occurred from the search and rescue robot coordinate and the coordinates of obstacles or of other search and rescue robots; if a collision has occurred, deducting a preset collision score from the bonus score to obtain the score R, and setting the termination state to terminated;
if the search and rescue robot coordinate in the next state S* is outside the search and rescue range, setting the termination state to terminated.
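Claim 5's three cases can be expressed as one scoring function. The numeric bonus and penalty values below are assumptions, since the patent only calls them "preset":

```python
def score_and_termination(next_pos, search_range, survivor_seen, collided,
                          bonus=10.0, collision_penalty=5.0):
    """Return (score R, termination state E) for an action strategy.
    search_range is (xmin, ymin, xmax, ymax)."""
    x, y = next_pos
    xmin, ymin, xmax, ymax = search_range
    if not (xmin <= x <= xmax and ymin <= y <= ymax):
        return 0.0, True                     # out of the search and rescue range: terminate
    r = bonus if survivor_seen else 0.0      # preset bonus for observing a survivor
    if collided:                             # hit an obstacle or another robot
        return r - collision_penalty, True   # deduct the collision score and terminate
    return r, False
```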
6. A search and rescue method for a search and rescue robot as defined in claim 4, wherein training and updating the automatic search and rescue strategy model through the model training sample set comprises:
calculating, from the model training sample set D_T of the i-th search and rescue robot, the target reward value y_t^i of the t-th cache playback array:

y_t^i = R_t + γ · Q'_i(φ(S*_t), π'(φ(S*_t)); ω')

wherein i = 1, 2, 3 … p, t = 1, 2, 3 … T, R_t is the score of the t-th cache playback array, γ is a discount factor, π'(·) is the policy action generated by the target action network, ω' is the weight parameter of the target evaluation network, Q'_i(·) is the evaluation value generated by the target evaluation network, and φ(S*_t) is the feature vector of the next state following the current state S_t of the t-th cache playback array;
linearly adding the target reward values y_t^i of the t-th cache playback array over all search and rescue robots to obtain the total target reward value of all search and rescue robots, y_t = Σ_{i=1}^{p} y_t^i;
constructing a first loss function based on the target reward value y_t, and updating the weight parameter ω of the reality evaluation network through gradient back-propagation of the neural network; the first loss function is:

L(ω) = (1/T) Σ_{t=1}^{T} ( y_t − Σ_{i=1}^{p} Q_i(φ(S_t), π(φ(S_t); θ); ω) )²

wherein π(·) is the policy action generated by the reality action network and Q_i(·) is the evaluation value generated by the reality evaluation network;
constructing a second loss function based on the target reward value y_t, and updating the weight parameter θ of the reality action network through gradient back-propagation of the neural network; the second loss function is:

J(θ) = −(1/T) Σ_{t=1}^{T} Σ_{i=1}^{p} Q_i(φ(S_t), π(φ(S_t); θ); ω)
if t % C = 0, updating the weight parameters ω' and θ' of the target evaluation network and the target action network according to the weight parameters ω and θ of the reality evaluation network and the reality action network:
ω′←τω+(1-τ)ω′
θ′←τθ+(1-τ)θ′
wherein C is the update frequency of the weight parameters of the target action network and the target evaluation network, and tau is the update coefficient;
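The soft update ω′ ← τω + (1−τ)ω′ is applied elementwise over the network weights every C steps. A plain-list sketch (standing in for real network parameters):

```python
def soft_update(target_weights, source_weights, tau=0.01):
    """Blend reality-network weights into the target network: w' <- tau*w + (1-tau)*w'."""
    return [tau * w + (1.0 - tau) * wp
            for w, wp in zip(source_weights, target_weights)]
```

With small τ the target network trails the reality network slowly, which stabilizes the bootstrapped target reward values.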
for the model training sample set D_T of each search and rescue robot, ranking the obtained target reward values y_t^i from high to low, and reassigning, from low to high according to the ranking result, the selection probabilities ε of the action strategies a_t corresponding to the target reward values y_t^i;
updating the strategy model based on the updated weight parameters ω, θ, ω′, θ′ of the reality action network, the reality evaluation network, the target action network and the target evaluation network, and the selection probability ε.
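The per-agent targets and their linear (VDN-style) addition can be sketched as follows; zeroing the bootstrap term on termination is a standard convention assumed here, not stated in the claim:

```python
def vdn_target(scores, next_target_qs, terminated, gamma=0.95):
    """Total target reward value y_t = sum_i [ R_t + gamma * Q'_i(phi(S*_t), pi'(.)) ],
    linearly added over the p search and rescue robots."""
    per_agent = [r + (0.0 if terminated else gamma * q)
                 for r, q in zip(scores, next_target_qs)]
    return sum(per_agent)
```

This summation is what makes the algorithm a value-decomposition network: each robot's evaluation network contributes additively to one joint target.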
7. A search and rescue device of a search and rescue robot, the device comprising:
the information acquisition module is used for initializing the self state of the search and rescue robot when the search and rescue instruction is acquired;
the strategy generation module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to an automatic search and rescue strategy;
the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
initializing an automatic search and rescue strategy model and training based on a training task;
wherein the obtaining the model training sample set includes:
acquiring the current state S and the environment observation value O of the search and rescue robot;
selecting an action strategy a from the action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
driving the search and rescue robot to perform automatic search and rescue in the simulation environment according to the action strategy a, and acquiring the next state S* and environmental observation O* of the search and rescue robot;
obtaining a score R and a termination state E of the action strategy a according to the next state S* and environmental observation O* of the search and rescue robot;
saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R, and the termination state E as a cache playback array, denoted {φ(S), φ(S*), a, R, E};
Storing the cache playback array into a pre-built cache playback experience pool D, and repeating the steps until the cache playback array in the cache playback experience pool D reaches the preset number;
randomly selecting T cache playback arrays from the cache playback experience pool D to generate a model training sample set D_T.
8. A search and rescue device of a search and rescue robot, characterized by comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor being operative according to the instructions to perform the steps of the method according to any one of claims 1-6.
9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210328204.2A CN114770497B (en) | 2022-03-31 | 2022-03-31 | Search and rescue method and device of search and rescue robot and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114770497A CN114770497A (en) | 2022-07-22 |
CN114770497B true CN114770497B (en) | 2024-02-02 |
Family
ID=82427854
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210328204.2A Active CN114770497B (en) | 2022-03-31 | 2022-03-31 | Search and rescue method and device of search and rescue robot and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114770497B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9811074B1 (en) * | 2016-06-21 | 2017-11-07 | TruPhysics GmbH | Optimization of robot control programs in physics-based simulated environment |
CN109733415A (en) * | 2019-01-08 | 2019-05-10 | 同济大学 | A kind of automatic Pilot following-speed model that personalizes based on deeply study |
US10792810B1 (en) * | 2017-12-14 | 2020-10-06 | Amazon Technologies, Inc. | Artificial intelligence system for learning robotic control policies |
CN111984018A (en) * | 2020-09-25 | 2020-11-24 | 斑马网络技术有限公司 | Automatic driving method and device |
CN113031528A (en) * | 2021-02-25 | 2021-06-25 | 电子科技大学 | Multi-legged robot motion control method based on depth certainty strategy gradient |
CN113276883A (en) * | 2021-04-28 | 2021-08-20 | 南京大学 | Unmanned vehicle driving strategy planning method based on dynamic generation environment and implementation device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11779837B2 (en) | Method, apparatus, and device for scheduling virtual objects in virtual environment | |
CN112015174B (en) | Multi-AGV motion planning method, device and system | |
CN109690576A (en) | The training machine learning model in multiple machine learning tasks | |
Earl et al. | A decomposition approach to multi-vehicle cooperative control | |
CN109523029A (en) | For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body | |
CN109740741B (en) | Reinforced learning method combined with knowledge transfer and learning method applied to autonomous skills of unmanned vehicles | |
CN107563653A (en) | Multi-robot full-coverage task allocation method | |
CN115562357A (en) | Intelligent path planning method for unmanned aerial vehicle cluster | |
CN114770497B (en) | Search and rescue method and device of search and rescue robot and storage medium | |
CN115906673B (en) | Combat entity behavior model integrated modeling method and system | |
KR101139259B1 (en) | Heap-based multi-agent system for the theater level, mission level or the engagement level simulation | |
Gao et al. | An adaptive framework to select the coordinate systems for evolutionary algorithms | |
Ponsini et al. | Analysis of soccer robot behaviors using time petri nets | |
Leonard et al. | Bootstrapped Neuro-Simulation as a method of concurrent neuro-evolution and damage recovery | |
Yuan et al. | Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks | |
CN114118441A (en) | Online planning method based on efficient search strategy under uncertain environment | |
Tang et al. | Reinforcement learning for robots path planning with rule-based shallow-trial | |
Eker et al. | A finite horizon dec-pomdp approach to multi-robot task learning | |
Shiltagh et al. | A comparative study: Modified particle swarm optimization and modified genetic algorithm for global mobile robot navigation | |
Schubert et al. | Decision support for crowd control: Using genetic algorithms with simulation to learn control strategies | |
Hart et al. | Dante agent architecture for force-on-force wargame simulation and training | |
de Carvalho Santos et al. | A hybrid ga-ann approach for autonomous robots topological navigation | |
CN116227361B (en) | Intelligent body decision method and device | |
Sinclair et al. | A generic cognitive architecture framework with personality and emotions for crowd simulation | |
Cruz-Álvarez et al. | Robotic behavior implementation using two different differential evolution variants |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||