CN114770497A - Search and rescue method and device of search and rescue robot and storage medium - Google Patents

Search and rescue method and device of search and rescue robot and storage medium

Info

Publication number
CN114770497A
CN114770497A CN202210328204.2A CN202210328204A
Authority
CN
China
Prior art keywords
search
rescue
robot
strategy
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210328204.2A
Other languages
Chinese (zh)
Other versions
CN114770497B (en)
Inventor
林泽阳
赖俊
陈希亮
王军
刘志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202210328204.2A priority Critical patent/CN114770497B/en
Publication of CN114770497A publication Critical patent/CN114770497A/en
Application granted granted Critical
Publication of CN114770497B publication Critical patent/CN114770497B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 - Programme-controlled manipulators
    • B25J 9/16 - Programme controls
    • B25J 9/1679 - Programme controls characterised by the tasks executed

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)

Abstract

The invention provides a search and rescue method and device of a search and rescue robot and a storage medium, wherein the method comprises the following steps: when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot; generating an automatic search and rescue strategy from the self state based on the trained search and rescue strategy model; and executing search and rescue actions according to the automatic search and rescue strategy. The training of the automatic search and rescue strategy model comprises: constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment; constructing an automatic search and rescue strategy model based on a VDN algorithm; and initializing the automatic search and rescue strategy model and training it based on the training task. The invention can improve the learning efficiency of the reinforcement learning of the search and rescue robot and meet the real-time requirement of the search and rescue robot in actual tasks.

Description

Search and rescue method and device of search and rescue robot and storage medium
Technical Field
The invention relates to a search and rescue method and device of a search and rescue robot and a storage medium, and belongs to the technical field of unmanned driving.
Background
The search and rescue robot is an intelligent robot that can take the place of search and rescue personnel on the front line and perform dangerous tasks such as personnel rescue and information detection when emergencies such as urban natural disasters, chemical explosions and fires occur. When disasters such as earthquakes, fires, chemical explosions and nuclear explosions occur, the building structures at the rescue site are extremely unstable and secondary disasters may occur at any time, posing great risks to the lives and health of rescuers. An intelligent search and rescue robot based on deep reinforcement learning can enter narrow gaps to search for signs of life and detect field information according to expert instructions, adjust its search and rescue strategy according to real-time observation of the field environment, and avoid damage to itself. It is an important branch and development direction of current intelligent robot applications and is of great significance to the intelligent development of rescue work.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a search and rescue method and device of a search and rescue robot and a storage medium, which can improve the learning efficiency of the reinforcement learning of the search and rescue robot and meet the real-time requirement of the search and rescue robot in actual tasks.
To achieve the above purpose, the invention adopts the following technical solutions:
in a first aspect, the present invention provides a search and rescue method for a search and rescue robot, including:
when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot;
generating an automatic search and rescue strategy from the self state based on the trained search and rescue strategy model;
executing search and rescue actions according to the automatic search and rescue strategy;
wherein the training of the automatic search and rescue strategy model comprises:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
initializing the automatic search and rescue strategy model and training based on the training task.
Optionally, the training task is to set a search and rescue range in the simulation environment, configure p search and rescue robots, m survivors, and n obstacles in the search and rescue range, and generate a new obstacle at any position every preset time period in the search and rescue range; the search and rescue robot searches for survivors in the search and rescue range, and simultaneously avoids touching obstacles and other search and rescue robots.
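For illustration only, the training task described above could be parameterised along the following lines in Python; the grid size, the counts p, m, n and the spawn interval shown here are placeholder values, not values specified by the invention.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SearchRescueTask:
    """Hypothetical parameterisation of the training task described above."""
    width: int = 50                  # assumed extent of the search and rescue range
    height: int = 50
    num_robots: int = 3              # p search and rescue robots
    num_survivors: int = 5           # m survivors
    num_obstacles: int = 10          # n obstacles
    spawn_interval: int = 20         # preset time period (in steps) between new obstacles
    obstacles: List[Tuple[int, int]] = field(default_factory=list)
    survivors: List[Tuple[int, int]] = field(default_factory=list)

    def reset(self) -> None:
        """Place the initial obstacles and survivors at random positions in the range."""
        self.obstacles = [self._random_cell() for _ in range(self.num_obstacles)]
        self.survivors = [self._random_cell() for _ in range(self.num_survivors)]

    def maybe_spawn_obstacle(self, t: int) -> None:
        """Generate a new obstacle at an arbitrary position every spawn_interval steps."""
        if t > 0 and t % self.spawn_interval == 0:
            self.obstacles.append(self._random_cell())

    def _random_cell(self) -> Tuple[int, int]:
        return (random.randrange(self.width), random.randrange(self.height))
```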
Optionally, initializing the automatic search and rescue strategy model and training based on the training task includes:
initializing an automatic search and rescue strategy model, including initializing a real action network, a target action network, a real evaluation network and a target evaluation network of the automatic search and rescue strategy model;
executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
training and updating the automatic search and rescue strategy model with the model training sample set, substituting the updated automatic search and rescue strategy model for the initialized automatic search and rescue strategy model, and returning to the above steps for iteration;
and finishing the training when the preset maximum number of iterations is reached.
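For illustration, the overall iteration described above (initialize the networks, collect a sample set, update the model, repeat until the preset maximum number of iterations) might be organised as in the following sketch; `model` and `env` and their methods are assumed interfaces rather than components defined by the invention.

```python
def train_strategy_model(model, env, max_iterations, samples_per_iteration):
    """Outer training loop sketched from the steps above (assumed interfaces).

    Assumed interface:
      model.initialize()             builds the real/target action and evaluation networks
      model.collect_samples(env, n)  runs the training task in simulation, returns n replay arrays
      model.update(samples)          trains on the sample set and returns the updated model
    """
    model.initialize()
    for _ in range(max_iterations):                 # stop at the preset maximum number of iterations
        samples = model.collect_samples(env, samples_per_iteration)
        model = model.update(samples)               # the updated model replaces the previous one
    return model
```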
Optionally, the obtaining of the model training sample set includes:
acquiring a current state S and an environment observation value O of the search and rescue robot;
selecting an action strategy a from an action set A based on a current state S and an environment observation value O according to an initialized automatic search and rescue strategy model;
according to the action strategy a, driving the search and rescue robot to search and rescue automatically in the simulation environment, and obtaining the next state S* and the next environment observation value O* of the search and rescue robot;
according to the next state S* and environment observation value O* of the search and rescue robot, acquiring the score R and the termination state E of the action strategy a;
saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R and the termination state E as a cache replay array, denoted {φ(S), φ(S*), a, R, E};
storing the cache replay array into a pre-constructed cache replay experience pool D, and repeating the above steps until the number of cache replay arrays in the cache replay experience pool D reaches a preset number;
randomly selecting T cache replay arrays from the cache replay experience pool D to generate a model training sample set D_T.
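For illustration only, the sample-collection procedure above might be sketched as follows; `env`, `policy` and the feature map `phi` are assumed interfaces, and the tuple layout simply mirrors the cache replay array {φ(S), φ(S*), a, R, E}.

```python
import random
from collections import deque

def collect_training_samples(env, policy, phi, pool_size, sample_count_T):
    """Fill a cache replay experience pool D and draw T arrays as the training set D_T.

    Assumed interfaces:
      env.observe()                -> (S, O)  current state and environment observation value
      env.step(a)                  -> (S_next, O_next, R, E)  next state/observation, score, termination flag
      policy.select_action(S, O)   -> a       action strategy chosen from the action set A
      phi(S)                       -> feature vector of a state
    """
    D = deque(maxlen=pool_size)                       # cache replay experience pool D
    while len(D) < pool_size:
        S, O = env.observe()
        a = policy.select_action(S, O)
        S_next, O_next, R, E = env.step(a)
        D.append((phi(S), phi(S_next), a, R, E))      # one cache replay array
        if E:                                         # restart an episode on a termination state
            env.reset()
    return random.sample(list(D), sample_count_T)     # model training sample set D_T
```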
Optionally, the state includes search and rescue robot coordinates, and the environment observation value includes obstacles, survivors, and other search and rescue robots within a preset range around the search and rescue robot coordinates.
Optionally, acquiring the score R and the termination state E of the action strategy a according to the next state S* and environment observation value O* of the search and rescue robot includes:
if the coordinates of the search and rescue robot in the next state S* lie within the search and rescue range and a survivor exists in the next environment observation value O*, acquiring a preset reward point;
if the coordinates of the search and rescue robot in the next state S* lie within the search and rescue range, judging whether a collision occurs according to the coordinates of the search and rescue robot and the coordinates of obstacles or of other search and rescue robots; if a collision occurs, deducting a preset collision score from the reward points to obtain the score R, and setting the termination state to terminated;
if the coordinates of the search and rescue robot in the next state S* are outside the search and rescue range, setting the termination state to terminated.
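A minimal sketch of these scoring rules, under assumed helper predicates and placeholder point values (none of which are specified by the invention), might look like this:

```python
def score_and_terminate(next_state, next_obs, in_range, collided,
                        reward_point=10.0, collision_penalty=5.0):
    """Return (score R, termination state E) for an action, following the rules above.

    Assumed helpers:
      in_range(next_state)  -> True if the robot coordinates lie within the search and rescue range
      collided(next_state)  -> True if the robot coordinates coincide with an obstacle or another robot
      next_obs.get("survivor_found") -> True if a survivor appears in the next observation O*
    """
    R, E = 0.0, False
    if not in_range(next_state):
        return R, True                       # outside the search and rescue range: terminate
    if next_obs.get("survivor_found", False):
        R += reward_point                    # preset reward point for finding a survivor
    if collided(next_state):
        R -= collision_penalty               # deduct the preset collision score
        E = True                             # a collision terminates the episode
    return R, E
```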
Optionally, the training and updating the automatic search and rescue strategy model through the model training sample set includes:
according to the model training sample set D_T^i of the i-th search and rescue robot, calculating the target reward value y_t^i of the t-th cache replay array of the i-th search and rescue robot:

y_t^i = R_t + γ·Q_i′(S_{t+1}, π′(S_{t+1}), ω′)

wherein i = 1, 2, 3, …, p, t = 1, 2, 3, …, T, R_t is the score of the t-th cache replay array, γ is a discount factor, π′(·) is the policy action generated by the target action network, ω′ is the weight parameter of the target evaluation network, Q_i′(·) is the evaluation value generated by the target evaluation network, and S_{t+1} is the next state following the current state S_t of the t-th cache replay array;

linearly adding the target reward values y_t^i of the t-th cache replay array of all the search and rescue robots to obtain the target reward value y_t^tot of all the search and rescue robots:

y_t^tot = Σ_{i=1}^{p} y_t^i;

constructing a first loss function based on the target reward value y_t^tot, and updating the weight parameter ω of the real evaluation network through gradient back propagation of the neural network; the first loss function is:

L(ω) = (1/T)·Σ_{t=1}^{T} ( y_t^tot − Σ_{i=1}^{p} Q_i(S_t, π(S_t), ω) )²,

wherein π(·) is the policy action generated by the real action network and Q_i(·) is the evaluation value generated by the real evaluation network;

constructing a second loss function based on the target reward value y_t^tot, and updating the weight parameter θ of the real action network through gradient back propagation of the neural network; the second loss function is:

J(θ) = −(1/T)·Σ_{t=1}^{T} Σ_{i=1}^{p} Q_i(S_t, π(S_t, θ), ω);

if t % C = 0, updating the weight parameters ω′ and θ′ of the target action network and the target evaluation network according to the weight parameters ω and θ of the real action network and the real evaluation network:

ω′ ← τω + (1 − τ)ω′
θ′ ← τθ + (1 − τ)θ′

wherein C is the update frequency of the weight parameters of the target action network and the target evaluation network, and τ is an update coefficient;

sorting the target reward values y_t^i obtained by each search and rescue robot over the model training sample set D_T from high to low, and reassigning, according to the sorting result, the selection probabilities ε of the corresponding action strategies a_t^i from low to high;

and updating the strategy model based on the updated weight parameters ω, θ, ω′, θ′ of the real action network, the real evaluation network, the target action network and the target evaluation network and on the selection probability ε.
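For concreteness, one possible PyTorch sketch of the update described above (per-robot target reward values, VDN-style linear addition, the two losses and the soft update) is given below; the tensor layout, the network interfaces, the optimizer grouping and the continuous action parameterisation are assumptions for illustration, not the invention's implementation, and the priority-based reassignment of ε is omitted here.

```python
import torch

def vdn_update(batch, actors, critics, target_actors, target_critics,
               actor_opt, critic_opt, gamma=0.99, tau=0.01):
    """One update in the spirit of the procedure above (assumed interfaces).

    batch: dict of tensors
      "phi_S"      [T, p, state_dim]  feature vectors of current states
      "phi_S_next" [T, p, state_dim]  feature vectors of next states
      "a"          [T, p, act_dim]    action strategies taken (continuous parameterisation assumed)
      "R"          [T]                scores
    actors, critics, target_actors, target_critics: lists of p torch.nn.Module networks,
      where critics[i](state, action) returns a [T, 1] evaluation value.
    actor_opt / critic_opt: optimizers over all actors' / all critics' parameters.
    """
    p = len(actors)
    R = batch["R"]

    # Per-robot target reward values y_t^i, then their linear addition y_t^tot (VDN).
    with torch.no_grad():
        y_list = []
        for i in range(p):
            s_next = batch["phi_S_next"][:, i]
            a_next = target_actors[i](s_next)                                   # π'(S_{t+1})
            y_list.append(R + gamma * target_critics[i](s_next, a_next).squeeze(-1))
        y_tot = torch.stack(y_list, dim=0).sum(dim=0)

    # First loss: pull the summed real evaluation values towards y_tot (updates ω).
    q_tot = sum(critics[i](batch["phi_S"][:, i], batch["a"][:, i]).squeeze(-1)
                for i in range(p))
    critic_loss = torch.mean((y_tot - q_tot) ** 2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Second loss: raise the evaluation of the actions proposed by the real action networks (updates θ).
    q_pi = sum(critics[i](batch["phi_S"][:, i], actors[i](batch["phi_S"][:, i])).squeeze(-1)
               for i in range(p))
    actor_loss = -q_pi.mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks: ω' ← τω + (1-τ)ω', θ' ← τθ + (1-τ)θ'.
    for net, tgt in list(zip(actors, target_actors)) + list(zip(critics, target_critics)):
        for param, tgt_param in zip(net.parameters(), tgt.parameters()):
            tgt_param.data.copy_(tau * param.data + (1.0 - tau) * tgt_param.data)
```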
In a second aspect, the present invention provides a search and rescue apparatus for a search and rescue robot, the apparatus including:
the information acquisition module is used for initializing the self state of the search and rescue robot when the search and rescue instruction is acquired;
the strategy generation module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to the automatic search and rescue strategy;
wherein the training of the automatic search and rescue strategy model comprises:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
initializing the automatic search and rescue strategy model and training based on the training task.
In a third aspect, the invention provides a search and rescue device of a search and rescue robot, comprising a processor and a storage medium;
the storage medium is to store instructions;
the processor is configured to operate in accordance with the instructions to perform the steps according to the above-described method.
In a fourth aspect, the invention provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above-mentioned method.
Compared with the prior art, the invention has the following beneficial effects:
According to the search and rescue method and device of the search and rescue robot and the storage medium of the invention, a scoring priority rule is set in the reinforcement learning process: all the search and rescue robots are ranked from high to low by the Q values they obtain and are assigned selection probabilities from low to high, which effectively solves the inertia problem in the learning of multiple search and rescue robots. In addition, a VDN algorithm structure is applied to linearly add the reward values of the individual search and rescue robots, and the sum is used as the basis for each robot to execute its own action, which effectively solves the false reward problem in the learning of multiple search and rescue robots. In conclusion, the invention can improve the learning efficiency of the reinforcement learning of the search and rescue robot and meet the real-time requirement of the search and rescue robot in actual tasks.
Drawings
Fig. 1 is a flowchart of a search and rescue method of a search and rescue robot according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Embodiment 1:
as shown in fig. 1, an embodiment of the present invention provides a search and rescue method for a search and rescue robot, including the following steps:
1. when a search and rescue instruction is obtained, initializing the self state of the search and rescue robot;
2. generating an automatic search and rescue strategy from the self state based on the trained search and rescue strategy model;
3. executing search and rescue actions according to the automatic search and rescue strategy;
wherein, the training of the automatic search and rescue strategy model comprises the following steps:
s1, constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
specifically, the training task is to set a search and rescue range in the simulation environment, configure p search and rescue robots, m survivors and n obstacles in the search and rescue range, and generate a new obstacle at any position every preset time period in the search and rescue range; the search and rescue robot searches for survivors in the search and rescue range, and simultaneously avoids touching obstacles and other search and rescue robots.
And S2, constructing an automatic search and rescue strategy model based on a VDN algorithm.
S3, initializing the automatic search and rescue strategy model and training based on the training task;
the training comprises the following steps:
(1) initializing an automatic search and rescue strategy model, including initializing a real action network, a target action network, a real evaluation network and a target evaluation network of the automatic search and rescue strategy model;
(2) executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
specifically, obtaining the model training sample set includes:
1. acquiring a current state S and an environmental observation value O of the search and rescue robot;
the state includes search and rescue robot coordinate, and the environment observation value includes barrier, survivor and other search and rescue robots of predetermineeing the within range around the search and rescue robot coordinate.
2. Selecting an action strategy a from an action set A based on the current state S and the environment observation value O according to the initialized automatic search and rescue strategy model;
the action set a includes, but is not limited to, forward action, reverse action, left turn action, right turn action.
3. According to the action strategy a, the search and rescue robot is driven to search and rescue automatically in the simulation environment, and the next state S* and the next environment observation value O* of the search and rescue robot are obtained;
4. According to the next state S* and environment observation value O* of the search and rescue robot, the score R and the termination state E of the action strategy a are acquired;
if the coordinates of the search and rescue robot in the next state S* lie within the search and rescue range and a survivor exists in the next environment observation value O*, a preset reward point is acquired;
if the coordinates of the search and rescue robot in the next state S* lie within the search and rescue range, whether a collision occurs is judged according to the coordinates of the search and rescue robot and the coordinates of obstacles or of other search and rescue robots; if a collision occurs, a preset collision score is deducted from the reward points to obtain the score R, and the termination state is set to terminated;
if the coordinates of the search and rescue robot in the next state S* are outside the search and rescue range, the termination state is set to terminated.
5. The feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R and the termination state E are saved as a cache replay array, denoted {φ(S), φ(S*), a, R, E};
6. The cache replay array is stored into a pre-constructed cache replay experience pool D, and the above steps are repeated until the number of cache replay arrays in the cache replay experience pool D reaches a preset number;
7. T cache replay arrays are randomly selected from the cache replay experience pool D to generate a model training sample set D_T.
(3) Training and updating the automatic search and rescue strategy model with the model training sample set, substituting the updated automatic search and rescue strategy model for the initialized automatic search and rescue strategy model, and returning to the above steps for iteration;
(4) Finishing the training when the preset maximum number of iterations is reached.
Specifically, the training and updating of the automatic search and rescue strategy model through the model training sample set comprises the following steps:
1. According to the model training sample set D_T^i of the i-th search and rescue robot, calculate the target reward value y_t^i of the t-th cache replay array of the i-th search and rescue robot:

y_t^i = R_t + γ·Q_i′(S_{t+1}, π′(S_{t+1}), ω′)

wherein i = 1, 2, 3, …, p, t = 1, 2, 3, …, T, R_t is the score of the t-th cache replay array, γ is a discount factor, π′(·) is the policy action generated by the target action network, ω′ is the weight parameter of the target evaluation network, Q_i′(·) is the evaluation value generated by the target evaluation network, and S_{t+1} is the next state following the current state S_t of the t-th cache replay array.

2. Linearly add the target reward values y_t^i of the t-th cache replay array of all the search and rescue robots to obtain the target reward value y_t^tot of all the search and rescue robots:

y_t^tot = Σ_{i=1}^{p} y_t^i

3. Based on the target reward value y_t^tot, construct a first loss function and update the weight parameter ω of the real evaluation network through gradient back propagation of the neural network; the first loss function is:

L(ω) = (1/T)·Σ_{t=1}^{T} ( y_t^tot − Σ_{i=1}^{p} Q_i(S_t, π(S_t), ω) )²,

wherein π(·) is the policy action generated by the real action network and Q_i(·) is the evaluation value generated by the real evaluation network.

4. Based on the target reward value y_t^tot, construct a second loss function and update the weight parameter θ of the real action network through gradient back propagation of the neural network; the second loss function is:

J(θ) = −(1/T)·Σ_{t=1}^{T} Σ_{i=1}^{p} Q_i(S_t, π(S_t, θ), ω)

5. If t % C = 0, update the weight parameters ω′ and θ′ of the target action network and the target evaluation network according to the weight parameters ω and θ of the real action network and the real evaluation network:

ω′ ← τω + (1 − τ)ω′
θ′ ← τθ + (1 − τ)θ′

wherein C is the update frequency of the weight parameters of the target action network and the target evaluation network, and τ is an update coefficient.

6. Sort the target reward values y_t^i obtained by each search and rescue robot over the model training sample set D_T from high to low, and reassign, according to the sorting result, the selection probabilities ε of the corresponding action strategies a_t^i from low to high.

7. Update the strategy model based on the updated weight parameters ω, θ, ω′, θ′ of the real action network, the real evaluation network, the target action network and the target evaluation network and on the selection probability ε.
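As an illustrative sketch of step 6 (the scoring priority rule), the robots can be ranked by the target reward values they obtained and handed selection probabilities ε in the opposite order; the dictionary interface and the example probability values below are assumptions, not part of the invention.

```python
from typing import Dict, List

def reassign_selection_probabilities(target_values: Dict[int, float],
                                     probabilities: List[float]) -> Dict[int, float]:
    """Rank robots by their obtained target reward value (high to low) and hand out
    selection probabilities epsilon from low to high, so that robots that already
    score well are reselected less often and lagging robots more often.

    target_values: robot index -> target reward value obtained over D_T
    probabilities: candidate epsilon values, assumed sorted from low to high,
                   one per robot (e.g. [0.05, 0.10, 0.20] for p = 3 robots).
    """
    ranked = sorted(target_values, key=target_values.get, reverse=True)  # high -> low
    return {robot: eps for robot, eps in zip(ranked, probabilities)}

# Example usage (illustrative numbers only):
# reassign_selection_probabilities({0: 12.4, 1: 3.1, 2: 7.8}, [0.05, 0.10, 0.20])
# -> {0: 0.05, 2: 0.10, 1: 0.20}
```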
In this embodiment,
(1) in the process of reinforcement learning by multiple search and rescue robots, a scoring priority rule is set so that all the search and rescue robots are ranked from high to low by the Q values they obtain and are assigned selection probabilities from low to high, which effectively solves the inertia problem in the learning of multiple search and rescue robots;
(2) in the process of reinforcement learning by multiple search and rescue robots, a VDN (Value Decomposition Network) structure is applied to linearly add the reward values of the individual search and rescue robots, and the sum is used as the basis for each robot to execute its own action, which effectively solves the false reward problem in the learning of multiple search and rescue robots.
Embodiment 2:
the embodiment of the invention provides a search and rescue device of a search and rescue robot, which comprises:
the information acquisition module is used for initializing the self state of the search and rescue robot when the search and rescue instruction is acquired;
the strategy generation module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to the automatic search and rescue strategy;
wherein, the training of the automatic search and rescue strategy model comprises the following steps:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
and initializing the automatic search and rescue strategy model and training based on the training task.
Embodiment 3:
based on the first embodiment, the embodiment of the invention provides a search and rescue device of a search and rescue robot, which comprises a processor and a storage medium, wherein the processor is used for processing a search and rescue signal;
a storage medium to store instructions;
the processor is configured to operate in accordance with instructions to perform steps in accordance with the above-described method.
Embodiment 4:
based on the first embodiment, the present invention provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program is configured to implement the steps of the method when executed by a processor.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, it is possible to make various improvements and modifications without departing from the technical principle of the present invention, and those improvements and modifications should be considered as the protection scope of the present invention.

Claims (10)

1. A search and rescue method of a search and rescue robot is characterized by comprising the following steps:
when a search and rescue instruction is acquired, initializing the self state of the search and rescue robot;
generating an automatic search and rescue strategy from the self state based on the trained search and rescue strategy model;
executing search and rescue actions according to the automatic search and rescue strategy;
wherein the training of the automatic search and rescue strategy model comprises:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
and initializing the automatic search and rescue strategy model and training based on the training task.
2. The search and rescue method for search and rescue robots according to claim 1, characterized in that the training task is to set a search and rescue range in a simulation environment, to configure p search and rescue robots, m survivors, n obstacles in the search and rescue range, and to generate a new obstacle at any position every preset time period in the search and rescue range; the search and rescue robot searches for survivors in the search and rescue range, and simultaneously avoids touching obstacles and other search and rescue robots.
3. The search and rescue method of a search and rescue robot as claimed in claim 1, wherein initializing the automatic search and rescue strategy model and training based on the training task includes:
initializing an automatic search and rescue strategy model, including initializing a real action network, a target action network, a real evaluation network and a target evaluation network of the automatic search and rescue strategy model;
executing a training task based on an automatic search and rescue strategy model in a search and rescue simulation environment to obtain a model training sample set;
training and updating the automatic search and rescue strategy model with the model training sample set, substituting the updated automatic search and rescue strategy model for the initialized automatic search and rescue strategy model, and returning to the above steps for iteration;
and finishing the training when the preset maximum number of iterations is reached.
4. The method as claimed in claim 1, wherein the obtaining of the set of model training samples comprises:
acquiring a current state S and an environmental observation value O of the search and rescue robot;
selecting an action strategy a from an action set A based on a current state S and an environment observation value O according to an initialized automatic search and rescue strategy model;
according to the action strategy a, driving the search and rescue robot to search and rescue automatically in the simulation environment, and obtaining the next state S* and the next environment observation value O* of the search and rescue robot;
according to the next state S* and environment observation value O* of the search and rescue robot, acquiring the score R and the termination state E of the action strategy a;
saving the feature vector φ(S) of the current state of the search and rescue robot, the feature vector φ(S*) of the next state, the action strategy a, the score R and the termination state E as a cache replay array, denoted {φ(S), φ(S*), a, R, E};
storing the cache replay array into a pre-constructed cache replay experience pool D, and repeating the above steps until the number of cache replay arrays in the cache replay experience pool D reaches a preset number;
randomly selecting T cache replay arrays from the cache replay experience pool D to generate a model training sample set D_T.
5. The method as claimed in claim 4, wherein the state includes search and rescue robot coordinates, and the environmental observation value includes obstacles, survivors and other search and rescue robots within a preset range around the search and rescue robot coordinates.
6. A search and rescue method for a search and rescue robot as claimed in claim 5, characterized in that acquiring the score R and the termination state E of the action strategy a according to the next state S* and environment observation value O* of the search and rescue robot comprises the following steps:
if the coordinates of the search and rescue robot in the next state S* lie within the search and rescue range and a survivor exists in the next environment observation value O*, acquiring a preset reward point;
if the coordinates of the search and rescue robot in the next state S* lie within the search and rescue range, judging whether a collision occurs according to the coordinates of the search and rescue robot and the coordinates of obstacles or of other search and rescue robots, and if a collision occurs, deducting a preset collision score from the reward points to obtain the score R and setting the termination state to terminated;
if the coordinates of the search and rescue robot in the next state S* are outside the search and rescue range, setting the termination state to terminated.
7. The method as claimed in claim 5, wherein the training and updating of the automatic search and rescue strategy model by the model training sample set comprises:
calculating, according to the model training sample set D_T^i of the i-th search and rescue robot, the target reward value y_t^i of the t-th cache replay array of the i-th search and rescue robot:

y_t^i = R_t + γ·Q_i′(S_{t+1}, π′(S_{t+1}), ω′)

wherein i = 1, 2, 3, …, p, t = 1, 2, 3, …, T, R_t is the score of the t-th cache replay array, γ is a discount factor, π′(·) is the policy action generated by the target action network, ω′ is the weight parameter of the target evaluation network, Q_i′(·) is the evaluation value generated by the target evaluation network, and S_{t+1} is the next state following the current state S_t of the t-th cache replay array;

linearly adding the target reward values y_t^i of the t-th cache replay array of all the search and rescue robots to obtain the target reward value y_t^tot of all the search and rescue robots:

y_t^tot = Σ_{i=1}^{p} y_t^i;

constructing a first loss function based on the target reward value y_t^tot, and updating the weight parameter ω of the real evaluation network through gradient back propagation of the neural network, the first loss function being:

L(ω) = (1/T)·Σ_{t=1}^{T} ( y_t^tot − Σ_{i=1}^{p} Q_i(S_t, π(S_t), ω) )²,

wherein π(·) is the policy action generated by the real action network and Q_i(·) is the evaluation value generated by the real evaluation network;

constructing a second loss function based on the target reward value y_t^tot, and updating the weight parameter θ of the real action network through gradient back propagation of the neural network, the second loss function being:

J(θ) = −(1/T)·Σ_{t=1}^{T} Σ_{i=1}^{p} Q_i(S_t, π(S_t, θ), ω);

if t % C = 0, updating the weight parameters ω′ and θ′ of the target action network and the target evaluation network according to the weight parameters ω and θ of the real action network and the real evaluation network:

ω′ ← τω + (1 − τ)ω′
θ′ ← τθ + (1 − τ)θ′

wherein C is the update frequency of the weight parameters of the target action network and the target evaluation network, and τ is an update coefficient;

sorting the target reward values y_t^i obtained by each search and rescue robot over the model training sample set D_T from high to low, and reassigning, according to the sorting result, the selection probabilities ε of the corresponding action strategies a_t^i from low to high;

and updating the strategy model based on the updated weight parameters ω, θ, ω′, θ′ of the real action network, the real evaluation network, the target action network and the target evaluation network and on the selection probability ε.
8. A search and rescue apparatus of a search and rescue robot, the apparatus comprising:
the information acquisition module is used for initializing the self state of the search and rescue robot when acquiring the search and rescue instruction;
the strategy generating module is used for generating an automatic search and rescue strategy according to the self state based on the trained search and rescue strategy model;
the search and rescue execution module is used for executing search and rescue actions according to the automatic search and rescue strategy;
wherein the training of the automatic search and rescue strategy model comprises:
constructing a search and rescue simulation environment, and formulating a training task according to the simulation environment;
constructing an automatic search and rescue strategy model based on a VDN algorithm;
initializing the automatic search and rescue strategy model and training based on the training task.
9. A search and rescue device of a search and rescue robot is characterized by comprising a processor and a storage medium;
the storage medium is to store instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 1 to 7.
10. Computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210328204.2A 2022-03-31 2022-03-31 Search and rescue method and device of search and rescue robot and storage medium Active CN114770497B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210328204.2A CN114770497B (en) 2022-03-31 2022-03-31 Search and rescue method and device of search and rescue robot and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210328204.2A CN114770497B (en) 2022-03-31 2022-03-31 Search and rescue method and device of search and rescue robot and storage medium

Publications (2)

Publication Number Publication Date
CN114770497A true CN114770497A (en) 2022-07-22
CN114770497B CN114770497B (en) 2024-02-02

Family

ID=82427854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210328204.2A Active CN114770497B (en) 2022-03-31 2022-03-31 Search and rescue method and device of search and rescue robot and storage medium

Country Status (1)

Country Link
CN (1) CN114770497B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9811074B1 (en) * 2016-06-21 2017-11-07 TruPhysics GmbH Optimization of robot control programs in physics-based simulated environment
US10792810B1 (en) * 2017-12-14 2020-10-06 Amazon Technologies, Inc. Artificial intelligence system for learning robotic control policies
CN109733415A (en) * 2019-01-08 2019-05-10 同济大学 A kind of automatic Pilot following-speed model that personalizes based on deeply study
CN111984018A (en) * 2020-09-25 2020-11-24 斑马网络技术有限公司 Automatic driving method and device
CN113031528A (en) * 2021-02-25 2021-06-25 电子科技大学 Multi-legged robot motion control method based on depth certainty strategy gradient
CN113276883A (en) * 2021-04-28 2021-08-20 南京大学 Unmanned vehicle driving strategy planning method based on dynamic generation environment and implementation device

Also Published As

Publication number Publication date
CN114770497B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
Orozco-Rosas et al. Mobile robot path planning using membrane evolutionary artificial potential field
Yuan et al. Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks
CN112362066A (en) Path planning method based on improved deep reinforcement learning
Niroui et al. Robot exploration in unknown cluttered environments when dealing with uncertainty
Xiao et al. Multigoal visual navigation with collision avoidance via deep reinforcement learning
Liu et al. Episodic memory-based robotic planning under uncertainty
Sheh et al. Behavioural cloning for driving robots over rough terrain
CN114770497A (en) Search and rescue method and device of search and rescue robot and storage medium
Zhang et al. Auto-conditioned recurrent mixture density networks for learning generalizable robot skills
Ponsini et al. Analysis of soccer robot behaviors using time petri nets
Bar et al. Deep Reinforcement Learning Approach with adaptive reward system for robot navigation in Dynamic Environments
Cabreira et al. An evolutionary learning approach for robot path planning with fuzzy obstacle detection and avoidance in a multi-agent environment
CN114118441A (en) Online planning method based on efficient search strategy under uncertain environment
Nguyen et al. A broad-persistent advising approach for deep interactive reinforcement learning in robotic environments
Dudarenko et al. Reinforcement Learning Approach for Navigation of Ground Robotic Platform in Statically and Dynamically Generated Environments
Shiltagh et al. A comparative study: Modified particle swarm optimization and modified genetic algorithm for global mobile robot navigation
Cruz-Álvarez et al. Robotic behavior implementation using two different differential evolution variants
de Carvalho Santos et al. A hybrid ga-ann approach for autonomous robots topological navigation
Das et al. Improved real time A*-fuzzy controller for improving multi-robot navigation and its performance analysis
Yonemoto GA-based action learning
CN115933734A (en) Multi-machine exploration method and system under energy constraint based on deep reinforcement learning
Lamini et al. Q-Free Walk Ant Hybrid Architecture for Mobile Robot Path Planning in Dynamic Environment
Via et al. Autonomous Robot Path Planning Using Ant Colony Optimization and Evolutionary Programming
Parker et al. Cyclic genetic algorithms for evolving multi-loop control programs

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant