CN117271967B - Rescue co-location method and system based on reinforcement learning compensation filtering - Google Patents

Rescue co-location method and system based on reinforcement learning compensation filtering

Info

Publication number
CN117271967B
CN117271967B (application CN202311537570.XA)
Authority
CN
China
Prior art keywords
positioning result
network
updating
action
local positioning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311537570.XA
Other languages
Chinese (zh)
Other versions
CN117271967A (en)
Inventor
王然
徐诚
孙敬
段世红
张晓彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN202311537570.XA priority Critical patent/CN117271967B/en
Publication of CN117271967A publication Critical patent/CN117271967A/en
Application granted granted Critical
Publication of CN117271967B publication Critical patent/CN117271967B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Navigation (AREA)

Abstract

The invention relates to the technical field of cooperative positioning, in particular to a rescue cooperative positioning method and system based on reinforcement learning compensation filtering. The rescue co-location method based on reinforcement learning compensation filtering comprises the following steps: acquiring data through a micro unmanned aerial vehicle to obtain self-information and observation information; obtaining a preliminary position estimate through an extended Kalman filter algorithm; performing filter gain compensation on the extended Kalman filter algorithm by adopting a reinforcement learning method to obtain a local positioning result; updating the central evaluation network according to the local positioning result and a preset strategy network to obtain an updated evaluation network; obtaining an updating action through the updated evaluation network and the preset strategy network; correcting the local positioning result according to the updating action to obtain an accurate positioning result; and carrying out rescue route planning according to the accurate positioning result. The rescue co-location method based on reinforcement learning compensation filtering has high accuracy and strong robustness.

Description

Rescue co-location method and system based on reinforcement learning compensation filtering
Technical Field
The invention relates to the technical field of cooperative positioning, in particular to a rescue cooperative positioning method and system based on reinforcement learning compensation filtering.
Background
In a sudden emergency rescue event, the prior knowledge that searchers can obtain about the surrounding environment is very limited, which poses a great challenge to the search process. In the collaborative search process, once effective positioning information of the target is obtained, the rescue target can be continuously positioned and tracked, and the positioning accuracy directly affects the subsequent rescue path planning and the efficiency of the rescue activity. Real-time, reliable positioning plays a key role in detecting the direction of a rescue target and planning a rescue path, and provides a powerful guarantee for making correct follow-up decisions and taking corresponding measures.
In a highly dynamic environment, collaborative technology can fuse the perception information acquired by individuals, and the information gain among group target nodes is obtained through mutual communication among the agents. In order to meet the requirements of high precision and real-time operation, researchers have explored various co-location methods, among which ultra-wideband and inertial-measurement-unit co-location techniques have received great attention for their unique advantages.
However, Ultra-Wideband/Inertial Measurement Unit ("UWB/IMU") co-location still faces some challenges and shortcomings that need to be addressed further. First, initial positioning is a critical issue, especially without a priori information or reference base stations; accurate initial positioning is critical to the subsequent co-location algorithm and to system performance. Second, there are problems of error accumulation between UWB and IMU and of inconsistent error distribution between agents, and an optimization method is needed to reduce the cooperative error and improve the overall positioning accuracy. Existing co-location methods perform well under well-calibrated experimental conditions, but are not reliable in more complex dynamic environments, because they are very sensitive to the initial state estimate, and the initial estimate depends on empirical selection, so accuracy is difficult to guarantee. In complex unknown environments, the noise distribution is not constant, which results in a continuously changing environment model structure and requires a continuously adjusted filter gain. If the gain adjustment is not considered, the estimate may converge slowly or even diverge.
In the prior art, a rescue co-location method with high accuracy and strong robustness based on reinforcement learning compensation filtering is lacking.
Disclosure of Invention
The embodiment of the invention provides a rescue co-location method and system based on reinforcement learning compensation filtering. The technical scheme is as follows:
in one aspect, a rescue co-location method based on reinforcement learning compensation filtering is provided, the method is implemented by an electronic device, and the method includes:
acquiring data through a micro unmanned aerial vehicle to obtain self-information and observation information;
according to the self information and the observation information, obtaining preliminary position estimation through an extended Kalman filtering algorithm;
according to the preliminary position estimation, performing filtering gain compensation on the extended Kalman filtering algorithm by adopting a reinforcement learning method to obtain a local positioning result;
updating the central evaluation network according to the local positioning result and a preset strategy network to obtain an updated evaluation network;
according to the local positioning result, obtaining an updating action through the updating evaluation network and a preset strategy network;
correcting the local positioning result according to the updating action to obtain an accurate positioning result;
and carrying out rescue route planning according to the accurate positioning result.
Optionally, the obtaining the preliminary position estimation according to the self information and the observation information through an extended Kalman filtering algorithm includes:
calculating according to the self information and the observation information to obtain a priori estimated value;
and updating the prior estimated value according to the prior estimated value and the observation information to obtain preliminary position estimation.
Optionally, the filtering gain compensation is performed on the extended Kalman filtering algorithm by adopting a reinforcement learning method according to the preliminary position estimation, so as to obtain a local positioning result, including:
carrying out parameter association on the extended Kalman filtering algorithm and the multi-layer perceptron to obtain a parameter optimization model;
obtaining Kalman filtering gain through the parameter optimization model according to the preliminary position estimation;
and calculating according to the preliminary position estimation and the Kalman filtering gain to obtain a local positioning result.
Optionally, updating the central evaluation network according to the local positioning result and a preset policy network to obtain an updated evaluation network, including:
inputting the local positioning result into a preset strategy network to obtain an unmanned aerial vehicle action value;
obtaining corresponding actions according to the unmanned aerial vehicle action values;
and updating the central evaluation network based on the observation information and the corresponding action to obtain an updated evaluation network.
Optionally, the obtaining, according to the local positioning result, an update action through the update evaluation network and a preset policy network includes:
optimizing a preset strategy network based on the updating evaluation network to obtain an optimized strategy network;
inputting the local positioning result into the optimizing strategy network to obtain an updating action value;
and obtaining an updating action according to the updating action value.
Optionally, the correcting the local positioning result according to the updating action to obtain an accurate positioning result includes:
according to the updating action, an action adjustment direction and an action adjustment displacement are obtained;
and calculating according to the action adjustment direction, the action adjustment displacement and the local positioning result to obtain an accurate positioning result.
On the other hand, a rescue co-location system based on reinforcement learning compensation filtering is provided, the system being applied to implement the rescue co-location method based on reinforcement learning compensation filtering, and the system comprises a micro unmanned aerial vehicle and an electronic device, wherein:
the micro unmanned aerial vehicle is used for collecting data through the micro unmanned aerial vehicle to obtain self-information and observation information;
the electronic equipment is used for obtaining preliminary position estimation through an extended Kalman filtering algorithm according to the self information and the observation information; according to the preliminary position estimation, performing filtering gain compensation on the extended Kalman filtering algorithm by adopting a reinforcement learning method to obtain a local positioning result; updating the central evaluation network according to the local positioning result and a preset strategy network to obtain an updated evaluation network; according to the local positioning result, obtaining an updating action through the updating evaluation network and a preset strategy network; correcting the local positioning result according to the updating action to obtain an accurate positioning result; and carrying out rescue route planning according to the accurate positioning result.
Optionally, the electronic device is further configured to:
calculating according to the self information and the observation information to obtain a priori estimated value;
and updating the prior estimated value according to the prior estimated value and the observation information to obtain preliminary position estimation.
Optionally, the electronic device is further configured to:
carrying out parameter association on the extended Kalman filtering algorithm and the multi-layer perceptron to obtain a parameter optimization model;
obtaining Kalman filtering gain through the parameter optimization model according to the preliminary position estimation;
and calculating according to the preliminary position estimation and the Kalman filtering gain to obtain a local positioning result.
Optionally, the electronic device is further configured to:
inputting the local positioning result into a preset strategy network to obtain an unmanned aerial vehicle action value;
obtaining corresponding actions according to the unmanned aerial vehicle action values;
and updating the central evaluation network based on the observation information and the corresponding action to obtain an updated evaluation network.
Optionally, the electronic device is further configured to:
optimizing a preset strategy network based on the updating evaluation network to obtain an optimized strategy network;
inputting the local positioning result into the optimizing strategy network to obtain an updating action value;
and obtaining an updating action according to the updating action value.
Optionally, the electronic device is further configured to:
according to the updating action, an action adjustment direction and an action adjustment displacement are obtained;
and calculating according to the action adjustment direction, the action adjustment displacement and the local positioning result to obtain an accurate positioning result.
In another aspect, an electronic device is provided that includes a processor and a memory having at least one instruction stored therein that is loaded and executed by the processor to implement the reinforcement learning compensation filtering-based rescue co-location method described above.
In another aspect, a computer readable storage medium having stored therein at least one instruction loaded and executed by a processor to implement a reinforcement learning compensation filtering based rescue co-location method as described above is provided.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
the invention provides a rescue co-location method based on reinforcement learning compensation filtering, which carries out initial position estimation according to observed data and locates by using an extended Kalman filtering method. Taking the residual error of the observation prediction data and the real observation data as the input of the reinforcement learning network. Training a network through a reinforcement learning method to obtain a compensation gain value, and further correcting an estimated result of the EKF; and training and compensating Kalman gain matrix optimization by adopting a reinforcement learning method to obtain a local positioning result of each micro unmanned aerial vehicle. The method has the advantages that the overall errors of the rescue system of the miniature unmanned aerial vehicle are distributed, the lazy phenomenon among intelligent agents is effectively avoided, the reward function based on the positioning error is maximized, the positioning error is optimized, and the overall errors of the system are minimized. And through cooperation and information sharing among the micro unmanned aerial vehicles, the positioning precision of each intelligent agent is further improved. The rescue co-location method based on reinforcement learning compensation filtering is high in accuracy and strong in robustness.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a rescue co-location method based on reinforcement learning compensation filtering provided by an embodiment of the invention;
FIG. 2 is a block diagram of a rescue co-location system based on reinforcement learning compensation filtering provided by an embodiment of the invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions and the advantages clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
The embodiment of the invention provides a rescue co-location method based on reinforcement learning compensation filtering, which can be realized by electronic equipment, wherein the electronic equipment can be a terminal or a server. A rescue co-location method flowchart based on reinforcement learning compensation filtering as shown in fig. 1, the process flow of the method may include the following steps:
s1, data acquisition is carried out through the micro unmanned aerial vehicle, and self-information and observation information are obtained.
In a possible embodiment, in the present invention, the micro unmanned aerial vehicle acquires self-information and observation information through sensing devices such as Ultra-Wideband (UWB), an inertial measurement unit (Inertial Measurement Unit, IMU) and a vision sensor (Vision Sensor), and these serve as the input data of each agent in the rescue system.
Wherein the self-information includes the linear acceleration a of the IMU, the angular velocity ω, and the true position m of each landmark (its x position, y position and corresponding identifier); the observation information includes the Euclidean distance d between a landmark and the agent, the relative direction angle φ between the landmark and the agent, and the correspondence c matching a measured value to a certain landmark.
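To make the data flow concrete, the following is a minimal sketch of how the per-agent input described above could be organized; the container names and types are illustrative assumptions, and only the quantities themselves (linear acceleration a, angular velocity ω, landmark position m, distance d, direction angle φ, correspondence c) follow this description.

from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class SelfInfo:
    # self-information collected by the micro unmanned aerial vehicle (IMU-based)
    a: Tuple[float, float]                     # linear acceleration measured by the IMU
    omega: float                               # angular velocity measured by the IMU
    landmarks: Dict[int, Tuple[float, float]]  # landmark identifier -> true position m = (m_x, m_y)

@dataclass
class Observation:
    # one landmark observation made by the agent
    d: float    # Euclidean distance between the landmark and the agent
    phi: float  # relative direction angle between the landmark and the agent
    c: int      # correspondence: identifier of the landmark this measurement matches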
S2, obtaining preliminary position estimation through an extended Kalman filtering algorithm according to the self information and the observation information.
Optionally, obtaining the preliminary position estimate through an extended Kalman filter algorithm according to the self information and the observation information includes:
calculating according to the self-information and the observation information to obtain a priori estimated value;
and updating the prior estimated value according to the prior estimated value and the observation information to obtain the preliminary position estimation.
In a practical implementation, a precondition of the extended Kalman filter algorithm (Extended Kalman Filter, EKF) is that position information is acquired and marked through communication between the micro unmanned aerial vehicles, which avoids having to match the acquired information data to the corresponding micro unmanned aerial vehicle afterwards and thereby reduces computation and time overhead. During motion, each agent updates and propagates its own state according to its own motion model, and acquires the observations of neighbouring agents through inter-agent communication. The positioning estimate is corrected using the difference between the predicted observation and the actual observation, continuously reducing the positioning error so that the estimate gradually approaches the true position.
The IMU sensor is used to acquire the motion information of the micro unmanned aerial vehicle starting from the initial position. From the linear acceleration a and angular velocity ω provided by the IMU and the agent state x̂_{t-1} at the previous moment, the state x̂_t⁻ at the current moment and its covariance matrix P_t⁻ are predicted, giving the prior estimate. The motion model updated with the IMU sensor data takes the standard EKF prediction form shown in formulas (1) and (2):
x̂_t⁻ = f(x̂_{t-1}, u_t)     (1)
P_t⁻ = F_t P_{t-1} F_tᵀ + V     (2)
wherein the self-information u_t = (a, ω), taken as the actual odometry input of the micro unmanned aerial vehicle, already contains a noise term, which is zero-mean uncorrelated Gaussian white noise with covariance V; V is the process noise covariance matrix, whose values conform to a Gaussian distribution. F_t denotes the Jacobian matrix of the motion function f with respect to the state, implemented as shown in formula (3):
F_t = ∂f(x̂_{t-1}, u_t) / ∂x     (3)
after the odometer motion update, the uncertainty of the micro-drone state is always amplified due to the accumulated error of the IMU. The present invention utilizes feature measurements to reduce this uncertainty. In this framework, the measurement of each micro-unmanned aerial vehicle is performed in the form of distance and direction, and the input observation information can be expressed as
Obtaining corresponding ranging information comprising deflection angle and distance information by observing known landmarks, finally selecting three pieces of observation information with minimum distance from the micro unmanned aerial vehicle in an observation range as observation data input, and fusing the measurement valuesFor correcting the motion update phase resultsIs->Obtain->Is->The measurement update procedure is shown in the following formulas (4), (5), (6) and (7):
(4)
(5)
(6)
(7)
wherein,is representative of the sensor covariance matrix, < >>Real measurement data obtained by the sensor, denoted as time t,/>For measuring the function values, the current position can be estimated by means of a measurement model>And converting into a measurement form corresponding to the distance direction of the problem scene, and obtaining a measurement predicted value. />Is an observation model matrix ofNonlinear measurement function->The mathematical expressions of the jacobian matrix are shown as the following formulas (8) and (9):
(8)
(9)
the measured prediction value is compared with the actual observation value of the sensor. Wherein the measurement function is represented by the following formulas (10), (11):
(10)
(11)
wherein,、/>、/>represents the true +.>Position, & gt>Location and corresponding identification. In the measuring process, combining landmark information set by the environment and +.>、/>The coordinates are respectively calculated to obtain difference values, euclidean distance solution is carried out, and yaw angle difference solution is carried out, so that a model predicted value +.>. The preliminary position estimation of each intelligent agent can be obtained through the calculation
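For illustration only, the following Python sketch implements the prediction and measurement-update steps of formulas (1)-(7) and the range-bearing measurement model of formulas (8)-(11) for a planar state (x, y, θ). The simple acceleration-to-velocity motion model and the covariances V and Q passed in are assumptions, not the exact models of the invention.

import numpy as np

def ekf_predict(x, P, u, dt, V):
    # Prediction step, formulas (1)-(3): propagate the state and covariance with the odometry input.
    # x = [px, py, theta]; u = (a, omega) is the noisy IMU/odometry input (a is an illustrative
    # scalar forward acceleration turned into a velocity increment).
    a, omega = u
    px, py, theta = x
    v = a * dt
    x_pred = np.array([px + v * dt * np.cos(theta),
                       py + v * dt * np.sin(theta),
                       theta + omega * dt])
    F = np.array([[1.0, 0.0, -v * dt * np.sin(theta)],   # Jacobian of the motion model, formula (3)
                  [0.0, 1.0,  v * dt * np.cos(theta)],
                  [0.0, 0.0,  1.0]])
    P_pred = F @ P @ F.T + V
    return x_pred, P_pred

def ekf_update(x_pred, P_pred, z, landmark, Q):
    # Measurement update, formulas (4)-(11): fuse one range-bearing observation of a known landmark.
    # z = (d, phi); landmark = (m_x, m_y).
    px, py, theta = x_pred
    m_x, m_y = landmark
    dx, dy = m_x - px, m_y - py
    d = np.hypot(dx, dy)                          # formula (10): Euclidean distance
    phi = np.arctan2(dy, dx) - theta              # formula (11): bearing as a yaw-angle difference
    z_hat = np.array([d, phi])                    # measurement prediction h(x_pred)
    H = np.array([[-dx / d,    -dy / d,     0.0],   # formula (9): Jacobian of h
                  [ dy / d**2, -dx / d**2, -1.0]])
    y = np.asarray(z, dtype=float) - z_hat        # innovation, formula (5)
    y[1] = (y[1] + np.pi) % (2 * np.pi) - np.pi   # wrap the bearing residual to [-pi, pi]
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + Q)   # Kalman gain, formula (4)
    x = x_pred + K @ y                            # posterior state, formula (6)
    P = (np.eye(3) - K @ H) @ P_pred              # posterior covariance, formula (7)
    return x, P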
And S3, performing filtering gain compensation on the extended Kalman filtering algorithm by adopting a reinforcement learning method according to the preliminary position estimation to obtain a local positioning result.
Optionally, according to the preliminary position estimation, performing filtering gain compensation on the extended Kalman filtering algorithm by adopting a reinforcement learning method to obtain a local positioning result, including:
carrying out parameter association on an extended Kalman filtering algorithm and a multi-layer perceptron to obtain a parameter optimization model;
obtaining Kalman filtering gain through a parameter optimization model according to the preliminary position estimation;
and calculating according to the preliminary position estimation and the Kalman filtering gain to obtain a local positioning result.
In a possible embodiment, the preliminary position estimate x̂_t of each agent calculated in the above step is used as the input data for the compensated Kalman gain matrix optimization. The extended Kalman filter (EKF) is represented as a dynamic Markov decision process (Markov Decision Process, MDP). The parameter variables in the reinforcement learning environment are associated with the components of the EKF, and the reinforcement learning method is incorporated into the design of the EKF filter. In this way, deep reinforcement learning techniques are utilized to optimize the performance of the EKF so that it better accommodates uncertainty and complex environments.
In the PPO algorithm used in the present invention, the actor and the critic are each composed of several fully-connected multi-layer perceptrons (Multi-Layer Perceptron, MLP), and the rectified linear unit (Rectified Linear Unit, ReLU) is used as the activation function.
In the present invention, the state and action spaces are continuous, so the MLP is selected as the network structure. A typical MLP includes an input layer, hidden layers and an output layer; each neuron layer is connected to the previous layer and the following layer, receiving input from the previous layer and passing it to the next layer. The influence of the input is adjusted through the weight values, and the weighted sum is converted into an output value by the activation function. Correspondingly, the network method of the invention can be generalized to other network structures or methods, through replacement or structural change, to achieve the same estimation target.
For the policy network, the input and output sizes of the MLP depend on the state and on the size of the action vector, respectively. The size of the hidden layers should be chosen according to the complexity of the problem in practical applications. In the present invention, taking an MLP structure including two hidden layers as an example, its mathematical expression is given by formula (12):
π_θ(s) = W₃ · σ(W₂ · σ(W₁ · s))     (12)
wherein s is the input state, W₁, W₂, W₃ represent the weights of each layer, and the function σ represents the ReLU activation function added after each layer. ReLU, defined as σ(x) = max(0, x), provides advantages for the back-propagation process.
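As a minimal sketch of formula (12), the two-hidden-layer policy MLP could be written with PyTorch as follows; the hidden width of 64 is an illustrative assumption.

import torch
import torch.nn as nn

class PolicyMLP(nn.Module):
    # pi(s) = W3 * relu(W2 * relu(W1 * s)), as in formula (12)
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)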
The final EKF update formula can thus be written in MDP form, as shown in formulas (13) and (14):
x̂_{t+1} = f(x̂_t, u_{t+1}) + K_{t+1} ( z_{t+1} − h(f(x̂_t, u_{t+1})) )     (13)
s_{t+1} ~ P(s_{t+1} | s_t, z_{t+1}, K_{t+1})     (14)
wherein the state at the next moment is related only to the state x̂_t at the current moment, the measurement z_{t+1} and the Kalman gain value K_{t+1}. In the EKF, the Kalman gain is calculated through the measurement innovation, which essentially measures the difference between the true state x_t and the estimated state x̂_t.
In the process of offline training of the deep reinforcement learning, the real state x_t of each step is assumed to be known; the mapping from x̂_t to x_t can then be modelled as a nonlinear function. By regarding this nonlinear mapping function as the reinforcement learning strategy π and taking the Kalman gain K as the reinforcement learning action, the compensation gain value is trained by the reinforcement learning method. The RL state estimator structure in the MDP tuple (S, A, P, R) can be represented as follows: the state S is the estimated state x̂_t; the system transition probability is P(s_{t+1} | s_t, a_t), where the action value a_t means the compensation gain value of the filter.
The action A is the action matrix output by the network, its value range is [−0.1, 0.1], and the action matrix is combined with the residual between the observed data and the model prediction data to further correct the positioning estimate. The reward function R (Reward), which measures the value of a state-action pair, uses the mean square error between the true position and the estimated position in the position estimation. The strategy π uses an MLP approximator as the mapping from the estimated state value to the compensation gain value (action value).
For the sake of a clearer description, the EKF estimation result is defined as x̂_t and the reinforcement learning compensation result is defined as x̃_t, i.e. the final result of the local positioning. The process of reinforcement learning compensation can be expressed as formulas (15), (16) and (17):
δ_t = z_t − h(x̂_t)     (15)
K′_t = π_θ(δ_t)     (16)
x̃_t = x̂_t + K′_t δ_t     (17)
wherein x̃_t is the local positioning result, x̂_t is the preliminary position estimation result, z_t is the observation data, and K′_t is the compensation gain value of the filter, i.e. the action matrix output by the reinforcement learning network. The network reward function is set as formula (18):
R_t = −‖x_t − x̃_t‖²     (18)
where x_t is the true position, so the reward is the negative mean square error between the true position and the estimated position.
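A minimal sketch of the compensation step of formulas (15)-(18), assuming the trained policy maps the observation residual to a compensation gain matrix whose entries lie in [-0.1, 0.1] (the action range stated above) and that the true position is available for the reward during offline training; the function and variable names are illustrative.

import numpy as np

def compensate_estimate(x_ekf, z, z_pred, policy, x_true=None):
    # One reinforcement-learning compensation step.
    # x_ekf: EKF estimate; z: real observation; z_pred: predicted observation h(x_ekf);
    # policy: callable mapping the residual to the compensation gain matrix (the RL action).
    delta = np.asarray(z, dtype=float) - np.asarray(z_pred, dtype=float)   # formula (15): residual
    K_comp = np.clip(policy(delta), -0.1, 0.1)                             # formula (16): compensation gain
    x_local = np.asarray(x_ekf, dtype=float) + K_comp @ delta              # formula (17): local positioning result
    reward = None
    if x_true is not None:                                                 # formula (18): negative mean square error
        reward = -float(np.mean((np.asarray(x_true, dtype=float) - x_local) ** 2))
    return x_local, reward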
and S4, updating the central evaluation network according to the local positioning result and a preset strategy network to obtain an updated evaluation network.
Optionally, updating the central evaluation network according to the local positioning result and a preset policy network to obtain an updated evaluation network, including:
inputting the local positioning result into a preset strategy network to obtain an unmanned aerial vehicle action value;
obtaining corresponding actions according to the unmanned aerial vehicle action values;
and updating the central evaluation network based on the observation information and the corresponding actions to obtain an updated evaluation network.
In a possible implementation, in order to further refine the local positioning result obtained by the local positioning estimation algorithm, the invention provides a global cooperative positioning estimation algorithm based on credit-assignment multi-agent reinforcement learning. The inter-agent distance information Obs and the local positioning result x̃ of each agent calculated in the previous step are used as the input of the reinforcement learning network, and the actor-critic network is used to train the optimization strategy.
The algorithm utilizes the credit distribution network to distribute the overall errors of the multi-agent system, effectively avoids the lazy phenomenon among intelligent agents, maximizes the reward function based on the positioning errors, optimizes the positioning errors of all the intelligent agents, and minimizes the overall errors.
Based on the algorithm structure of centralized training distributed execution, a multi-agent reinforcement learning global co-location algorithm based on credit allocation is provided. And learning from the communication observation values, actual distances and residual errors of azimuth values among the intelligent agents by processing the observation data and training strategies among the intelligent agents so as to optimize the local positioning result. The goal of the algorithm is to learn an action value, and by performing the action, adjust the local positioning result so that the final result is closer to the true value.
The elements involved in the reinforcement learning process include the following. The observation (Obs O) includes the positioning residual of the current agent, the distances to the other agents, and the angle information; the local observation information is input to the actor network (Actor) of each agent, and the global observation information is input to the central evaluator network (Critic). The action space (Action A) is defined as a discrete space consisting of 8 action values used to adjust the correction direction of the current agent. The reward (Reward R) denotes a global reward function: the larger the reward value, the better the current positioning effect, i.e. the reward is opposite in sign to the positioning error. With γ as the discount factor, the mathematical expression is as follows (19):
R = Σ_t γ^t r_t,  where r_t = −Σ_i ‖p_i,t − p̂_i,t‖² is the negative sum of the agents' positioning errors at time t     (19)
in order to solve the credit allocation problem in the rescue activity collaboration problem, a method of independently allocating the reward value can be adopted, so that each micro unmanned aerial vehicle can know the return obtained by the action taken by the micro unmanned aerial vehicle. The invention mainly adopts a method of the inverse fact rule to obtain the dominant value of the individual.
Each micro-drone has a policy network, such as a recurrent neural network (Recurrent Neural Network, RNN). In each time step, each individual makes use of its local observationsFor input, actions are selected through the own policy network. By aggregating the action values of all agents, a joint action can be obtained. Furthermore, all individuals share an evaluator network, i.e. Q network, for calculating the action value functions of the respective agents. The evaluator network calculates the Q value of a joint action to approximately represent the global prize value.
However, the individual contribution to the global rewards is different. Thus, it is necessary to estimate the actual contribution of each agent based on the global prize value and calculate an independent return value for each agent. This independent return value->For constantly iterating its own policy network to optimize its action selection. The mathematical expression of the independent report values is shown as the following formula (20):
(20)
wherein,can be understood as taking action +.>Is +.>Better or worse. During subsequent training, efforts are made to maximize +.>I.e. in maximizing global rewards. />Representing a unionThe action taken at this moment is removed from the actions, +.>Representing the joint action value after taking the default action. s is the current individual state of the micro unmanned aerial vehicle, < >>
If all actions are to be calculatedValues, each action typically needs to be replaced with a default action to interact with the environment, and then a corresponding utility value is calculated. However, this approach has two problems: firstly, the number of times of calculation is excessive, and the complexity of calculation is increased; secondly, it is not easy to determine which action is selected as the default action. The present invention therefore proposes an approximation method that approximates the average utility value of all possible actions taken to the utility value obtained by the default action, the mathematical expression of the utility value being shown in the following formula (21):
(21)
thenThe calculation of (2) is equivalent to +.>The mathematical expression is shown in the following formula (22):
(22)
based on independent return valuesSelect action->As an input to this stepOne of the data is entered. Network parameters->、/>、/>Discount factor->Maximum number of iteration rounds->Training round number->Equal input parameters are central evaluation network (+)>Network) initializing network parameters.
In the network of the method of the invention, a central evaluation network is adoptedNetwork) by calculating +_for each agent using differential Error (Time Difference Error, TD-Error)>Updating the evaluation network with the value, wherein the loss function of the network is represented by the following formulas (23), (24):
(23)
(24)
the mathematical expression of the parameter update method of the evaluation network is shown in the following formula (25):
(25)
the preset strategy network is updated based on strategy gradients, and a gradient calculation formula is shown as the following formula (26):
(26)
finally, the updated evaluation network parameters are obtained through calculationAnd policy network parameters->
S5, according to the local positioning result, updating the evaluation network and a preset strategy network to obtain an updating action.
Optionally, according to the local positioning result, the updating action is obtained by updating the evaluation network and the preset policy network, including:
optimizing a preset strategy network based on an updating evaluation network to obtain an optimized strategy network;
inputting the local positioning result into an optimization strategy network to obtain an updated action value;
and obtaining the updating action according to the updating action value.
In a possible implementation manner, in the central evaluation network optimization process, each agent is continuously and iteratively updated with its own policy network, and an action is selected as the output of this step according to the updated policy network.
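The credit-assignment and network-update rules of formulas (20)-(26) can be sketched as follows, using the approximation of formula (21): the counterfactual baseline of an agent averages the central critic's value over that agent's alternative actions while the other agents' actions stay fixed, the critic is trained on the squared TD error, and each policy follows the gradient weighted by its own advantage. Tensor shapes and the surrounding training loop are assumptions for illustration.

import torch
import torch.nn.functional as F

def counterfactual_advantage(q_values, own_action):
    # Formulas (20)-(22): advantage of one agent via a counterfactual baseline.
    # q_values: 1-D tensor holding Q(s, (a^{-i}, a'^i)) for every alternative action a'^i
    # of this agent, with the other agents' actions held fixed; own_action: index actually taken.
    q_taken = q_values[own_action]        # Q(s, a) for the joint action actually taken
    baseline = q_values.mean()            # formula (21): average over this agent's actions
    return q_taken - baseline             # formula (22)

def critic_loss(q_taken, reward, q_next, gamma):
    # Formulas (23)-(24): squared TD error of the central evaluation (Q) network.
    td_target = reward + gamma * q_next.detach()
    return F.mse_loss(q_taken, td_target)

def policy_loss(log_prob_taken, advantage):
    # Formula (26): policy-gradient objective for one agent (maximize advantage * log pi).
    return -(log_prob_taken * advantage.detach()).mean()

# Usage sketch, matching the parameter updates of formula (25):
#   critic_opt.zero_grad(); critic_loss(...).backward(); critic_opt.step()
#   actor_opt.zero_grad();  policy_loss(...).backward();  actor_opt.step()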
S6, correcting the local positioning result according to the updating action to obtain an accurate positioning result.
Optionally, correcting the local positioning result according to the updating action to obtain an accurate positioning result, including:
according to the updating action, an action adjustment direction and an action adjustment displacement are obtained;
and calculating according to the action adjustment direction, the action adjustment displacement and the local positioning result to obtain an accurate positioning result.
In a possible implementation, in this step, the result x̃_t of correcting the local positioning with the global cooperative positioning method based on multi-agent reinforcement learning and the action a_t selected by each agent according to its updated policy network are used as input data, and an estimation result with further improved accuracy is obtained, i.e. the final estimate of the agent position at time t. The accurate positioning result is given by formula (27):
x̂*_t = x̃_t + Δ_t · e(a_t)     (27)
wherein a_t is the action value given at time t, indicating the adjustment direction, e(a_t) is the unit vector of that direction, and Δ_t represents the magnitude of the adjustment displacement.
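A minimal sketch of the correction of formula (27), assuming the 8 discrete actions are mapped to adjustment directions spaced 45° apart; the actual action-to-direction mapping and the displacement magnitude are illustrative assumptions.

import numpy as np

def apply_correction(x_local, action_idx, step):
    # Formula (27): shift the local positioning result along the direction chosen by the
    # discrete action (one of 8 directions) by the displacement magnitude `step`.
    angle = action_idx * (2.0 * np.pi / 8.0)
    direction = np.array([np.cos(angle), np.sin(angle)])
    return np.asarray(x_local, dtype=float)[:2] + step * direction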
Finally, the most effective strategy under the scene is learned by continuously iterating the process, and a position estimation result with further improved precision is obtained through the strategy.
And S7, carrying out rescue route planning according to the accurate positioning result.
In one possible implementation, the searcher has very limited prior knowledge of the surrounding environment during the rescue incident. The micro unmanned aerial vehicle is used for collaborative search, accurate positioning information of the target can be obtained, rescue route planning is conducted on the rescue target, and continuous positioning and tracking are achieved, so that rescue actions are further developed.
The invention provides a rescue co-location method based on reinforcement learning compensation filtering, which performs initial position estimation according to the observed data and positions the agents using an extended Kalman filtering method. The residual between the predicted observation data and the real observation data is taken as the input of the reinforcement learning network. The network is trained by a reinforcement learning method to obtain a compensation gain value, which further corrects the estimation result of the EKF; the compensated Kalman gain matrix optimization is trained by a reinforcement learning method to obtain the local positioning result of each micro unmanned aerial vehicle. The overall error of the micro unmanned aerial vehicle rescue system is allocated among the agents, the lazy-agent phenomenon is effectively avoided, the reward function based on the positioning error is maximized, the positioning error is optimized, and the overall error of the system is minimized. Through cooperation and information sharing among the micro unmanned aerial vehicles, the positioning accuracy of each agent is further improved. The rescue co-location method based on reinforcement learning compensation filtering has high accuracy and strong robustness.
FIG. 2 is a block diagram illustrating a rescue co-location system based on reinforcement learning compensation filtering, according to an example embodiment. Referring to fig. 2, the system is applied to implement a rescue co-location method based on reinforcement learning compensation filtering, and the system includes a micro unmanned aerial vehicle 210 and an electronic device 220, wherein:
the micro unmanned aerial vehicle 210 is configured to acquire self-information and observation information through data acquisition of the micro unmanned aerial vehicle;
the electronic device 220 is configured to obtain a preliminary position estimate according to the self-information and the observation information by extending a kalman filtering algorithm; according to the preliminary position estimation, performing filtering gain compensation on an extended Kalman filtering algorithm by adopting a reinforcement learning method to obtain a local positioning result; updating the central evaluation network according to the local positioning result and a preset strategy network to obtain an updated evaluation network; according to the local positioning result, an updating action is obtained through updating the evaluation network and a preset strategy network; correcting the local positioning result according to the updating action to obtain an accurate positioning result; and carrying out rescue route planning according to the accurate positioning result.
Optionally, the electronic device 220 is further configured to:
calculating according to the self-information and the observation information to obtain a priori estimated value;
and updating the prior estimated value according to the prior estimated value and the observation information to obtain the preliminary position estimation.
Optionally, the electronic device 220 is further configured to:
carrying out parameter association on an extended Kalman filtering algorithm and a multi-layer perceptron to obtain a parameter optimization model;
obtaining Kalman filtering gain through a parameter optimization model according to the preliminary position estimation;
and calculating according to the preliminary position estimation and the Kalman filtering gain to obtain a local positioning result.
Optionally, the electronic device 220 is further configured to:
inputting the local positioning result into a preset strategy network to obtain an unmanned aerial vehicle action value;
obtaining corresponding actions according to the unmanned aerial vehicle action values;
and updating the central evaluation network based on the observation information and the corresponding actions to obtain an updated evaluation network.
Optionally, the electronic device 220 is further configured to:
optimizing a preset strategy network based on an updating evaluation network to obtain an optimized strategy network;
inputting the local positioning result into an optimization strategy network to obtain an updated action value;
and obtaining the updating action according to the updating action value.
Optionally, the electronic device 220 is further configured to:
according to the updating action, an action adjustment direction and an action adjustment displacement are obtained;
and calculating according to the action adjustment direction, the action adjustment displacement and the local positioning result to obtain an accurate positioning result.
The invention provides a rescue co-location method based on reinforcement learning compensation filtering, which performs initial position estimation according to the observed data and positions the agents using an extended Kalman filtering method. The residual between the predicted observation data and the real observation data is taken as the input of the reinforcement learning network. The network is trained by a reinforcement learning method to obtain a compensation gain value, which further corrects the estimation result of the EKF; the compensated Kalman gain matrix optimization is trained by a reinforcement learning method to obtain the local positioning result of each micro unmanned aerial vehicle. The overall error of the micro unmanned aerial vehicle rescue system is allocated among the agents, the lazy-agent phenomenon is effectively avoided, the reward function based on the positioning error is maximized, the positioning error is optimized, and the overall error of the system is minimized. Through cooperation and information sharing among the micro unmanned aerial vehicles, the positioning accuracy of each agent is further improved. The rescue co-location method based on reinforcement learning compensation filtering has high accuracy and strong robustness.
Fig. 3 is a schematic structural diagram of an electronic device 300 according to an embodiment of the present invention, where the electronic device 300 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 301 and one or more memories 302, where at least one instruction is stored in the memories 302, and the at least one instruction is loaded and executed by the processors 301 to implement the steps of the rescue co-location method based on reinforcement learning compensation filtering.
In an exemplary embodiment, a computer readable storage medium, such as a memory comprising instructions executable by a processor in a terminal to perform the above-described rescue co-location method based on reinforcement learning compensation filtering is also provided. For example, the computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (3)

1. Rescue co-location method based on reinforcement learning compensation filtering is characterized by comprising the following steps:
acquiring data through a micro unmanned aerial vehicle to obtain self-information and observation information;
wherein the self-information includes the linear acceleration a of the inertial measurement unit, the angular velocity ω and the true position m of the landmark; the observation information comprises the Euclidean distance d between the landmark and the agent, the relative direction angle φ between the landmark and the agent, and the correspondence c matching a measured value to a certain landmark;
According to the self information and the observation information, obtaining preliminary position estimation through an extended Kalman filtering algorithm;
wherein the three pieces of observation information with the smallest distance to the micro unmanned aerial vehicle within the observation range are selected as the observation data input and are used to correct the result x̂_t⁻ obtained in the motion update stage, obtaining a preliminary position estimate x̂_t; said x̂_t⁻ is the current position estimate;
according to the preliminary position estimation, performing filtering gain compensation on the extended Kalman filtering algorithm by adopting a reinforcement learning method to obtain a local positioning result;
the filtering gain compensation is performed on the extended Kalman filtering algorithm by adopting a reinforcement learning method according to the preliminary position estimation, so as to obtain a local positioning result, which comprises the following steps:
carrying out parameter association on the extended Kalman filtering algorithm and the multi-layer perceptron to obtain a parameter optimization model;
obtaining Kalman filtering gain through the parameter optimization model according to the preliminary position estimation;
calculating according to the preliminary position estimation and the Kalman filtering gain to obtain a local positioning result;
wherein the parameter optimization model is represented by the following formulas (15), (16) and (17):
δ_t = z_t − h(x̂_t)     (15)
K′_t = π_θ(δ_t)     (16)
x̃_t = x̂_t + K′_t δ_t     (17)
wherein x̃_t is the local positioning result; x̂_t is the preliminary position estimation result; z_t is the observation data; and K′_t is the compensation gain value of the filter;
updating the central evaluation network according to the local positioning result and a preset strategy network to obtain an updated evaluation network;
the updating the central evaluation network according to the local positioning result and the preset strategy network to obtain an updated evaluation network comprises the following steps:
inputting the local positioning result into a preset strategy network to obtain an unmanned aerial vehicle action value;
obtaining corresponding actions according to the unmanned aerial vehicle action values;
updating the central evaluation network based on the observation information and the corresponding action to obtain an updated evaluation network;
the preset strategy network is an action decision network of each micro unmanned aerial vehicle; the central evaluation network is an evaluator network shared by all the micro unmanned aerial vehicles;
according to the local positioning result, obtaining an updating action through the updating evaluation network and a preset strategy network;
wherein, according to the local positioning result, the updating action is obtained through the updating evaluation network and a preset strategy network, including:
optimizing a preset strategy network based on the updating evaluation network to obtain an optimized strategy network;
inputting the local positioning result into the optimizing strategy network to obtain an updating action value;
obtaining an updating action according to the updating action value;
correcting the local positioning result according to the updating action to obtain an accurate positioning result;
and carrying out rescue route planning according to the accurate positioning result.
2. The rescue co-location method based on reinforcement learning compensation filtering according to claim 1, wherein the correcting the local location result according to the updating action to obtain an accurate location result comprises:
according to the updating action, an action adjustment direction and an action adjustment displacement are obtained;
and calculating according to the action adjustment direction, the action adjustment displacement and the local positioning result to obtain an accurate positioning result.
3. The rescue co-location system based on the reinforcement learning compensation filtering is characterized by being used for realizing a rescue co-location method based on the reinforcement learning compensation filtering, and comprises a micro unmanned aerial vehicle and electronic equipment, wherein:
the micro unmanned aerial vehicle is used for collecting data through the micro unmanned aerial vehicle to obtain self-information and observation information;
wherein the self-information includes the linear acceleration a of the inertial measurement unit, the angular velocity ω and the true position m of the landmark; the observation information comprises the Euclidean distance d between the landmark and the agent, the relative direction angle φ between the landmark and the agent, and the correspondence c matching a measured value to a certain landmark;
The electronic equipment is used for obtaining preliminary position estimation through an extended Kalman filtering algorithm according to the self information and the observation information; according to the preliminary position estimation, performing filtering gain compensation on the extended Kalman filtering algorithm by adopting a reinforcement learning method to obtain a local positioning result; updating the central evaluation network according to the local positioning result and a preset strategy network to obtain an updated evaluation network; according to the local positioning result, obtaining an updating action through the updating evaluation network and a preset strategy network; correcting the local positioning result according to the updating action to obtain an accurate positioning result; carrying out rescue route planning according to the accurate positioning result;
wherein the three pieces of observation information with the smallest distance to the micro unmanned aerial vehicle within the observation range are selected as the observation data input and are used to correct the result x̂_t⁻ obtained in the motion update stage, obtaining a preliminary position estimate x̂_t; said x̂_t⁻ is the current position estimate;
the filtering gain compensation is performed on the extended Kalman filtering algorithm by adopting a reinforcement learning method according to the preliminary position estimation, so as to obtain a local positioning result, which comprises the following steps:
carrying out parameter association on the extended Kalman filtering algorithm and the multi-layer perceptron to obtain a parameter optimization model;
obtaining Kalman filtering gain through the parameter optimization model according to the preliminary position estimation;
calculating according to the preliminary position estimation and the Kalman filtering gain to obtain a local positioning result;
wherein the parameter optimization model is represented by the following formulas (15), (16) and (17):
δ_t = z_t − h(x̂_t)     (15)
K′_t = π_θ(δ_t)     (16)
x̃_t = x̂_t + K′_t δ_t     (17)
wherein x̃_t is the local positioning result; x̂_t is the preliminary position estimation result; z_t is the observation data; and K′_t is the compensation gain value of the filter;
the updating the central evaluation network according to the local positioning result and the preset strategy network to obtain an updated evaluation network comprises the following steps:
inputting the local positioning result into a preset strategy network to obtain an unmanned aerial vehicle action value;
obtaining corresponding actions according to the unmanned aerial vehicle action values;
updating the central evaluation network based on the observation information and the corresponding action to obtain an updated evaluation network;
the preset strategy network is an action decision network of each micro unmanned aerial vehicle; the central evaluation network is an evaluator network shared by all the micro unmanned aerial vehicles;
wherein, according to the local positioning result, the updating action is obtained through the updating evaluation network and a preset strategy network, including:
optimizing a preset strategy network based on the updating evaluation network to obtain an optimized strategy network;
inputting the local positioning result into the optimizing strategy network to obtain an updating action value;
and obtaining an updating action according to the updating action value.
CN202311537570.XA 2023-11-17 2023-11-17 Rescue co-location method and system based on reinforcement learning compensation filtering Active CN117271967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311537570.XA CN117271967B (en) 2023-11-17 2023-11-17 Rescue co-location method and system based on reinforcement learning compensation filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311537570.XA CN117271967B (en) 2023-11-17 2023-11-17 Rescue co-location method and system based on reinforcement learning compensation filtering

Publications (2)

Publication Number Publication Date
CN117271967A CN117271967A (en) 2023-12-22
CN117271967B true CN117271967B (en) 2024-02-13

Family

ID=89216349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311537570.XA Active CN117271967B (en) 2023-11-17 2023-11-17 Rescue co-location method and system based on reinforcement learning compensation filtering

Country Status (1)

Country Link
CN (1) CN117271967B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112435275A (en) * 2020-12-07 2021-03-02 中国电子科技集团公司第二十研究所 Unmanned aerial vehicle maneuvering target tracking method integrating Kalman filtering and DDQN algorithm
CN113128705A (en) * 2021-03-24 2021-07-16 北京科技大学顺德研究生院 Intelligent agent optimal strategy obtaining method and device
CN116339321A (en) * 2023-02-28 2023-06-27 浙江大学 Global information driven distributed multi-robot reinforcement learning formation surrounding method based on 5G communication
CN116643586A (en) * 2023-05-22 2023-08-25 中国人民解放军国防科技大学 Complex scene-oriented multi-robot collaborative reconnaissance method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11657266B2 (en) * 2018-11-16 2023-05-23 Honda Motor Co., Ltd. Cooperative multi-goal, multi-agent, multi-stage reinforcement learning
CN112766329B (en) * 2021-01-06 2022-03-22 上海大学 Multi-unmanned-boat cooperative interception control method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112435275A (en) * 2020-12-07 2021-03-02 中国电子科技集团公司第二十研究所 Unmanned aerial vehicle maneuvering target tracking method integrating Kalman filtering and DDQN algorithm
CN113128705A (en) * 2021-03-24 2021-07-16 北京科技大学顺德研究生院 Intelligent agent optimal strategy obtaining method and device
CN116339321A (en) * 2023-02-28 2023-06-27 浙江大学 Global information driven distributed multi-robot reinforcement learning formation surrounding method based on 5G communication
CN116643586A (en) * 2023-05-22 2023-08-25 中国人民解放军国防科技大学 Complex scene-oriented multi-robot collaborative reconnaissance method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on UAV positioning correction and routing in GNSS-jamming environments; 罗浩航; Wanfang Dissertation Database *
A multi-agent deep reinforcement learning algorithm based on reward-filtering credit assignment; 徐诚 et al.; Chinese Journal of Computers; Vol. 45, No. 11; pp. 2306-2320 *

Also Published As

Publication number Publication date
CN117271967A (en) 2023-12-22

Similar Documents

Publication Publication Date Title
Dai et al. An INS/GNSS integrated navigation in GNSS denied environment using recurrent neural network
CN109597864B (en) Method and system for real-time positioning and map construction of ellipsoid boundary Kalman filtering
Indelman et al. Factor graph based incremental smoothing in inertial navigation systems
Ryan et al. Particle filter based information-theoretic active sensing
CN110118560A (en) A kind of indoor orientation method based on LSTM and Multi-sensor Fusion
CN113074739B (en) UWB/INS fusion positioning method based on dynamic robust volume Kalman
CN108896047B (en) Distributed sensor network collaborative fusion and sensor position correction method
Hasberg et al. Simultaneous localization and mapping for path-constrained motion
Yousuf et al. Information fusion of GPS, INS and odometer sensors for improving localization accuracy of mobile robots in indoor and outdoor applications
CN109188352B (en) Combined navigation relative positioning method
Zhao et al. Fusing vehicle trajectories and gnss measurements to improve gnss positioning correction based on actor-critic learning
CN117271967B (en) Rescue co-location method and system based on reinforcement learning compensation filtering
CN113029173A (en) Vehicle navigation method and device
CN117521006A (en) Factor graph multi-source information fusion method based on incremental learning
CN117369507A (en) Unmanned aerial vehicle dynamic path planning method of self-adaptive particle swarm algorithm
Liu et al. Navigation algorithm based on PSO-BP UKF of autonomous underwater vehicle
CN114339595B (en) Ultra-wide band dynamic inversion positioning method based on multi-model prediction
Sun et al. An effective LS-SVM/AKF aided SINS/DVL integrated navigation system for underwater vehicles
Poddar et al. Tuning of GPS aided attitude estimation using evolutionary algorithms
Xin et al. Stable positioning for mobile targets using distributed fusion correction strategy of heterogeneous data
Abdolkarimi et al. A modified neuro-fuzzy system for accuracy improvement of low-cost MEMS-Based INS/GPS navigation system
Mohammadi et al. Designing INS/GNSS integrated navigation systems by using IPO algorithms
CN114705223A (en) Inertial navigation error compensation method and system for multiple mobile intelligent bodies in target tracking
Huang et al. VariFi: Variational Inference for Indoor Pedestrian Localization and Tracking Using IMU and WiFi RSS
Chen et al. Real-time UAV Flight Path Prediction Using A Bi-directional Long Short-term Memory Network with Error Compensation [J]

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant