CN116628448B - Sensor management method based on deep reinforcement learning in extended target


Info

Publication number
CN116628448B
CN116628448B (application CN202310609986.1A)
Authority
CN
China
Prior art keywords
sensor
target
reinforcement learning
deep reinforcement
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310609986.1A
Other languages
Chinese (zh)
Other versions
CN116628448A (en)
Inventor
陈辉
张虹芸
张文旭
张新迪
田博
罗欣
缪嘉伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lanzhou University of Technology
Original Assignee
Lanzhou University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lanzhou University of Technology filed Critical Lanzhou University of Technology
Priority to CN202310609986.1A priority Critical patent/CN116628448B/en
Publication of CN116628448A publication Critical patent/CN116628448A/en
Application granted granted Critical
Publication of CN116628448B publication Critical patent/CN116628448B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Abstract

The application discloses a sensor management method based on deep reinforcement learning for an extended target, which comprises the following steps: modeling an elliptical extended target and constructing a virtual interaction environment for deep reinforcement learning from an extended target filtering algorithm; establishing a TD3 algorithm agent; letting the virtual interaction environment interact with the TD3 agent to obtain sensor control data, which are stored as samples in an experience replay pool; sampling from the experience replay pool to train the TD3 agent, and deciding the optimal sensor path planning action with the trained agent; and applying the optimal action to the sensor, obtaining the sensor position after its state transition, obtaining the sensor measurements of the extended target at the current time, and carrying out filtering prediction and update to perform tracking estimation of the extended target. The application optimizes the tracking performance of the elliptical extended target as a whole.

Description

Sensor management method based on deep reinforcement learning in extended target
Technical Field
The application relates to the technical field of intelligent sensor management, in particular to a sensor management method based on deep reinforcement learning in an extended target.
Background
Sensor management refers to the process of controlling the degrees of freedom of a sensor system to meet certain constraints, optimize certain performance metrics, and ultimately achieve a goal. With the advent of modern high-resolution sensors, a single target can produce multiple measurements at a time, so that more state features of the target, such as its shape contour, can be estimated; this is referred to as the extended target tracking problem. There are generally two classes of sensor management methods for obtaining optimal measurement information with the aim of optimizing target tracking performance. One is the class of task-based sensor management methods, which formulate a corresponding sensor control strategy according to specific task requirements, such as the variance of the states or some measure of the target state distribution. However, such methods have difficulty meeting the requirements of systems in which multiple task demands exist simultaneously. The other class is information-theoretic sensor management methods, in which an evaluation function is established from some measure of the information gain between two probability density functions (such as the Kullback-Leibler divergence or the Renyi divergence), and the sensor control strategy is then solved under the criterion of maximizing the information gain. Information-theoretic sensor control methods can maximize the overall information gain of a system comprising multiple tasks.
Sensor management in conventional target tracking is typically studied within the theoretical framework of a partially observable Markov decision process, and is typically performed in a discrete action space, since at each decision all actions in the feasible sensor control scheme need to be evaluated under the established evaluation criteria. Conventional methods therefore cannot cope with the dimensional explosion and computational complexity caused by rapid growth of the action space.
In recent years, deep reinforcement learning has become a new research hotspot in the field of artificial intelligence. The cross-fusion of deep reinforcement learning with the extended target tracking problem provides a new approach for realizing intelligent sensor control decisions. The Deep Q-Network (DQN) algorithm is a pioneering work in the field of deep reinforcement learning. Based on DQN, a series of improved algorithms such as Double DQN (DDQN), Dueling DQN, and Double Dueling DQN (D3QN) have been proposed. Although DQN and its improved algorithms have achieved good application results, they cannot handle continuous action spaces. A classical continuous control algorithm, the deep deterministic policy gradient (DDPG) algorithm, therefore appeared in the field of deep reinforcement learning as an important algorithm for complex continuous control. However, the DDPG algorithm suffers from the Critic network overestimating the Q value.
In existing sensor management methods in the target tracking field, when the sensor management task is to control the position of the sensor platform to optimize target tracking performance, both task-based and information-theoretic sensor management methods must be studied under an established task-specific optimization or optimization criterion, and control decisions are mainly made over discrete sensor action spaces: the whole action space must be traversed at each decision. When all actions to be decided in the degree-of-freedom space have to be considered, conventional sensor control methods face a drastic drop in efficiency due to dimensional explosion, and when the decision involves a higher number of degrees of freedom they become impractical.
Disclosure of Invention
In order to solve the above technical problems, the application provides a sensor management method based on deep reinforcement learning in an extended target, which extends the traditional sensor management decision space from a discrete action space to a continuous action space, establishes a principled reward mechanism that jointly optimizes the motion state and the extended state (contour information) of the extended target according to the tracking estimation effect, lets an agent learn the optimal control strategy based on a deep reinforcement learning algorithm, and realizes intelligent sensor control decisions in an artificial intelligence manner.
In order to achieve the above object, the present application provides a sensor management method based on deep reinforcement learning in an extended target, including:
modeling an elliptical extended target, and constructing a virtual interaction environment for deep reinforcement learning according to an extended target filtering algorithm;
establishing a TD3 algorithm agent;
letting the virtual interaction environment interact with the TD3 algorithm agent to obtain sensor control data, and storing the sensor control data as samples in an experience replay pool; sampling from the experience replay pool, training the TD3 algorithm agent, and deciding the optimal sensor path planning action through the trained agent;
and applying the optimal action to a sensor, obtaining the sensor position after the sensor undergoes its state transition, thereby obtaining the sensor measurements of the extended target at the current time, and carrying out filtering prediction and update to perform tracking estimation of the extended target.
Preferably, modeling for the ellipse expansion target includes:
setting the state for extended target tracking at time k as: ξ_k = (x_k, X_k), where x_k represents the kinematic state of the target and X_k represents the extended state of the target;
the modeling method comprises the following steps:
where w_k is zero-mean Gaussian process noise, v_k is zero-mean Gaussian measurement noise, x_{s,k} is the sensor position at the current time, f_k(·) is the system state evolution mapping, h_k(·) is the measurement mapping, x_{k+1} represents the kinematic state of the target at time k+1, and the measurement set represents the plurality of measured values at time k.
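The model equations themselves do not reproduce here; a generic discrete-time state-space form consistent with these definitions (a sketch under the assumption of additive noise, with z_k^{(i)} used as an illustrative symbol for the i-th measurement at time k; the exact expressions of the filing may differ) is:

```latex
\begin{aligned}
x_{k+1} &= f_k\left(x_k\right) + w_k,\\
z_k^{(i)} &= h_k\left(x_k,\, x_{s,k}\right) + v_k^{(i)}, \qquad i = 1,\dots,n_k ,
\end{aligned}
```

where n_k denotes the number of measurements received at time k.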
Preferably, the extended state of the extended target at time k is modeled as an elliptical shape and described by a positive definite symmetric matrix X_k as follows:
where θ_k is the orientation angle of the elliptical shape, and σ_{k,1} and σ_{k,2} are the major and minor axes of the ellipse, respectively.
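The matrix expression is likewise not reproduced; a common parameterization of a positive definite extent matrix by an orientation angle and two axis lengths, consistent with the description above (treating σ_{k,1} and σ_{k,2} as semi-axis lengths is an assumption), is:

```latex
X_k = R(\theta_k)
\begin{pmatrix} \sigma_{k,1}^{2} & 0 \\ 0 & \sigma_{k,2}^{2} \end{pmatrix}
R(\theta_k)^{\mathsf{T}},
\qquad
R(\theta_k) =
\begin{pmatrix} \cos\theta_k & -\sin\theta_k \\ \sin\theta_k & \cos\theta_k \end{pmatrix}.
```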
Preferably, constructing a virtual interaction environment with deep reinforcement learning according to the extended target filtering algorithm includes:
fitting the value function and the policy function of deep reinforcement learning with neural networks, performing sensor control with a deep reinforcement learning algorithm through an exploration-exploitation mechanism, establishing an intelligent sensor control system, and constructing the virtual interaction environment through the intelligent sensor control system; the extended target filtering algorithm includes a prediction process and an update process.
Preferably, the prediction process is:
where F_{k|k-1} is the state transition matrix, I_d is the d-dimensional identity matrix, P_{k|k-1} is the predicted covariance matrix, D_{k|k-1} is the covariance matrix of the zero-mean Gaussian process noise, x_{k|k-1} is the one-step prediction, x_{k-1|k-1} is the filtered (updated) value at time k-1, and P_{k-1|k-1} is the corresponding covariance matrix.
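The prediction equations themselves do not reproduce here; a form that matches the listed symbols and follows the commonly used random matrix kinematic prediction is given below. The Kronecker structure involving I_d is an assumption suggested by the presence of the d-dimensional identity matrix, not a quotation of the filing:

```latex
\begin{aligned}
x_{k|k-1} &= \left(F_{k|k-1} \otimes I_d\right) x_{k-1|k-1},\\
P_{k|k-1} &= F_{k|k-1}\, P_{k-1|k-1}\, F_{k|k-1}^{\mathsf{T}} + D_{k|k-1}.
\end{aligned}
```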
Preferably, the TD3 algorithm agent comprises:
Actor network: used to select an action according to the state;
target Actor network: used to select actions again according to the state, based on the parameters obtained from the Actor network;
Critic network: used to evaluate the action selected by the Actor network;
target Critic network: used to evaluate the selected action again, based on the parameters obtained from the Critic network.
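For illustration, a minimal sketch of such Actor and Critic networks in PyTorch is given below; the layer widths, activations, two-dimensional sensor state, and one-dimensional heading action are assumptions for the sketch rather than the architecture specified by the application.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: maps the sensor state to a continuous action (heading angle)."""
    def __init__(self, state_dim=2, action_dim=1, max_action=3.14159):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        # Output is scaled to the admissible action range [-max_action, max_action].
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Action-value network: maps a (state, action) pair to a scalar Q value."""
    def __init__(self, state_dim=2, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# TD3 keeps six networks in total: one Actor with a target copy,
# and two Critics, each with a target copy.
actor = Actor()
actor_target = copy.deepcopy(actor)
critics = [Critic(), Critic()]
critic_targets = [copy.deepcopy(c) for c in critics]
```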
Preferably, acquiring the sensor control data includes:
at each time step the agent takes an action on the virtual interaction environment; the sensor transitions from the state x_{s,k} at time k to the state x_{s,k+1} at time k+1, and a reward value R_{k+1} is obtained by evaluating the reward function; the agent then continuously improves its policy according to the reward values, and finally learns the optimal policy for deciding the sensor action at each time.
Preferably, the method for constructing the reward function comprises the following steps:
defining the prior probability distribution and the posterior probability distribution of the extended target at time k, both obeying multivariate Gaussian distributions, obtaining the Gaussian Wasserstein distance between the prior and posterior distributions, and constructing the reward function based on this distance.
Preferably, the reward function is:
where a_{k,0} denotes the action of keeping the sensor at rest at the current time.
Compared with the prior art, the application has the following advantages and technical effects:
the method of the application uses a random matrix to model the expansion state of the elliptical expansion target, can effectively estimate the movement state and the expansion state of the target, then adopts the setting of the evaluation function in the sensor management method similar to the information theory to construct the reward function applied to the depth reinforcement learning TD3 algorithm, the reward function comprehensively considers the joint optimization of the movement state and the profile information (expansion state) of the target, and after the sensor is effectively controlled under the continuous action space by using the TD3 algorithm, the estimation of the centroid position of the target is more accurate compared with the sensorless control, and the estimation of the profile information of the target is also more accurate, thereby optimizing the tracking effect of the elliptical expansion target on the whole.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a schematic diagram of a sensor control track in an embodiment of the present application;
FIG. 2 is a graph of elliptical expansion target semi-major and semi-minor axis errors according to an embodiment of the present application;
FIG. 3 is a graph of centroid estimation error for an embodiment of the present application;
FIG. 4 is a schematic diagram of the Gaussian Wasserstein (GW) distance between the extended target estimate and the true target in an embodiment of the present application;
FIG. 5 is a flowchart of a sensor management method based on deep reinforcement learning in an extended target according to an embodiment of the present application;
fig. 6 is a schematic diagram of a connection relationship between networks in the TD3 algorithm agent according to an embodiment of the present application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
The application provides a sensor management method based on deep reinforcement learning in an extended target, as shown in fig. 5, comprising the following steps:
modeling is conducted on an ellipse expansion target, and a virtual interaction environment with deep reinforcement learning is constructed according to an expansion target filtering algorithm;
(1) Extended target tracking problem description:
When tracking an extended target, besides tracking the motion of the target centroid, the extended state of the target, i.e., the evolution of the target shape over time, is also tracked. The state for extended target tracking at time k is expressed as ξ_k = (x_k, X_k), where x_k represents the kinematic state of the target and obeys a multivariate Gaussian distribution, and X_k represents the extended state of the target and obeys an inverse Wishart distribution:
where w_k is zero-mean Gaussian process noise, v_k is zero-mean Gaussian measurement noise, x_{s,k} is the sensor position at the current time, f_k(·) is the system state evolution mapping, and h_k(·) is the measurement mapping.
The extended state of the extended target at time k is modeled as an elliptical shape and described by a positive definite symmetric matrix X_k as follows:
where θ_k is the orientation angle of the elliptical shape, and σ_{k,1} and σ_{k,2} are the major and minor axes of the ellipse, respectively.
(2) Expansion target filtering algorithm:
the extended target filtering algorithm is realized under the framework of a Bayesian filtering algorithm and consists of a prediction process and an updating process. Wherein each process is further divided into prediction and update of motion state and extension state:
1) The prediction process comprises the following steps:
Since the motion state obeys a multivariate Gaussian distribution, the mean and covariance matrix of its one-step prediction are as follows:
where F_{k|k-1} is the state transition matrix, I_d is the d-dimensional identity matrix, P_{k|k-1} is the covariance matrix, and D_{k|k-1} is the covariance matrix of the zero-mean Gaussian process noise.
One-step prediction of extended state:
v_{k|k-1} = e^{-T/τ} v_{k-1|k-1}   (6)
where v_{k|k-1} and V_{k|k-1} respectively denote the degrees of freedom and the inverse scale matrix of the inverse Wishart distribution predicted at time k from the posterior at time k-1, T denotes the sampling time, τ is a time decay constant, d denotes the dimension of the target extended state, v_{k-1|k-1} and V_{k-1|k-1} denote the posterior at time k-1, i.e., the degrees of freedom and inverse scale matrix obtained after the iterative update at time k-1, and E[X_{k|k-1}] denotes the mathematical expectation of X_{k|k-1}.
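The remaining extent-prediction expressions do not reproduce here; a form consistent with these definitions, following the widely used inverse Wishart prediction step of the random matrix model (an assumption, since the filing's own expressions are not visible in this text), is:

```latex
\begin{aligned}
V_{k|k-1} &= \frac{v_{k|k-1} - d - 1}{\,v_{k-1|k-1} - d - 1\,}\, V_{k-1|k-1},\\[4pt]
\mathbb{E}\!\left[X_{k|k-1}\right] &= \frac{V_{k|k-1}}{\,v_{k|k-1} - d - 1\,}.
\end{aligned}
```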
2) The updating process comprises the following steps:
updating motion state:
where the two measurement quantities denote the centroid measurement and the corresponding scatter matrix, W_{k|k-1} denotes the gain matrix, ε_k is the innovation of the system measurement, and S_{k|k-1} denotes the covariance matrix of the innovation.
Updating the extended state:
v_{k|k} = v_{k|k-1} + n_k   (16)
where n_k is the number of measurements at time k.
Assuming that the extended target moves in a straight line at constant speed, the system equation is established according to formula (1), and the extended target is modeled as elliptical in shape by formulas (2)-(3). The extended target filtering algorithm of formulas (4)-(19) is set as the environment that interacts with the reinforcement learning agent. The interaction environment takes as input the posterior estimate of the motion state at time k-1 and its covariance matrix, x_{k-1|k-1}, P_{k-1|k-1}, the posterior estimate of the extended state, v_{k-1|k-1}, V_{k-1|k-1}, and the real-time position of the sensor, and obtains the posterior values at time k, x_{k|k}, P_{k|k}, v_{k|k}, V_{k|k}, through the filtering algorithm of formulas (4)-(19).
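As a concrete illustration of this interaction environment, a minimal Python sketch of one environment step is given below. The `predict`, `update`, and `measure` callables stand in for the filtering equations and the sensor measurement model, the fixed-speed heading-angle kinematics are an assumption, and all names are illustrative rather than taken from the application.

```python
import numpy as np

class ExtendedTargetEnv:
    """Virtual interaction environment: one filtering cycle per agent step (illustrative sketch)."""

    def __init__(self, predict, update, measure, reward_fn, sensor_speed=1.0, dt=1.0):
        self.predict = predict      # extended-target prediction step
        self.update = update        # extended-target update step
        self.measure = measure      # returns measurements given the sensor position
        self.reward_fn = reward_fn  # GW-distance based reward
        self.sensor_speed = sensor_speed
        self.dt = dt

    def step(self, posterior, sensor_pos, action):
        """posterior: (x, P, v, V) at time k-1; action: heading angle a_k chosen by the agent."""
        # Sensor state transition: fixed speed along the chosen heading (assumed kinematics).
        new_pos = sensor_pos + self.sensor_speed * self.dt * np.array(
            [np.cos(action), np.sin(action)])
        prior = self.predict(posterior)                             # one-step prediction
        measurements = self.measure(new_pos)                        # measurements at time k
        new_posterior = self.update(prior, measurements, new_pos)   # filtering update
        reward = self.reward_fn(prior, new_posterior, action)
        return new_posterior, new_pos, reward
```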
The extended target tracking problem has been studied in different fields such as radar and computer vision, and its performance depends on the relative geometry of the observer (measurement sensor) and the moving target, so the sensor management task selected here is sensor trajectory planning. According to the framework shown in fig. 1, the elliptical extended target is first modeled and a virtual interaction environment for deep reinforcement learning is built according to the extended target tracking algorithm; neural networks are used to fit the value function and policy function of reinforcement learning, a deep reinforcement learning algorithm performs sensor control through exploration-exploitation mechanisms, and an intelligent sensor control system is established.
Its network structure is constructed according to the TD3 algorithm, which contains six networks (as in fig. 6): an Actor network, a target Actor network, two Critic networks, and two target Critic networks; the network training algorithms are given by formulas (20)-(27). The two main bodies, the environment and the reinforcement learning agent, are thus built; the agent is then trained to finally obtain the optimal policy, after which target tracking is performed according to the sensor measurements. The target Actor network has the same function as the Actor network, and the target Critic networks have the same function as the Critic networks. When the network parameters are updated, the action taken in the next state and the resulting state-action value need to be computed; the purpose of the target networks is to suppress the excessively high Q values caused by reusing the original networks, i.e., bootstrapping. Two Critic networks and their targets are used because the maximization operation also overestimates the Q value when updating the networks; taking the smaller of the two different Critic values effectively suppresses this overestimation during the update.
The DDPG algorithm is an important deep reinforcement learning algorithm for handling complex, continuous control problems, but it suffers from overestimation of the Q value by the Critic network. The twin delayed deep deterministic policy gradient (TD3) algorithm therefore optimizes DDPG in three respects and effectively suppresses the excessively high Q values. Hence, in order to improve the degree to which the deep reinforcement learning algorithm optimizes extended target tracking performance, sensor control is performed in a continuous-task environment based on the TD3 algorithm.
(3) TD3 algorithm
A reinforcement learning problem involves two main bodies: the agent and the environment. In the deep reinforcement learning based sensor control of the extended target, the interaction environment is the elliptical extended target filtering algorithm, and the reinforcement learning agent is trained to perform intelligent sensor control. The reinforcement learning problem can be modeled with a Markov decision process (MDP), expressed as a five-tuple (S, A, P, R, γ). S is the finite state set, i.e., the set of all possible states the agent can explore in the environment; s denotes the state at the current time and s' the next state; here the state is specifically the position of the sensor in the coordinate system. A is the finite action set, i.e., the set of all possible actions the agent can take according to the current state, and a denotes the action currently taken; in the sensor path planning considered here, with the sensor speed fixed, the sensor selects a heading (movement direction) angle. P is the state transition function, i.e., the probability of the sensor transitioning from the current state s to the next state s'. R is the reward function, representing the expected reward obtained after the sensor takes an action from its current position state. γ is the discount factor, representing the weight of future expected rewards at the current time.
The interaction between the agent and the environment proceeds as follows: at time k the agent takes action a_k and applies it to the environment; the sensor transitions from state x_{s,k} at time k to state x_{s,k+1} at time k+1, and a reward value R_{k+1} is obtained by evaluating the reward function; the agent then continuously improves its policy according to the reward values, and finally learns the optimal policy for deciding the sensor action a_k at each time.
In an MDP, the value functions include the state value function and the action value function (Q function); in the TD3 algorithm, the action value function represents the expected reward obtained when the agent takes action a in state s under the guidance of the policy function π. Depending on whether the environment model is known, reinforcement learning algorithms can be divided into two main categories: model-based methods and model-free methods. Model-free methods are more widely used, because the environments of practical problems are mostly complex and unknown, which makes them difficult to model. Model-free methods can be further classified into policy-based methods, value-based methods, and Actor-Critic (AC) methods that combine the two.
The TD3 deep reinforcement learning algorithm is a model-free AC method that fits the value function and the policy function with neural networks. The Actor denotes the policy-function network, used to select actions according to the state, and the Critic denotes the value-function network, used to evaluate the actions selected by the Actor network. Suppose the Actor network uses parameters θ^μ and the two Critic networks use their respective parameters; a target Actor network with parameters θ^{μ'} and two target Critic networks with their own parameters are also maintained. The Critic networks are updated by gradient descent, and the update formula is described as follows:
updating the Actor network by means of gradient descent can be described as:
where N is the number of samples in the mini-batch drawn at each learning step, and α and β are the learning rates. The target networks are updated in a soft manner; the update formula can be described as follows:
θ^{μ'} ← τθ^μ + (1-τ)θ^{μ'}   (27)
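A compact Python sketch of this training step, covering the clipped double-Q target, the Critic gradient descent, the delayed Actor update, and the soft target update, is given below; the hyperparameter values and the structure of the sampled batch are assumptions for the sketch, not values taken from the application.

```python
import torch
import torch.nn.functional as F

def td3_update(batch, actor, actor_target, critics, critic_targets,
               actor_opt, critic_opts, step, gamma=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2, max_action=3.14159):
    state, action, reward, next_state = batch   # tensors sampled from the experience replay pool

    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target Actor's action.
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (actor_target(next_state) + noise).clamp(-max_action, max_action)
        # Clipped double-Q: take the smaller of the two target Critic values.
        target_q = torch.min(critic_targets[0](next_state, next_action),
                             critic_targets[1](next_state, next_action))
        target_q = reward + gamma * target_q

    # Critic update by gradient descent on the TD error.
    for critic, opt in zip(critics, critic_opts):
        loss = F.mse_loss(critic(state, action), target_q)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Delayed Actor update and soft update of all target networks.
    if step % policy_delay == 0:
        actor_loss = -critics[0](state, actor(state)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        pairs = [(actor, actor_target), (critics[0], critic_targets[0]),
                 (critics[1], critic_targets[1])]
        for net, net_t in pairs:
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.copy_(tau * p.data + (1 - tau) * p_t.data)
```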
Starting from the sensor position at the current time k, the agent outputs a sensor action according to its current network parameters, obtains the sensor position at time k+1 from this action, and feeds that position into the filtering environment to obtain the sensor measurements; a pseudo-update of the extended target tracking is then carried out, so that the reward of the action at time k is given by the designed reward function of formula (29), and the data are stored in the form (x_{s,k}, a_k, r_{k+1}, x_{s,k+1}). Target tracking involves many time steps, i.e., k ∈ {0, 1, 2, ..., T}, until the end of the tracking time. In this process the environment continuously interacts with the reinforcement learning agent, producing many tuples of the form (x_{s,k}, a_k, r_{k+1}, x_{s,k+1}) that are stored in the experience replay pool. A certain replay pool capacity is set; when the data exceed the capacity, the oldest data are discarded and new data fill the pool.
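A minimal experience replay pool matching the (x_{s,k}, a_k, r_{k+1}, x_{s,k+1}) tuples described above might look as follows; the capacity value is an assumption. A tuple is stored after every interaction step, and a mini-batch is drawn whenever the networks are updated.

```python
import random
from collections import deque

class ReplayPool:
    """Fixed-capacity experience replay pool; the oldest data are discarded when full."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, sensor_state, action, reward, next_sensor_state):
        self.buffer.append((sensor_state, action, reward, next_sensor_state))

    def sample(self, batch_size):
        # Uniformly sample a small batch for one TD3 network update.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```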
Construction of the reward function in deep reinforcement learning: the information gain between the prior and posterior probability densities of the extended target is selected to design the reward function, and the information gain is measured with the Gaussian Wasserstein distance. The tracking estimation effect of the extended target is thus comprehensively evaluated with the Gaussian Wasserstein (GW) distance between the prior and posterior probability densities: the larger the GW distance, the larger the information gain. The reward function guides the reinforcement learning agent to select the optimal policy, and avoids the convergence difficulties caused by sparse rewards.
(4) Design of reward functions
Since target tracking involves many time steps, i.e., k ∈ {0, 1, 2, ..., T}, at each step a small batch of data is drawn from the experience replay pool and the network parameters are updated according to the update rules of the TD3 algorithm, i.e., equations (20)-(27). Data are continuously drawn from the experience replay pool, a certain training frequency is set, and the network parameters are updated iteratively until they converge to the optimum; the agent can then make the optimal decision according to the sensor position at time k and output the sensor action.
In elliptical extended target tracking, the position component of the motion state and the extended state can be described by a multivariate Gaussian distribution to characterize the overall effect of extended target tracking, expressed as N_x ~ N(m_x, sΣ_x), where m_x is determined by x_k, sΣ_x represents the extent, s is a scaling factor (s = 1 is taken here), and Σ_x is determined by the random matrix X_k. The prior probability distribution and the posterior probability distribution of the extended target at time k are both defined to obey such multivariate Gaussian distributions. The Gaussian Wasserstein distance between the two is used as the measure of information gain.
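The distance expression itself does not reproduce here; the standard squared Gaussian Wasserstein distance between two multivariate Gaussians N(m_1, Σ_1) and N(m_2, Σ_2), which the description appears to rely on, is:

```latex
d_{GW}^{2}\!\left(\mathcal{N}(m_1,\Sigma_1),\,\mathcal{N}(m_2,\Sigma_2)\right)
= \left\lVert m_1 - m_2 \right\rVert^{2}
+ \operatorname{tr}\!\left(\Sigma_1 + \Sigma_2
- 2\left(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\right)^{1/2}\right).
```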
On this basis, the reward function is:
where a_{k,0} denotes the action of keeping the sensor at rest at the current time.
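As an illustration only, a reward of the kind described, i.e. the GW distance between the prior and posterior Gaussians with special handling of the rest action a_{k,0}, could be sketched as follows. The rest-action penalty and all constants are assumptions; the filing's exact formula (29) is not reproduced in this text.

```python
import numpy as np
from scipy.linalg import sqrtm

def gw_distance(m1, S1, m2, S2):
    """Squared Gaussian Wasserstein distance between N(m1, S1) and N(m2, S2)."""
    root_s1 = np.real(sqrtm(S1))
    cross = np.real(sqrtm(root_s1 @ S2 @ root_s1))
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross))

def reward(prior_mean, prior_cov, post_mean, post_cov, action,
           rest_action=None, rest_penalty=0.0):
    """Information-gain style reward: a larger prior-to-posterior GW distance gives a larger reward."""
    r = gw_distance(prior_mean, prior_cov, post_mean, post_cov)
    if rest_action is not None and np.allclose(action, rest_action):
        r -= rest_penalty   # assumed handling of the 'sensor at rest' action a_{k,0}
    return r
```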
The reinforcement learning agent trained at each time step finally selects the action for the next time step according to the current sensor position so as to obtain the optimal measurement; the new measurement is obtained according to formulas (9)-(10), and is finally fed into the filtering algorithm of formulas (6)-(19) to obtain the final extended target tracking result. By iterating the above steps, an optimal sensor path is finally obtained, which optimizes the overall tracking performance of the elliptical extended target.
Fig. 2 shows that, after the sensor trajectory is intelligently planned based on the TD3 algorithm in extended target tracking, the estimation errors of the semi-major and semi-minor axes of the extended state with respect to the real target are less than 1.0, and for the most part less than 0.5, so the shape estimation of the extended target is more accurate.
Fig. 3 shows the centroid estimation error between the extended target state estimate and the true target state under the scheme without sensor control and under the TD3-based control scheme. As can be seen from fig. 3, after the deep reinforcement learning based sensor control method is applied to the extended target, the centroid tracking estimate of the extended target is more accurate.
Fig. 4 uses the Gaussian Wasserstein distance between the extended target estimate and the real extended target to comprehensively evaluate the overall estimation performance for the target motion state and extended state under the scheme without sensor control and under the TD3-based control scheme. As can be seen from fig. 4, after the deep reinforcement learning based sensor control method is applied, the overall performance of extended target tracking is improved, the centroid estimate is more accurate, and the tracking estimate of the target contour information is closer to the real shape of the target.
In summary, the method models the extended state of the elliptical extended target with a random matrix, which allows the motion state and the extended state of the target to be estimated effectively; a reward function for the deep reinforcement learning TD3 algorithm is constructed following the evaluation-function design of information-theoretic sensor management methods, and this reward function comprehensively considers the joint optimization of the motion state and the contour information (extended state) of the target. After the sensor is effectively controlled in the continuous action space with the TD3 algorithm, the estimates of the target centroid position and of the target contour information are more accurate, and the tracking of the elliptical extended target is optimized as a whole.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (8)

1. A sensor management method based on deep reinforcement learning in an extended target, characterized by comprising the following steps:
modeling an elliptical extended target, and constructing a virtual interaction environment for deep reinforcement learning according to an extended target filtering algorithm;
establishing a TD3 algorithm agent;
letting the virtual interaction environment interact with the TD3 algorithm agent to obtain sensor control data, and storing the sensor control data as samples in an experience replay pool; sampling from the experience replay pool, training the TD3 algorithm agent, and deciding the optimal sensor path planning action through the trained agent;
wherein acquiring the sensor control data comprises:
at each time step the agent takes an action on the virtual interaction environment; the sensor transitions from the state x_{s,k} at time k to the state x_{s,k+1} at time k+1, and a reward value R_{k+1} is obtained by evaluating the reward function; the agent then continuously improves its policy according to the reward values, and finally learns the optimal policy for deciding the sensor action at each time;
and applying the optimal action to a sensor, obtaining the sensor position after the sensor undergoes its state transition, thereby obtaining the sensor measurements of the extended target at the current time, and carrying out filtering prediction and update to perform tracking estimation of the extended target.
2. The method for deep reinforcement learning based sensor management in an extended target of claim 1, wherein modeling the elliptical extended target comprises:
setting the state for extended target tracking at time k as: ξ_k = (x_k, X_k), where x_k represents the kinematic state of the target and X_k represents the extended state of the target;
the modeling method comprises the following steps:
where w_k is zero-mean Gaussian process noise, v_k is zero-mean Gaussian measurement noise, x_{s,k} is the current sensor position, f_k(·) is the system state evolution mapping, h_k(·) is the measurement mapping, x_{k+1} represents the kinematic state of the target at time k+1, and the measurement set represents the plurality of measured values at time k.
3. The method for sensor management based on deep reinforcement learning in an extended target according to claim 2, wherein the extended state of the extended target at time k is modeled as an elliptical shape and described by a positive definite symmetric matrix X_k as follows:
where θ_k is the orientation angle of the elliptical shape, and σ_{k,1} and σ_{k,2} are the major and minor axes of the ellipse, respectively.
4. The method for sensor management based on deep reinforcement learning in an extended target according to claim 1, wherein constructing a virtual interaction environment with the deep reinforcement learning according to the extended target filtering algorithm comprises:
fitting the value function and the policy function of deep reinforcement learning with neural networks, performing sensor control with a deep reinforcement learning algorithm through an exploration-exploitation mechanism, establishing an intelligent sensor control system, and constructing the virtual interaction environment through the intelligent sensor control system; the extended target filtering algorithm includes a prediction process and an update process.
5. The method for deep reinforcement learning based sensor management in an extended target according to claim 4, wherein the prediction process is:
where F_{k|k-1} is the state transition matrix, I_d is the d-dimensional identity matrix, P_{k|k-1} is the predicted covariance matrix, D_{k|k-1} is the covariance matrix of the zero-mean Gaussian process noise, x_{k|k-1} is the one-step prediction, x_{k-1|k-1} is the filtered (updated) value at time k-1, and P_{k-1|k-1} is the corresponding covariance matrix.
6. The method for sensor management based on deep reinforcement learning in an extended target according to claim 1, wherein the TD3 algorithm agent comprises:
Actor network: used to select an action according to the state;
target Actor network: used to select actions again according to the state, based on the parameters obtained from the Actor network;
Critic network: used to evaluate the action selected by the Actor network;
target Critic network: used to evaluate the selected action again, based on the parameters obtained from the Critic network.
7. The method for deep reinforcement learning based sensor management in an extended target of claim 1, wherein the method for constructing the bonus function comprises:
defining the prior probability distribution and the posterior probability distribution of the extended target at time k, both obeying multivariate Gaussian distributions, obtaining the Gaussian Wasserstein distance between the prior and posterior distributions, and constructing the reward function based on this distance.
8. The method of deep reinforcement learning based sensor management in an extended target of claim 7, wherein the reward function is:
where a_{k,0} denotes the action of keeping the sensor at rest at the current time.
CN202310609986.1A 2023-05-26 2023-05-26 Sensor management method based on deep reinforcement learning in extended target Active CN116628448B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310609986.1A CN116628448B (en) 2023-05-26 2023-05-26 Sensor management method based on deep reinforcement learning in extended target

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310609986.1A CN116628448B (en) 2023-05-26 2023-05-26 Sensor management method based on deep reinforcement learning in extended target

Publications (2)

Publication Number Publication Date
CN116628448A CN116628448A (en) 2023-08-22
CN116628448B true CN116628448B (en) 2023-11-28

Family

ID=87602222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310609986.1A Active CN116628448B (en) 2023-05-26 2023-05-26 Sensor management method based on deep reinforcement learning in extended target

Country Status (1)

Country Link
CN (1) CN116628448B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862165A (en) * 2020-06-17 2020-10-30 南京理工大学 Target tracking method for updating Kalman filter based on deep reinforcement learning
CN112098993A (en) * 2020-09-16 2020-12-18 中国北方工业有限公司 Multi-target tracking data association method and system
CN113268074A (en) * 2021-06-07 2021-08-17 哈尔滨工程大学 Unmanned aerial vehicle flight path planning method based on joint optimization
CN114036388A (en) * 2021-11-16 2022-02-11 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and storage medium
CN115204212A (en) * 2022-05-26 2022-10-18 兰州理工大学 Multi-target tracking method based on STM-PMBM filtering algorithm
CN116038691A (en) * 2022-12-08 2023-05-02 南京理工大学 Continuous mechanical arm motion control method based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111694365B (en) * 2020-07-01 2021-04-20 武汉理工大学 Unmanned ship formation path tracking method based on deep reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862165A (en) * 2020-06-17 2020-10-30 南京理工大学 Target tracking method for updating Kalman filter based on deep reinforcement learning
CN112098993A (en) * 2020-09-16 2020-12-18 中国北方工业有限公司 Multi-target tracking data association method and system
CN113268074A (en) * 2021-06-07 2021-08-17 哈尔滨工程大学 Unmanned aerial vehicle flight path planning method based on joint optimization
CN114036388A (en) * 2021-11-16 2022-02-11 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and storage medium
CN115204212A (en) * 2022-05-26 2022-10-18 兰州理工大学 Multi-target tracking method based on STM-PMBM filtering algorithm
CN116038691A (en) * 2022-12-08 2023-05-02 南京理工大学 Continuous mechanical arm motion control method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Extended target Bernoulli filtering algorithm based on elliptical RHM; Zhang Yongquan; Zhang Haitao; Ji Hongbing; Systems Engineering and Electronics (Issue 09); full text *

Also Published As

Publication number Publication date
CN116628448A (en) 2023-08-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant