CN114630299A - Information age-aware resource allocation method based on deep reinforcement learning - Google Patents

Information age-aware resource allocation method based on deep reinforcement learning

Info

Publication number
CN114630299A
Authority
CN
China
Prior art keywords
vehicle
time slot
network
user
base station
Prior art date
Legal status
Granted
Application number
CN202210228341.9A
Other languages
Chinese (zh)
Other versions
CN114630299B (en)
Inventor
彭诺蘅
林艳
张一晋
李骏
邹骏
Current Assignee
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202210228341.9A priority Critical patent/CN114630299B/en
Publication of CN114630299A publication Critical patent/CN114630299A/en
Application granted granted Critical
Publication of CN114630299B publication Critical patent/CN114630299B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 - Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30 - Services specially adapted for particular environments, situations or purposes
    • H04W4/40 - Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W4/44 - Services specially adapted for vehicles, for communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 - Local resource management
    • H04W72/04 - Wireless resource allocation
    • H04W72/044 - Wireless resource allocation based on the type of the allocated resource
    • H04W72/0473 - Wireless resource allocation based on the type of the allocated resource, the resource being transmission power
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 - Local resource management
    • H04W72/50 - Allocation or scheduling criteria for wireless resources
    • H04W72/52 - Allocation or scheduling criteria for wireless resources based on load
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 - Local resource management
    • H04W72/50 - Allocation or scheduling criteria for wireless resources
    • H04W72/53 - Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 - Local resource management
    • H04W72/50 - Allocation or scheduling criteria for wireless resources
    • H04W72/54 - Allocation or scheduling criteria for wireless resources based on quality criteria
    • H04W72/542 - Allocation or scheduling criteria for wireless resources based on quality criteria using measured or perceived quality
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses an information age-aware resource allocation method based on deep reinforcement learning, which specifically comprises the following steps: the Internet of Vehicles environment is input, and the base station initializes the parameters of its actor network and critic network; in the current time slot, the base station first allocates channels and transmit power to all vehicle user pairs in the environment; after the vehicle users and cellular users finish communicating, the remaining loads and information ages of all links are updated; after the base station obtains the reward fed back by the environment, it senses and collects the current state information of the environment, while a buffer pool stores the sample data generated in this time slot; when enough samples have accumulated, the parameters of the actor network and the critic network are updated according to the iterative formulas of the trust region policy optimization algorithm, and the buffer pool is emptied after the update completes; when the maximum number of steps of a training round is reached, the Internet of Vehicles environment is entered again to begin the next round. The invention supports various real-time-sensitive applications in the Internet of Vehicles by minimizing the average information age and the average power consumption.

Description

Information age-aware resource allocation method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of communication in wireless mobile networks, and particularly relates to an information age-aware resource allocation method based on deep reinforcement learning.
Background
In the Internet of Vehicles, people and things can interact and share information in real time, such as vehicle driving states, the positions of pedestrians and non-motor vehicles, traffic signs and signals, and road congestion conditions. Meanwhile, the Internet of Vehicles processes the obtained information through cloud computing and big data technologies, realizing functions such as vehicle safety early warning, vehicle state monitoring, vehicle navigation optimization, vehicle energy-saving control, vehicle emergency scheduling, hit-and-run vehicle tracking, and pedestrian and non-motor-vehicle early warning.
In a wireless communication network, signal transmission always consumes wireless communication resources such as spectrum and transmit power. Given the current shortage of spectrum resources and the goal of an energy-saving society, how to reasonably allocate wireless communication resources in the Internet of Vehicles to meet users' communication demands has therefore become an urgent problem. Although many researchers have studied resource allocation in the Internet of Vehicles, most existing work does not consider the influence of the network's high dynamics on the resource allocation scheme.
The high dynamics of the Internet of Vehicles manifests in the fast fading of the wireless channels, which changes rapidly over time and space. Resource allocation work based on conventional optimization algorithms chooses to ignore channel fast fading to reduce computational complexity, so the resource allocation strategies it produces are often neither optimal nor near-optimal. With the rapid development of artificial intelligence, deep reinforcement learning, which combines the perception capability of deep learning with the decision-making capability of reinforcement learning, has gradually attracted researchers' attention. Compared with conventional optimization algorithms, a deep reinforcement learning algorithm can continuously adjust its current strategy as the environment changes dynamically and make reasonable decisions toward a long-term optimization objective. Therefore, even in an unknown dynamic network environment, a deep reinforcement learning algorithm can find a globally optimal resource allocation strategy.
In addition, since many application scenarios in the Internet of Vehicles, such as autonomous driving, place high demands on data timeliness, researchers have begun to study how to minimize the age of information by optimizing resource allocation strategies, thereby guaranteeing information freshness. Nevertheless, most of this work places restrictions on the distance between, or the association of, receivers and transmitters on vehicle-to-vehicle links, and thus cannot guarantee stable scheme performance in a highly dynamic Internet of Vehicles.
Disclosure of Invention
The invention aims to minimize the sum of the average information age of all links and the average power of all vehicle user pairs, thereby providing communication guarantees for various real-time-sensitive applications in the Internet of Vehicles.
The technical solution realizing the purpose of the invention is as follows: an information age-aware resource allocation method based on deep reinforcement learning, which specifically comprises the following steps:
step 1, inputting the Internet of Vehicles environment, and the base station initializing the parameters of its actor network and critic network;
step 2, in the current time slot, the base station first allocating channels and transmit power to all vehicle user pairs in the environment;
step 3, after the vehicle users and cellular users complete communication, updating the remaining loads and information ages of all links;
step 4, after the base station obtains the reward fed back by the environment, the base station sensing and collecting the current state information of the environment, while a buffer pool stores the sample data generated in this time slot;
step 5, when enough samples have accumulated, updating the parameters of the actor network and the critic network according to the iterative formulas of the trust region policy optimization algorithm, and emptying the buffer pool after the update completes;
and step 6, when the maximum number of steps of the training round is reached, ending the current round and starting the next round: inputting the Internet of Vehicles environment again and repeating steps 2 to 5.
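For illustration, the following minimal Python sketch shows how steps 1 to 6 map onto a training loop. The objects `env` and `agent` (with `reset`, `step`, `act`, and `update` methods) and the batch threshold are hypothetical stand-ins for the Internet of Vehicles simulator and the trust region learner described above, not names from the filing.

```python
# Sketch of steps 1-6 as a training loop (illustrative; `env` and `agent`
# are hypothetical objects, not part of the original filing).
def train(env, agent, num_rounds=300, max_steps=50000, batch_size=4096):
    for _ in range(num_rounds):                                 # step 6: one training round
        state = env.reset()                                     # step 1: (re-)enter the IoV environment
        buffer = []                                             # empty sample buffer pool
        for _ in range(max_steps):
            action = agent.act(state)                           # step 2: channel + power per vehicle pair
            next_state, reward = env.step(action)               # step 3: links update load and AoI
            buffer.append((state, action, reward, next_state))  # step 4: store this slot's sample
            if len(buffer) >= batch_size:                       # step 5: enough samples collected
                agent.update(buffer)                            # TRPO update of actor and critic
                buffer = []                                     # empty the buffer pool
            state = next_state
```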
Further, the discretized time slots are represented by the index j, where the duration of each time slot is τ; it is assumed that there are M vehicle user pairs and N cellular users in the system, corresponding to M vehicle-to-vehicle links and N vehicle-to-infrastructure links, respectively.
Further, in step 1 the Internet of Vehicles environment is input and the base station initializes the parameters of its actor network and critic network, wherein the Internet of Vehicles environment comprises:
(1) network model: the base station is located at the center of the map, has a sufficiently large coverage range, and can communicate with vehicles at any position on the map; each vehicle in the system is both a cellular user and a vehicle user: as a vehicle user, it needs to establish a vehicle-to-vehicle link with its corresponding target vehicle user to transmit traffic information; as a cellular user, it establishes a vehicle-to-infrastructure link with the base station to transmit its own user data;
(2) channel model: a finite number of sub-bands exist in the system, and both the channel power gain and the interference power gain consist of slow fading and fast fading, where the slow fading comprises path loss and shadow fading with both line-of-sight and non-line-of-sight cases, and the fast fading is Rayleigh fading;
(3) wireless transmission model: different vehicle-to-infrastructure links occupy fixed, mutually distinct channels, and the transmit power of the cellular users remains unchanged; define $c_m^j$ and $p_m^j$ as the channel and the transmit power, respectively, allocated by the base station to the m-th vehicle user pair at the beginning of time slot j; the maximum value of $p_m^j$ is denoted $P_{\max}$ and the minimum value $P_{\min}$.
Further, in step 2, in the current time slot, the base station first allocates channels and transmit power to all vehicle user pairs in the environment, specifically:
the base station acts as the agent and is responsible for allocating transmission channels and transmit power to all vehicle user pairs; the action at time slot j can therefore be expressed as:

$$a^j = \left\{ a_1^j, a_2^j, \ldots, a_M^j \right\}$$

where $a_m^j = \left( c_m^j, p_m^j \right)$ is the action of the m-th vehicle user pair at time slot j.
Further, in step 3, after the vehicle users and cellular users complete communication, the remaining loads and information ages of all links are updated, specifically:
(1) remaining load update: if the remaining load is fully and successfully transmitted in the current time slot, the remaining load is reset to the initial load; otherwise, the successfully transmitted load is subtracted from the remaining load;
(2) information age update: if the remaining load in the current time slot has been reset to the initial load, the information age is reset to its initial value τ; otherwise, τ is added to the information age.
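As a minimal sketch of these two update rules, assuming a link is described by its remaining load and information age (the argument names are illustrative, not from the filing):

```python
def update_link(remaining_load, age, delivered, initial_load, tau):
    """Per-slot update of a link's remaining load and information age.

    delivered is the load successfully transmitted during the current slot;
    tau is the duration of one time slot.
    """
    if delivered >= remaining_load:
        # The remaining load was fully transmitted: reset the load to the
        # initial load and the information age to its initial value tau.
        return initial_load, tau
    # Otherwise subtract the delivered load and age the information by tau.
    return remaining_load - delivered, age + tau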
Further, in step 4, after the base station obtains the reward fed back by the environment, it senses and collects the current state information of the environment, while the buffer pool stores the sample data generated in this time slot, specifically:
(1) reward
Define $\mathbf{c}^j = \left[ c_1^j, \ldots, c_M^j \right]$ as the transmission channel vector allocated by the base station to all vehicle user pairs at the beginning of time slot j, and $\mathbf{p}^j = \left[ p_1^j, \ldots, p_M^j \right]$ as the transmit power vector allocated by the base station to all vehicle user pairs at the beginning of time slot j; let $A_n^j$ and $A_m^j$ be the information ages of the n-th cellular user and of the m-th vehicle user pair at the end of time slot j, respectively.
To achieve the optimization goal of minimizing the sum of the information ages of all links and the power of the vehicle user pairs, define $U_m^A(j)$ as the information age utility function of the m-th vehicle user pair, $U_m^P(j)$ as the power utility function of the m-th vehicle user pair, and $U_n^A(j)$ as the information age utility function of the n-th cellular user.
Considering the different orders of magnitude of information age and power, the information age and the power consumption are each normalized within their utility functions so that they fall in the same value range.
The reward at time slot j can then be expressed as:

$$r^j = \sum_{m=1}^{M} r_m^j + \sum_{n=1}^{N} r_n^j$$

where $r_m^j$, the reward of the m-th vehicle user pair at time slot j, combines $U_m^A(j)$ and $U_m^P(j)$, and $r_n^j$, the reward of the n-th cellular user at time slot j, is built from $U_n^A(j)$; $\lambda_a$ and $\lambda_p$ are the weight coefficients of the information age part and the power part, respectively, and satisfy $\lambda_a + \lambda_p = 1$.
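In code, the per-slot reward could be assembled as below; since the exact utility expressions appear only as images in the original, the max-normalization used here is an assumption consistent with the stated goal of bringing age and power into the same value range.

```python
import numpy as np

def slot_reward(ages_v2v, powers, ages_v2i, lam_a, lam_p, a_max, p_max):
    """Weighted negative utilities: minimizing age and power maximizes reward.

    ages_v2v: information ages of the M vehicle user pairs at slot end
    powers:   transmit powers assigned to the M pairs this slot
    ages_v2i: information ages of the N cellular users at slot end
    lam_a + lam_p must equal 1; a_max and p_max normalize both quantities
    to a common range (an assumed normalization, not from the filing).
    """
    assert abs(lam_a + lam_p - 1.0) < 1e-9
    u_age_m = np.asarray(ages_v2v) / a_max        # normalized age utilities (V2V)
    u_pow_m = np.asarray(powers) / p_max          # normalized power utilities
    u_age_n = np.asarray(ages_v2i) / a_max        # normalized age utilities (V2I)
    r_m = -(lam_a * u_age_m + lam_p * u_pow_m)    # per-pair rewards
    r_n = -lam_a * u_age_n                        # per-cellular-user rewards
    return r_m.sum() + r_n.sum()
```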
(2) state
The base station can observe and collect four kinds of state information: the channel states of all vehicle-to-vehicle and vehicle-to-infrastructure links in the current time slot; the interference power experienced by all vehicle-to-vehicle links under the different channel selections in the previous time slot; the remaining loads of all cellular users and vehicle user pairs; and the information ages of all cellular users and vehicle user pairs.
First, the state information associated with all vehicle user pairs at time slot j can be expressed as:

$$S_V^j = \left\{ S_{V,1}^j, S_{V,2}^j, \ldots, S_{V,M}^j \right\}$$

where $S_{V,m}^j$, the state information of the m-th vehicle user pair at time slot j, comprises: the channel state vector $\mathbf{h}_m^j$ of the m-th vehicle user pair over the L channels at time slot j; the interference power vector $\mathbf{I}_m^{j-1}$ received by the receiver of the m-th vehicle user pair over the L channel selections at time slot j-1; the remaining load $B_m^{j-1}$ of the m-th vehicle user pair at the end of time slot j-1; and the information age $A_m^{j-1}$ of the m-th vehicle user pair at the end of time slot j-1.
Second, the state information associated with all cellular users at time slot j can be expressed as:

$$S_C^j = \left\{ S_{C,1}^j, S_{C,2}^j, \ldots, S_{C,N}^j \right\}$$

where $S_{C,n}^j$, the state information of the n-th cellular user at time slot j, comprises: the channel state vector $\mathbf{h}_n^j$ of the n-th cellular user over the L channels at time slot j; the remaining load $B_n^{j-1}$ of the n-th cellular user at the end of time slot j-1; and the information age $A_n^{j-1}$ of the n-th cellular user at the end of time slot j-1.
The state at time slot j can then be expressed as:

$$s^j = \left\{ S_V^j, S_C^j \right\}$$
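Assembling the observations into the flat state vector $s^j$ fed to the networks might look like the following sketch (array shapes are assumptions: M vehicle user pairs, N cellular users, L channels):

```python
import numpy as np

def build_state(h_v2v, interf, load_v2v, age_v2v, h_v2i, load_v2i, age_v2i):
    """Flatten per-link observations into the state s^j.

    h_v2v:   (M, L) channel states of the V2V links at slot j
    interf:  (M, L) interference powers seen at slot j-1 per channel choice
    load_v2v, age_v2v: (M,) remaining load and AoI at the end of slot j-1
    h_v2i:   (N, L) channel states of the V2I links at slot j
    load_v2i, age_v2i: (N,) remaining load and AoI at the end of slot j-1
    """
    s_v = np.concatenate([h_v2v.ravel(), interf.ravel(), load_v2v, age_v2v])
    s_c = np.concatenate([h_v2i.ravel(), load_v2i, age_v2i])
    return np.concatenate([s_v, s_c])  # s^j = {S_V^j, S_C^j}
```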
further, in step 5, when the number of samples is sufficient, the parameters of the operator network and the critic network are updated according to an iterative formula in the confidence domain policy optimization algorithm, and the buffer pool is emptied after the update is completed, specifically:
(1) actor network
In the confidence domain strategy optimization algorithm, an operator network can fit strategies and output high-dimensional action and environment interaction as a strategy function;
the Actor network can ensure that the cumulative prize value obtainable by the new strategy is higher than the cumulative prize value obtainable by the old strategy by maximizing the difference between the cumulative prize value obtained by using the new strategy and the cumulative prize value obtained by using the old strategy, i.e. the new strategy is better than the old strategy, thereby achieving the aim that the strategies are always improved monotonously; meanwhile, in order to ensure the stability of the strategy updating process, KL divergence constraint is introduced into the confidence domain strategy optimization algorithm to prevent the strategy from changing greatly; thus, the optimization problem for an actor network can be expressed as:
$$\max_{\theta} \; \mathbb{E}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\pi_{\theta_{\mathrm{old}}}}(s, a) \right]$$

$$\mathrm{s.t.} \quad \mathbb{E}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\middle\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta$$

where θ is the policy parameter vector, a and s are the action vector and the state vector respectively, $\mathbb{E}[\cdot]$ denotes expectation over trajectories, $D_{\mathrm{KL}}(\cdot \| \cdot)$ is the KL divergence between two distributions, $\pi_{\theta_{\mathrm{old}}}$ is the not-yet-updated policy, $\pi_{\theta}$ is the updated policy, δ is the threshold on the expected KL divergence that defines the trust region, and $A_{\pi_{\theta_{\mathrm{old}}}}(s, a)$ is the advantage function, i.e. the advantage of action a over the average action in state s;
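For illustration, the surrogate objective and the KL term can be estimated from sampled data as in the following PyTorch sketch for a discrete action distribution; the variable names are illustrative, not from the filing.

```python
import torch

def surrogate_and_kl(new_logp, old_logp, old_probs, new_probs, advantages):
    """Monte-Carlo estimates of the TRPO surrogate objective and mean KL.

    new_logp/old_logp: log pi_theta(a|s) and log pi_theta_old(a|s) at sampled (s, a)
    old_probs/new_probs: full action distributions at the sampled states
    advantages: advantage estimates A(s, a) under the old policy
    """
    ratio = torch.exp(new_logp - old_logp)     # pi_theta / pi_theta_old
    surrogate = (ratio * advantages).mean()    # objective to maximize
    kl = (old_probs * (old_probs.log() - new_probs.log())).sum(-1).mean()  # D_KL(old || new)
    return surrogate, kl
```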
For this optimization problem, the natural policy gradient method is first used to simplify it, the conjugate gradient method is then used to avoid inverting the Fisher information matrix, and finally a backtracking line search is introduced, yielding the iterative equation of θ:

$$\theta_{k+1} = \theta_k + \alpha^{i} \sqrt{\frac{2\delta}{x^{\mathsf{T}} F(\theta_k)\, x}}\; x$$

where x is the solution of the system of linear equations $F(\theta)\, x = g(\theta)$, $F(\theta)$ is the Fisher information matrix, $g(\theta)$ is the gradient, $\alpha \in (0,1)$ is the step length of the actor network, and i is the first non-negative integer that simultaneously satisfies the KL divergence constraint and improves the policy;
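A sketch of this update: conjugate gradient solves $Fx = g$ using only Fisher-vector products, and a backtracking line search shrinks the step by powers of the step length until the trust region constraint and policy improvement both hold. The `fvp` and `accept` callables are assumptions standing in for the full implementation.

```python
import torch

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x = torch.zeros_like(g)
    r, p = g.clone(), g.clone()   # residual r = g - F*0 = g
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step(theta, fvp, g, delta, accept, alpha=0.5, max_backtracks=10):
    """theta_{k+1} = theta_k + alpha^i * sqrt(2*delta / (x^T F x)) * x."""
    x = conjugate_gradient(fvp, g)
    full_step = torch.sqrt(2 * delta / (x @ fvp(x))) * x
    for i in range(max_backtracks):
        candidate = theta + (alpha ** i) * full_step
        if accept(candidate):  # KL within delta and surrogate improved
            return candidate
    return theta  # no acceptable step found; keep old parameters
```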
(2) critic network
In the trust region policy optimization algorithm, the critic network fits the state value function: acting as the value function, it evaluates high-dimensional state inputs and guides the actor network;
the critic network improves the accuracy of its value prediction by minimizing the following loss function:

$$\mathrm{Loss}(w) = \left( r_t + \gamma V(s_{t+1}, w') - V(s_t, w) \right)^2$$

where w is the parameter vector of the critic evaluation network, w' is the parameter vector of the critic target network, the discount factor $\gamma \in [0,1)$ reflects the influence of future rewards on the cumulative reward, $V(s_t, w)$ is the state value given by the critic evaluation network at time t, and $V(s_{t+1}, w')$ is the state value given by the critic target network at time t+1;
this unconstrained nonlinear programming problem is solved with the L-BFGS method, yielding the iterative formula of w:

$$w_{k+1} = w_k - \rho^{\iota}\, D_k\, g_k$$

where $g_k$ is the gradient, ρ is the step size of the critic network, $D_k$ is an approximation of the inverse of the Hessian matrix, $w_1$ is set to a random initial point, and ι is the first non-negative integer that ensures a smooth update of the critic network parameter vector.
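The critic update can be sketched as below; delegating the quasi-Newton iteration to `torch.optim.LBFGS` is a convenience substitution for the hand-derived L-BFGS formula above, and the function and argument names are illustrative.

```python
import torch

def fit_critic(critic, target_critic, states, rewards, next_states,
               gamma=0.99, lbfgs_iters=25):
    """Minimize (r_t + gamma * V(s_{t+1}, w') - V(s_t, w))^2 over w."""
    optimizer = torch.optim.LBFGS(critic.parameters(), max_iter=lbfgs_iters)
    with torch.no_grad():
        # Targets come from the frozen critic target network (parameters w').
        targets = rewards + gamma * target_critic(next_states).squeeze(-1)

    def closure():
        optimizer.zero_grad()
        loss = ((targets - critic(states).squeeze(-1)) ** 2).mean()
        loss.backward()
        return loss

    optimizer.step(closure)
```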
Compared with the prior art, the invention has the following notable advantages: (1) by means of the actor-critic framework, the actor network fits the policy and can directly generate the optimal strategy after convergence; (2) the policy is guaranteed to improve monotonically, so the convergence speed is high; (3) the KL divergence constraint limits the policy update magnitude in each iteration, thus ensuring stability.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a schematic model diagram of an Internet of Vehicles system according to an embodiment of the present invention.
FIG. 3 is a graph of the average cumulative reward versus the number of training rounds in an embodiment of the present invention.
Fig. 4 is a graph of the average cumulative reward versus the initial load of the vehicle-to-vehicle links in an embodiment of the present invention.
Fig. 5 is a graph of the average cumulative reward versus the initial load of the vehicle-to-infrastructure links in an embodiment of the present invention.
Detailed Description
With reference to fig. 1 and fig. 2, the information age-aware resource allocation method based on deep reinforcement learning provided by the invention specifically comprises the following steps:
step 1, inputting the Internet of Vehicles environment, and the base station initializing the parameters of its actor network and critic network;
step 2, in the current time slot, the base station first allocating channels and transmit power to all vehicle user pairs in the environment;
step 3, after the vehicle users and cellular users complete communication, updating the remaining loads and information ages of all links;
step 4, after the base station obtains the reward fed back by the environment, the base station sensing and collecting the current state information of the environment, while a buffer pool stores the sample data generated in this time slot;
step 5, when enough samples have accumulated, updating the parameters of the actor network and the critic network according to the iterative formulas of the trust region policy optimization algorithm, and emptying the buffer pool after the update completes;
and step 6, when the maximum number of steps of the training round is reached, ending the current round and starting the next round: inputting the Internet of Vehicles environment again and repeating steps 2 to 5.
In the invention, the discretized time slots are represented by the index j, where the duration of each time slot is τ; it is assumed that there are M vehicle user pairs and N cellular users in the system, corresponding to M vehicle-to-vehicle links and N vehicle-to-infrastructure links, respectively.
Further, in step 1 the Internet of Vehicles environment is input and the base station initializes the parameters of its actor network and critic network, wherein the Internet of Vehicles environment comprises:
(1) network model: the base station is located at the center of the map, has a sufficiently large coverage range, and can communicate with vehicles at any position on the map; each vehicle in the system is both a cellular user and a vehicle user: as a vehicle user, it needs to establish a vehicle-to-vehicle link with its corresponding target vehicle user to transmit traffic information; as a cellular user, it establishes a vehicle-to-infrastructure link with the base station to transmit its own user data;
(2) channel model: a finite number of sub-bands exist in the system, and both the channel power gain and the interference power gain consist of slow fading and fast fading, where the slow fading comprises path loss and shadow fading with both line-of-sight and non-line-of-sight cases, and the fast fading is Rayleigh fading;
(3) wireless transmission model: different vehicle-to-infrastructure links occupy fixed, mutually distinct channels, and the transmit power of the cellular users remains unchanged; define $c_m^j$ and $p_m^j$ as the channel and the transmit power, respectively, allocated by the base station to the m-th vehicle user pair at the beginning of time slot j; the maximum value of $p_m^j$ is denoted $P_{\max}$ and the minimum value $P_{\min}$.
Further, in step 2, in the current time slot, the base station first allocates channels and transmit power to all vehicle user pairs in the environment, specifically:
the base station acts as the agent and is responsible for allocating transmission channels and transmit power to all vehicle user pairs; the action at time slot j can therefore be expressed as:

$$a^j = \left\{ a_1^j, a_2^j, \ldots, a_M^j \right\}$$

where $a_m^j = \left( c_m^j, p_m^j \right)$ is the action of the m-th vehicle user pair at time slot j.
Further, in step 3, after the vehicle users and cellular users complete communication, the remaining loads and information ages of all links are updated, specifically:
(1) remaining load update: if the remaining load is fully and successfully transmitted in the current time slot, the remaining load is reset to the initial load; otherwise, the successfully transmitted load is subtracted from the remaining load;
(2) information age update: if the remaining load in the current time slot has been reset to the initial load, the information age is reset to its initial value τ; otherwise, τ is added to the information age.
Further, in step 4, after the base station obtains the reward fed back by the environment, it senses and collects the current state information of the environment, while the buffer pool stores the sample data generated in this time slot, specifically:
(1) reward
Define $\mathbf{c}^j = \left[ c_1^j, \ldots, c_M^j \right]$ as the transmission channel vector allocated by the base station to all vehicle user pairs at the beginning of time slot j, and $\mathbf{p}^j = \left[ p_1^j, \ldots, p_M^j \right]$ as the transmit power vector allocated by the base station to all vehicle user pairs at the beginning of time slot j; let $A_n^j$ and $A_m^j$ be the information ages of the n-th cellular user and of the m-th vehicle user pair at the end of time slot j, respectively.
To achieve the optimization goal of minimizing the sum of the information ages of all links and the power of the vehicle user pairs, define $U_m^A(j)$ as the information age utility function of the m-th vehicle user pair, $U_m^P(j)$ as the power utility function of the m-th vehicle user pair, and $U_n^A(j)$ as the information age utility function of the n-th cellular user.
Considering the different orders of magnitude of information age and power, the information age and the power consumption are each normalized within their utility functions so that they fall in the same value range.
The reward at time slot j can then be expressed as:

$$r^j = \sum_{m=1}^{M} r_m^j + \sum_{n=1}^{N} r_n^j$$

where $r_m^j$, the reward of the m-th vehicle user pair at time slot j, combines $U_m^A(j)$ and $U_m^P(j)$, and $r_n^j$, the reward of the n-th cellular user at time slot j, is built from $U_n^A(j)$; $\lambda_a$ and $\lambda_p$ are the weight coefficients of the information age part and the power part, respectively, and satisfy $\lambda_a + \lambda_p = 1$.
(2) state
The base station can observe and collect four kinds of state information: the channel states of all vehicle-to-vehicle and vehicle-to-infrastructure links in the current time slot; the interference power experienced by all vehicle-to-vehicle links under the different channel selections in the previous time slot; the remaining loads of all cellular users and vehicle user pairs; and the information ages of all cellular users and vehicle user pairs.
First, the state information associated with all vehicle user pairs at time slot j can be expressed as:

$$S_V^j = \left\{ S_{V,1}^j, S_{V,2}^j, \ldots, S_{V,M}^j \right\}$$

where $S_{V,m}^j$, the state information of the m-th vehicle user pair at time slot j, comprises: the channel state vector $\mathbf{h}_m^j$ of the m-th vehicle user pair over the L channels at time slot j; the interference power vector $\mathbf{I}_m^{j-1}$ received by the receiver of the m-th vehicle user pair over the L channel selections at time slot j-1; the remaining load $B_m^{j-1}$ of the m-th vehicle user pair at the end of time slot j-1; and the information age $A_m^{j-1}$ of the m-th vehicle user pair at the end of time slot j-1.
Second, the state information associated with all cellular users at time slot j can be expressed as:

$$S_C^j = \left\{ S_{C,1}^j, S_{C,2}^j, \ldots, S_{C,N}^j \right\}$$

where $S_{C,n}^j$, the state information of the n-th cellular user at time slot j, comprises: the channel state vector $\mathbf{h}_n^j$ of the n-th cellular user over the L channels at time slot j; the remaining load $B_n^{j-1}$ of the n-th cellular user at the end of time slot j-1; and the information age $A_n^{j-1}$ of the n-th cellular user at the end of time slot j-1.
The state at time slot j can then be expressed as:

$$s^j = \left\{ S_V^j, S_C^j \right\}$$
further, in step 5, when the number of samples is sufficient, the parameters of the operator network and the critical network are updated according to an iterative formula in the confidence domain policy optimization algorithm, and the buffer pool is emptied after the update is completed, specifically:
(1) actor network
In the confidence domain strategy optimization algorithm, an actor network can fit strategies and output high-dimensional actions and environment interaction as a strategy function;
the Actor network can ensure that the jackpot value available for a new policy is higher than the jackpot value available for an old policy by maximising the difference between the jackpot value obtained using the new policy and the jackpot value obtained using the old policy, i.e. the new policy is better than the old policy, thereby achieving the goal that the policies are always improving monotonically; meanwhile, in order to ensure the stability of the strategy updating process, the confidence domain strategy optimization algorithm introduces KL divergence constraint to prevent the strategy from changing greatly; thus, the optimization problem for an actor network can be expressed as:
$$\max_{\theta} \; \mathbb{E}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\pi_{\theta_{\mathrm{old}}}}(s, a) \right]$$

$$\mathrm{s.t.} \quad \mathbb{E}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\middle\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta$$

where θ is the policy parameter vector, a and s are the action vector and the state vector respectively, $\mathbb{E}[\cdot]$ denotes expectation over trajectories, $D_{\mathrm{KL}}(\cdot \| \cdot)$ is the KL divergence between two distributions, $\pi_{\theta_{\mathrm{old}}}$ is the not-yet-updated policy, $\pi_{\theta}$ is the updated policy, δ is the threshold on the expected KL divergence that defines the trust region, and $A_{\pi_{\theta_{\mathrm{old}}}}(s, a)$ is the advantage function, i.e. the advantage of action a over the average action in state s;
For this optimization problem, the natural policy gradient method is first used to simplify it, the conjugate gradient method is then used to avoid inverting the Fisher information matrix, and finally a backtracking line search is introduced, yielding the iterative equation of θ:

$$\theta_{k+1} = \theta_k + \alpha^{i} \sqrt{\frac{2\delta}{x^{\mathsf{T}} F(\theta_k)\, x}}\; x$$

where x is the solution of the system of linear equations $F(\theta)\, x = g(\theta)$, $F(\theta)$ is the Fisher information matrix, $g(\theta)$ is the gradient, $\alpha \in (0,1)$ is the step length of the actor network, and i is the first non-negative integer that simultaneously satisfies the KL divergence constraint and improves the policy;
(2) critic network
In the trust region policy optimization algorithm, the critic network fits the state value function: acting as the value function, it evaluates high-dimensional state inputs and guides the actor network;
the critic network improves the accuracy of its value prediction by minimizing the following loss function:

$$\mathrm{Loss}(w) = \left( r_t + \gamma V(s_{t+1}, w') - V(s_t, w) \right)^2$$

where w is the parameter vector of the critic evaluation network, w' is the parameter vector of the critic target network, the discount factor $\gamma \in [0,1)$ reflects the influence of future rewards on the cumulative reward, $V(s_t, w)$ is the state value given by the critic evaluation network at time t, and $V(s_{t+1}, w')$ is the state value given by the critic target network at time t+1;
this unconstrained nonlinear programming problem is solved with the L-BFGS method, yielding the iterative formula of w:

$$w_{k+1} = w_k - \rho^{\iota}\, D_k\, g_k$$

where $g_k$ is the gradient, ρ is the step size of the critic network, $D_k$ is an approximation of the inverse of the Hessian matrix, $w_1$ is set to a random initial point, and ι is the first non-negative integer that ensures a smooth update of the critic network parameter vector.
The invention is described in further detail below with reference to the figures and the embodiments.
Examples
One embodiment of the invention is described in detail below; the simulation uses Python programming, and the parameter settings do not affect generality. The proposed method is compared against: (1) a random dynamic resource allocation method; (2) an information age-aware dynamic resource allocation method based on a deep Q network.
TABLE 1 Primary simulation parameter values
(The main simulation parameter values appear as a table image in the original document.)
As shown in fig. 2, the Internet of Vehicles system model follows the Manhattan grid model in the 3GPP TR 36.885 standard. Table 1 lists the main simulation parameter values. During training, the number of rounds is set to 300 and the maximum number of steps per round to 50000. Before each round begins, each vehicle selects the vehicle closest to it as its target communication vehicle, thereby generating the vehicle user pairs. The initial positions of the vehicles are randomly distributed over the map, and the vehicle speeds lie in the interval [10 m/s, 15 m/s]. Further, when a vehicle reaches an intersection, it changes its driving direction with a certain probability (set to 0.64 here). In the trust region policy optimization algorithm, the actor network and the critic network are both fully connected neural networks comprising an input layer, an output layer, and three hidden layers; the numbers of neurons in the three hidden layers are 500, 250, and 120. In addition, the step length α of the actor network is set to 0.5, and the step size ρ of the critic network is also set to 0.5. It is worth noting that information freshness is crucial for road traffic safety; the simulation therefore sets the weight coefficient of the information age part larger than that of the power part, and both weight values are kept unchanged in the later performance comparisons.
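A sketch of these fully connected networks in PyTorch follows; the input and output dimensions are placeholders to be derived from the state and action spaces, and the softmax head for the actor is an assumption for a discrete channel/power action set.

```python
import torch.nn as nn

def mlp(in_dim, out_dim):
    """Fully connected network with the three hidden layers of the embodiment."""
    return nn.Sequential(
        nn.Linear(in_dim, 500), nn.ReLU(),
        nn.Linear(500, 250), nn.ReLU(),
        nn.Linear(250, 120), nn.ReLU(),
        nn.Linear(120, out_dim),
    )

# Hypothetical sizes: state_dim from s^j, action_dim from the channel/power choices.
state_dim, action_dim = 64, 8
actor = nn.Sequential(mlp(state_dim, action_dim), nn.Softmax(dim=-1))  # policy over actions
critic = mlp(state_dim, 1)                                             # state value V(s)
```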
As shown in fig. 3, compared with the baseline methods, the proposed method stably and monotonically improves the policy at each iteration of the actor network and the critic network, so its converged average cumulative reward is higher, it converges faster, and its performance is more stable.
As shown in fig. 4 and fig. 5, the performance of the proposed method remains significantly better than the baseline methods as the initial loads of both the vehicle-to-vehicle links and the vehicle-to-infrastructure links vary. The reason is that the method learns quickly and stably thanks to its actor-critic framework and its choice of gradient update step sizes.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are given by way of illustration of the principles of the present invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications are within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (7)

1. An information age-aware resource allocation method based on deep reinforcement learning, characterized by comprising the following steps:
step 1, inputting the Internet of Vehicles environment, and the base station initializing the parameters of its actor network and critic network;
step 2, in the current time slot, the base station first allocating channels and transmit power to all vehicle user pairs in the environment;
step 3, after the vehicle users and cellular users complete communication, updating the remaining loads and information ages of all links;
step 4, after the base station obtains the reward fed back by the environment, the base station sensing and collecting the current state information of the environment, while a buffer pool stores the sample data generated in this time slot;
step 5, when enough samples have accumulated, updating the parameters of the actor network and the critic network according to the iterative formulas of the trust region policy optimization algorithm, and emptying the buffer pool after the update completes;
and step 6, when the maximum number of steps of the training round is reached, ending the current round and starting the next round: inputting the Internet of Vehicles environment again and repeating steps 2 to 5.
2. The method of claim 1, wherein the discretized time slots are represented by the index j, where the duration of each time slot is τ; it is assumed that there are M vehicle user pairs and N cellular users in the system, corresponding to M vehicle-to-vehicle links and N vehicle-to-infrastructure links, respectively.
3. The method according to claim 2, wherein in step 1 the Internet of Vehicles environment is input and the base station initializes the parameters of its own actor network and critic network, the Internet of Vehicles environment comprising:
(1) network model: the base station is located at the center of the map, has a sufficiently large coverage range, and can communicate with vehicles at any position on the map; each vehicle in the system is both a cellular user and a vehicle user: as a vehicle user, it needs to establish a vehicle-to-vehicle link with its corresponding target vehicle user to transmit traffic information; as a cellular user, it establishes a vehicle-to-infrastructure link with the base station to transmit its own user data;
(2) channel model: a finite number of sub-bands exist in the system, and both the channel power gain and the interference power gain consist of slow fading and fast fading, where the slow fading comprises path loss and shadow fading with both line-of-sight and non-line-of-sight cases, and the fast fading is Rayleigh fading;
(3) wireless transmission model: different vehicle-to-infrastructure links occupy fixed, mutually distinct channels, and the transmit power of the cellular users remains unchanged; define $c_m^j$ and $p_m^j$ as the channel and the transmit power, respectively, allocated by the base station to the m-th vehicle user pair at the beginning of time slot j; the maximum value of $p_m^j$ is denoted $P_{\max}$ and the minimum value $P_{\min}$.
4. The method as claimed in claim 2 or 3, wherein in the current time slot the base station first allocates channels and transmit power to all vehicle user pairs in the environment, specifically:
the base station acts as the agent and is responsible for allocating transmission channels and transmit power to all vehicle user pairs; the action at time slot j can therefore be expressed as:

$$a^j = \left\{ a_1^j, a_2^j, \ldots, a_M^j \right\}$$

where $a_m^j = \left( c_m^j, p_m^j \right)$ is the action of the m-th vehicle user pair at time slot j.
5. The information age-aware resource allocation method based on deep reinforcement learning according to claim 4, wherein in step 3, after the vehicle users and cellular users complete communication, the remaining loads and information ages of all links are updated, specifically:
(1) remaining load update: if the remaining load is fully and successfully transmitted in the current time slot, the remaining load is reset to the initial load; otherwise, the successfully transmitted load is subtracted from the remaining load;
(2) information age update: if the remaining load in the current time slot has been reset to the initial load, the information age is reset to its initial value τ; otherwise, τ is added to the information age.
6. The method according to claim 5, wherein in step 4, after the base station obtains the reward fed back by the environment, it senses and collects the current state information of the environment, and the buffer pool stores the sample data generated in this time slot, specifically:
(1) reward
Define $\mathbf{c}^j = \left[ c_1^j, \ldots, c_M^j \right]$ as the transmission channel vector allocated by the base station to all vehicle user pairs at the beginning of time slot j, and $\mathbf{p}^j = \left[ p_1^j, \ldots, p_M^j \right]$ as the transmit power vector allocated by the base station to all vehicle user pairs at the beginning of time slot j; let $A_n^j$ and $A_m^j$ be the information ages of the n-th cellular user and of the m-th vehicle user pair at the end of time slot j, respectively.
To achieve the optimization goal of minimizing the sum of the information ages of all links and the power of the vehicle user pairs, define $U_m^A(j)$ as the information age utility function of the m-th vehicle user pair, $U_m^P(j)$ as the power utility function of the m-th vehicle user pair, and $U_n^A(j)$ as the information age utility function of the n-th cellular user.
Considering the different orders of magnitude of information age and power, the information age and the power consumption are each normalized within their utility functions so that they fall in the same value range.
The reward at time slot j can then be expressed as:

$$r^j = \sum_{m=1}^{M} r_m^j + \sum_{n=1}^{N} r_n^j$$

where $r_m^j$, the reward of the m-th vehicle user pair at time slot j, combines $U_m^A(j)$ and $U_m^P(j)$, and $r_n^j$, the reward of the n-th cellular user at time slot j, is built from $U_n^A(j)$; $\lambda_a$ and $\lambda_p$ are the weight coefficients of the information age part and the power part, respectively, and satisfy $\lambda_a + \lambda_p = 1$;
(2) state
the base station can observe and collect four kinds of state information: the channel states of all vehicle-to-vehicle and vehicle-to-infrastructure links in the current time slot; the interference power experienced by all vehicle-to-vehicle links under the different channel selections in the previous time slot; the remaining loads of all cellular users and vehicle user pairs; and the information ages of all cellular users and vehicle user pairs;
first, the state information associated with all vehicle user pairs at time slot j can be expressed as:

$$S_V^j = \left\{ S_{V,1}^j, S_{V,2}^j, \ldots, S_{V,M}^j \right\}$$

where $S_{V,m}^j$, the state information of the m-th vehicle user pair at time slot j, comprises: the channel state vector $\mathbf{h}_m^j$ of the m-th vehicle user pair over the L channels at time slot j; the interference power vector $\mathbf{I}_m^{j-1}$ received by the receiver of the m-th vehicle user pair over the L channel selections at time slot j-1; the remaining load $B_m^{j-1}$ of the m-th vehicle user pair at the end of time slot j-1; and the information age $A_m^{j-1}$ of the m-th vehicle user pair at the end of time slot j-1;
second, the state information associated with all cellular users at time slot j can be expressed as:

$$S_C^j = \left\{ S_{C,1}^j, S_{C,2}^j, \ldots, S_{C,N}^j \right\}$$

where $S_{C,n}^j$, the state information of the n-th cellular user at time slot j, comprises: the channel state vector $\mathbf{h}_n^j$ of the n-th cellular user over the L channels at time slot j; the remaining load $B_n^{j-1}$ of the n-th cellular user at the end of time slot j-1; and the information age $A_n^{j-1}$ of the n-th cellular user at the end of time slot j-1;
the state at time slot j can then be expressed as:

$$s^j = \left\{ S_V^j, S_C^j \right\}$$
7. The method as claimed in claim 1, wherein in step 5, when enough samples have accumulated, the parameters of the actor network and the critic network are updated according to the iterative formulas of the trust region policy optimization algorithm, and the buffer pool is emptied after the update completes, specifically:
(1) actor network
in the trust region policy optimization algorithm, the actor network fits the policy: acting as the policy function, it takes high-dimensional input and outputs the actions that interact with the environment;
the actor network maximizes the difference between the cumulative reward obtained with the new policy and that obtained with the old policy, which guarantees that the cumulative reward attainable by the new policy is higher than that of the old policy, i.e. the new policy is better than the old policy, so that the policy always improves monotonically; meanwhile, to keep the policy update process stable, the trust region policy optimization algorithm introduces a KL divergence constraint that prevents large policy changes; thus, the optimization problem of the actor network can be expressed as:
$$\max_{\theta} \; \mathbb{E}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\pi_{\theta_{\mathrm{old}}}}(s, a) \right]$$

$$\mathrm{s.t.} \quad \mathbb{E}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\middle\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta$$

where θ is the policy parameter vector, a and s are the action vector and the state vector respectively, $\mathbb{E}[\cdot]$ denotes expectation over trajectories, $D_{\mathrm{KL}}(\cdot \| \cdot)$ is the KL divergence between two distributions, $\pi_{\theta_{\mathrm{old}}}$ is the not-yet-updated policy, $\pi_{\theta}$ is the updated policy, δ is the threshold on the expected KL divergence that defines the trust region, and $A_{\pi_{\theta_{\mathrm{old}}}}(s, a)$ is the advantage function, i.e. the advantage of action a over the average action in state s;
for this optimization problem, the natural policy gradient method is first used to simplify it, the conjugate gradient method is then used to avoid inverting the Fisher information matrix, and finally a backtracking line search is introduced, yielding the iterative equation of θ:

$$\theta_{k+1} = \theta_k + \alpha^{i} \sqrt{\frac{2\delta}{x^{\mathsf{T}} F(\theta_k)\, x}}\; x$$

where x is the solution of the system of linear equations $F(\theta)\, x = g(\theta)$, $F(\theta)$ is the Fisher information matrix, $g(\theta)$ is the gradient, $\alpha \in (0,1)$ is the step length of the actor network, and i is the first non-negative integer that simultaneously satisfies the KL divergence constraint and improves the policy;
(2) critic network
in the trust region policy optimization algorithm, the critic network fits the state value function: acting as the value function, it evaluates high-dimensional state inputs and guides the actor network;
the critic network improves the accuracy of its value prediction by minimizing the following loss function:

$$\mathrm{Loss}(w) = \left( r_t + \gamma V(s_{t+1}, w') - V(s_t, w) \right)^2$$

where w is the parameter vector of the critic evaluation network, w' is the parameter vector of the critic target network, the discount factor $\gamma \in [0,1)$ reflects the influence of future rewards on the cumulative reward, $V(s_t, w)$ is the state value given by the critic evaluation network at time t, and $V(s_{t+1}, w')$ is the state value given by the critic target network at time t+1;
this unconstrained nonlinear programming problem is solved with the L-BFGS method, and the obtained iterative formula of w is:

$$w_{k+1} = w_k - \rho^{\iota}\, D_k\, g_k$$

where $g_k$ is the gradient, ρ is the step size of the critic network, $D_k$ is an approximation of the inverse of the Hessian matrix, $w_1$ is set to a random initial point, and ι is the first non-negative integer that ensures a smooth update of the critic network parameter vector.
CN202210228341.9A 2022-03-08 2022-03-08 Information age perceivable resource allocation method based on deep reinforcement learning Active CN114630299B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210228341.9A CN114630299B (en) 2022-03-08 2022-03-08 Information age perceivable resource allocation method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210228341.9A CN114630299B (en) 2022-03-08 2022-03-08 Information age perceivable resource allocation method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114630299A true CN114630299A (en) 2022-06-14
CN114630299B CN114630299B (en) 2024-04-23

Family

ID=81899620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210228341.9A Active CN114630299B (en) 2022-03-08 2022-03-08 Information age perceivable resource allocation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114630299B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115001002A (en) * 2022-08-01 2022-09-02 广东电网有限责任公司肇庆供电局 Optimal scheduling method and system for solving energy storage participation peak clipping and valley filling
CN117896679A (en) * 2024-01-18 2024-04-16 西南交通大学 Message propagation method and device for self-adaptive key vehicle node selection

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109862610A (en) * 2019-01-08 2019-06-07 华中科技大学 D2D user resource allocation method based on the deep reinforcement learning DDPG algorithm
US20200398859A1 (en) * 2019-06-20 2020-12-24 Cummins Inc. Reinforcement learning control of vehicle systems
US20210123757A1 (en) * 2019-10-24 2021-04-29 Lg Electronics Inc. Method and apparatus for managing vehicle's resource in autonomous driving system
WO2021135554A1 (en) * 2019-12-31 2021-07-08 歌尔股份有限公司 Method and device for planning global path of unmanned vehicle
CN113438315A (en) * 2021-07-02 2021-09-24 中山大学 Internet of things information freshness optimization method based on dual-network deep reinforcement learning
US20220015068A1 (en) * 2020-07-09 2022-01-13 Qualcomm Incorporated Enhancements for improved cv2x scheduling and performance

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109862610A (en) * 2019-01-08 2019-06-07 华中科技大学 D2D user resource allocation method based on the deep reinforcement learning DDPG algorithm
US20200398859A1 (en) * 2019-06-20 2020-12-24 Cummins Inc. Reinforcement learning control of vehicle systems
US20210123757A1 (en) * 2019-10-24 2021-04-29 Lg Electronics Inc. Method and apparatus for managing vehicle's resource in autonomous driving system
WO2021135554A1 (en) * 2019-12-31 2021-07-08 歌尔股份有限公司 Method and device for planning global path of unmanned vehicle
US20220015068A1 (en) * 2020-07-09 2022-01-13 Qualcomm Incorporated Enhancements for improved cv2x scheduling and performance
CN113438315A (en) * 2021-07-02 2021-09-24 中山大学 Internet of things information freshness optimization method based on dual-network deep reinforcement learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MOHAMMAD PARVINI et al.: "AoI-Aware Resource Allocation for Platoon-Based C-V2X Networks via Multi-Agent Multi-Task Reinforcement Learning", IEEE, 10 May 2021 (2021-05-10) *
PENG Nuoheng (彭诺蘅): "Research on dynamic resource allocation in wireless ad hoc networks based on reinforcement learning" (基于强化学习的无线自组织网络动态资源分配研究), Master's thesis, Nanjing University of Science and Technology, 22 August 2023 (2023-08-22) *
LI Ziheng; MENG Chao: "Wireless network resource allocation algorithm based on deep reinforcement learning" (基于深度强化学习的无线网络资源分配算法), Communications Technology, no. 08, 10 August 2020 (2020-08-10) *
XIONG Ke; HU Huimin; AI Bo; ZHANG Yu; PEI Li: "Information-freshness-first wireless network design for the 6G era" (6G时代信息新鲜度优先的无线网络设计), Chinese Journal on Internet of Things, no. 01, 30 March 2020 (2020-03-30) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115001002A (en) * 2022-08-01 2022-09-02 广东电网有限责任公司肇庆供电局 Optimal scheduling method and system for solving energy storage participation peak clipping and valley filling
CN115001002B (en) * 2022-08-01 2022-12-30 广东电网有限责任公司肇庆供电局 Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling
CN117896679A (en) * 2024-01-18 2024-04-16 西南交通大学 Message propagation method and device for self-adaptive key vehicle node selection
CN117896679B (en) * 2024-01-18 2024-07-16 西南交通大学 Message propagation method and device for self-adaptive key vehicle node selection

Also Published As

Publication number Publication date
CN114630299B (en) 2024-04-23

Similar Documents

Publication Publication Date Title
CN114389678B (en) Multi-beam satellite resource allocation method based on decision performance evaluation
Luo et al. Dynamic resource allocations based on Q-learning for D2D communication in cellular networks
CN114630299A (en) Information age-perceptible resource allocation method based on deep reinforcement learning
Wu et al. Load balance guaranteed vehicle-to-vehicle computation offloading for min-max fairness in VANETs
Vu et al. Multi-agent reinforcement learning for channel assignment and power allocation in platoon-based C-V2X systems
CN109819422B (en) Stackelberg game-based heterogeneous Internet of vehicles multi-mode communication method
CN112512121A (en) Radio frequency spectrum dynamic allocation method and device based on reinforcement learning algorithm
Şahin et al. Reinforcement learning scheduler for vehicle-to-vehicle communications outside coverage
CN115866787A (en) Network resource allocation method integrating terminal direct transmission communication and multi-access edge calculation
Liang et al. Multi-agent reinforcement learning for spectrum sharing in vehicular networks
Ji et al. Multi-agent reinforcement learning resources allocation method using dueling double deep Q-network in vehicular networks
CN113423087B (en) Wireless resource allocation method facing vehicle queue control requirement
Huang et al. Delay-oriented knowledge-driven resource allocation in sagin-based vehicular networks
Yang et al. Task-driven semantic-aware green cooperative transmission strategy for vehicular networks
Ren et al. Joint spectrum allocation and power control in vehicular communications based on dueling double DQN
CN117412391A (en) Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method
CN117354833A (en) Cognitive Internet of things resource allocation method based on multi-agent reinforcement learning algorithm
CN112750298A (en) Truck formation dynamic resource allocation method based on SMDP and DRL
Li et al. A Lightweight Transmission Parameter Selection Scheme Using Reinforcement Learning for LoRaWAN
Zhao et al. Multi-agent deep reinforcement learning based resource management in heterogeneous V2X networks
Lyu et al. Service-driven resource management in vehicular networks based on deep reinforcement learning
CN115551065A (en) Internet of vehicles resource allocation method based on multi-agent deep reinforcement learning
Chen et al. Caching in narrow-band burst-error channels via meta self-supervision learning
CN114916087A (en) Dynamic spectrum access method based on India buffet process in VANET system
CN113890653A (en) Multi-agent reinforcement learning power distribution method for multi-user benefits

Legal Events

Date Code Title Description
PB01 - Publication
SE01 - Entry into force of request for substantive examination
GR01 - Patent grant