CN114630299A - Information age-aware resource allocation method based on deep reinforcement learning - Google Patents

Information age-aware resource allocation method based on deep reinforcement learning

Info

Publication number
CN114630299A
Authority
CN
China
Prior art keywords
vehicle
time slot
network
user
base station
Prior art date
Legal status
Granted
Application number
CN202210228341.9A
Other languages
Chinese (zh)
Other versions
CN114630299B (en)
Inventor
彭诺蘅
林艳
张一晋
李骏
邹骏
Current Assignee
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202210228341.9A priority Critical patent/CN114630299B/en
Publication of CN114630299A publication Critical patent/CN114630299A/en
Application granted granted Critical
Publication of CN114630299B publication Critical patent/CN114630299B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 - Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30 - Services specially adapted for particular environments, situations or purposes
    • H04W4/40 - Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W4/44 - Services specially adapted for vehicles, for communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 - Local resource management
    • H04W72/04 - Wireless resource allocation
    • H04W72/044 - Wireless resource allocation based on the type of the allocated resource
    • H04W72/0473 - Wireless resource allocation based on the type of the allocated resource, the resource being transmission power
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 - Local resource management
    • H04W72/50 - Allocation or scheduling criteria for wireless resources
    • H04W72/52 - Allocation or scheduling criteria for wireless resources based on load
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 - Local resource management
    • H04W72/50 - Allocation or scheduling criteria for wireless resources
    • H04W72/53 - Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 - Local resource management
    • H04W72/50 - Allocation or scheduling criteria for wireless resources
    • H04W72/54 - Allocation or scheduling criteria for wireless resources based on quality criteria
    • H04W72/542 - Allocation or scheduling criteria for wireless resources based on quality criteria using measured or perceived quality
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses an information age-aware resource allocation method based on deep reinforcement learning, which specifically comprises the following steps: the Internet of Vehicles environment is input, and the base station initializes the parameters of its actor network and critic network; in the current time slot, the base station first allocates channels and transmit power to all vehicle user pairs in the environment; after the vehicle users and cellular users finish communicating, the remaining loads and information ages of all links are updated; after the base station obtains the reward fed back by the environment, it senses and collects the current state information of the environment, while a buffer pool stores the sample data generated in this time slot; when enough samples have accumulated, the parameters of the actor network and the critic network are updated according to the iterative formulas of the trust region policy optimization algorithm, and the buffer pool is emptied after the update completes; when the maximum number of steps of a training round is reached, the Internet of Vehicles environment is entered again to begin the next round. The invention supports various real-time-sensitive applications in the Internet of Vehicles by minimizing the average information age and the average power consumption.

Description

Information age-aware resource allocation method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of communication in wireless mobile networks, and particularly relates to an information age-aware resource allocation method based on deep reinforcement learning.
Background
In the Internet of Vehicles, people and things can interact and share information in real time, such as vehicle driving states, the positions of pedestrians and non-motor vehicles, traffic signs and signals, and road congestion conditions. Meanwhile, the Internet of Vehicles processes the obtained information through cloud computing and big data technologies, realizing functions such as vehicle safety early warning, vehicle state monitoring, vehicle navigation optimization, vehicle energy-saving control, vehicle emergency scheduling, hit-and-run vehicle tracking, and pedestrian and non-motor-vehicle early warning.
In a wireless communication network, signal transmission always consumes wireless communication resources such as spectrum and transmit power. Given the current shortage of spectrum resources and the goal of an energy-saving society, how to reasonably allocate wireless communication resources in the Internet of Vehicles to meet users' communication demands has therefore become an urgent problem. Although many researchers have studied resource allocation in the Internet of Vehicles, most existing work does not consider the influence of the network's high dynamics on the resource allocation scheme.
The high dynamics of the Internet of Vehicles manifests in the fast fading of the wireless channels, which changes rapidly over time and space. Resource allocation work based on conventional optimization algorithms chooses to ignore channel fast fading to reduce computational complexity, so the resource allocation strategies it produces are often neither optimal nor near-optimal. With the rapid development of artificial intelligence, deep reinforcement learning, which combines the perception capability of deep learning with the decision-making capability of reinforcement learning, has gradually attracted researchers' attention. Compared with conventional optimization algorithms, a deep reinforcement learning algorithm can continuously adjust its current strategy as the environment changes dynamically and make reasonable decisions toward a long-term optimization objective. Therefore, even in an unknown dynamic network environment, a deep reinforcement learning algorithm can find a globally optimal resource allocation strategy.
In addition, since many application scenarios in the Internet of Vehicles, such as autonomous driving, place high demands on data timeliness, researchers have begun to study how to minimize the age of information by optimizing resource allocation strategies, thereby guaranteeing information freshness. Nevertheless, most of this work places restrictions on the distance between, or the association of, receivers and transmitters on vehicle-to-vehicle links, and thus cannot guarantee stable scheme performance in a highly dynamic Internet of Vehicles.
Disclosure of Invention
The invention aims to minimize the sum of the average information age of all links and the average power of all vehicle user pairs, thereby providing communication guarantees for various real-time-sensitive applications in the Internet of Vehicles.
The technical solution realizing the purpose of the invention is as follows: an information age-aware resource allocation method based on deep reinforcement learning, which specifically comprises the following steps:
step 1, inputting the Internet of Vehicles environment, and the base station initializing the parameters of its actor network and critic network;
step 2, in the current time slot, the base station first allocating channels and transmit power to all vehicle user pairs in the environment;
step 3, after the vehicle users and cellular users complete communication, updating the remaining loads and information ages of all links;
step 4, after the base station obtains the reward fed back by the environment, the base station sensing and collecting the current state information of the environment, while a buffer pool stores the sample data generated in this time slot;
step 5, when enough samples have accumulated, updating the parameters of the actor network and the critic network according to the iterative formulas of the trust region policy optimization algorithm, and emptying the buffer pool after the update completes;
and step 6, when the maximum number of steps of the training round is reached, ending the current round and starting the next round: inputting the Internet of Vehicles environment again and repeating steps 2 to 5.
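For illustration, the following minimal Python sketch shows how steps 1 to 6 map onto a training loop. The objects `env` and `agent` (with `reset`, `step`, `act`, and `update` methods) and the batch threshold are hypothetical stand-ins for the Internet of Vehicles simulator and the trust region learner described above, not names from the filing.

```python
# Sketch of steps 1-6 as a training loop (illustrative; `env` and `agent`
# are hypothetical objects, not part of the original filing).
def train(env, agent, num_rounds=300, max_steps=50000, batch_size=4096):
    for _ in range(num_rounds):                                 # step 6: one training round
        state = env.reset()                                     # step 1: (re-)enter the IoV environment
        buffer = []                                             # empty sample buffer pool
        for _ in range(max_steps):
            action = agent.act(state)                           # step 2: channel + power per vehicle pair
            next_state, reward = env.step(action)               # step 3: links update load and AoI
            buffer.append((state, action, reward, next_state))  # step 4: store this slot's sample
            if len(buffer) >= batch_size:                       # step 5: enough samples collected
                agent.update(buffer)                            # TRPO update of actor and critic
                buffer = []                                     # empty the buffer pool
            state = next_state
```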
Further, the discretized time slots are represented by the index j, where the duration of each time slot is τ; it is assumed that there are M vehicle user pairs and N cellular users in the system, corresponding to M vehicle-to-vehicle links and N vehicle-to-infrastructure links, respectively.
Further, in step 1 the Internet of Vehicles environment is input and the base station initializes the parameters of its actor network and critic network, wherein the Internet of Vehicles environment comprises:
(1) network model: the base station is located at the center of the map, has a sufficiently large coverage range, and can communicate with vehicles at any position on the map; each vehicle in the system is both a cellular user and a vehicle user: as a vehicle user, it needs to establish a vehicle-to-vehicle link with its corresponding target vehicle user to transmit traffic information; as a cellular user, it establishes a vehicle-to-infrastructure link with the base station to transmit its own user data;
(2) channel model: a finite number of sub-bands exist in the system, and both the channel power gain and the interference power gain consist of slow fading and fast fading, where the slow fading comprises path loss and shadow fading with both line-of-sight and non-line-of-sight cases, and the fast fading is Rayleigh fading;
(3) wireless transmission model: different vehicle-to-infrastructure links occupy fixed, mutually distinct channels, and the transmit power of the cellular users remains unchanged; define $c_m^j$ and $p_m^j$ as the channel and the transmit power, respectively, allocated by the base station to the m-th vehicle user pair at the beginning of time slot j; the maximum value of $p_m^j$ is denoted $P_{\max}$ and the minimum value $P_{\min}$.
Further, in step 2, in the current time slot, the base station first allocates channels and transmit power to all vehicle user pairs in the environment, specifically:
the base station acts as the agent and is responsible for allocating transmission channels and transmit power to all vehicle user pairs; the action at time slot j can therefore be expressed as:

$$a^j = \left\{ a_1^j, a_2^j, \ldots, a_M^j \right\}$$

where $a_m^j = \left( c_m^j, p_m^j \right)$ is the action of the m-th vehicle user pair at time slot j.
Further, in step 3, after the vehicle users and cellular users complete communication, the remaining loads and information ages of all links are updated, specifically:
(1) remaining load update: if the remaining load is fully and successfully transmitted in the current time slot, the remaining load is reset to the initial load; otherwise, the successfully transmitted load is subtracted from the remaining load;
(2) information age update: if the remaining load in the current time slot has been reset to the initial load, the information age is reset to its initial value τ; otherwise, τ is added to the information age.
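As a minimal sketch of these two update rules, assuming a link is described by its remaining load and information age (the argument names are illustrative, not from the filing):

```python
def update_link(remaining_load, age, delivered, initial_load, tau):
    """Per-slot update of a link's remaining load and information age.

    delivered is the load successfully transmitted during the current slot;
    tau is the duration of one time slot.
    """
    if delivered >= remaining_load:
        # The remaining load was fully transmitted: reset the load to the
        # initial load and the information age to its initial value tau.
        return initial_load, tau
    # Otherwise subtract the delivered load and age the information by tau.
    return remaining_load - delivered, age + tau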
Further, in step 4, after the base station obtains the reward fed back by the environment, it senses and collects the current state information of the environment, while the buffer pool stores the sample data generated in this time slot, specifically:
(1) reward
Define $\mathbf{c}^j = \left[ c_1^j, \ldots, c_M^j \right]$ as the transmission channel vector allocated by the base station to all vehicle user pairs at the beginning of time slot j, and $\mathbf{p}^j = \left[ p_1^j, \ldots, p_M^j \right]$ as the transmit power vector allocated by the base station to all vehicle user pairs at the beginning of time slot j; let $A_n^j$ and $A_m^j$ be the information ages of the n-th cellular user and of the m-th vehicle user pair at the end of time slot j, respectively.
To achieve the optimization goal of minimizing the sum of the information ages of all links and the power of the vehicle user pairs, define $U_m^A(j)$ as the information age utility function of the m-th vehicle user pair, $U_m^P(j)$ as the power utility function of the m-th vehicle user pair, and $U_n^A(j)$ as the information age utility function of the n-th cellular user.
Considering the different orders of magnitude of information age and power, the information age and the power consumption are each normalized within their utility functions so that they fall in the same value range.
The reward at time slot j can then be expressed as:

$$r^j = \sum_{m=1}^{M} r_m^j + \sum_{n=1}^{N} r_n^j$$

where $r_m^j$, the reward of the m-th vehicle user pair at time slot j, combines $U_m^A(j)$ and $U_m^P(j)$, and $r_n^j$, the reward of the n-th cellular user at time slot j, is built from $U_n^A(j)$; $\lambda_a$ and $\lambda_p$ are the weight coefficients of the information age part and the power part, respectively, and satisfy $\lambda_a + \lambda_p = 1$.
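In code, the per-slot reward could be assembled as below; since the exact utility expressions appear only as images in the original, the max-normalization used here is an assumption consistent with the stated goal of bringing age and power into the same value range.

```python
import numpy as np

def slot_reward(ages_v2v, powers, ages_v2i, lam_a, lam_p, a_max, p_max):
    """Weighted negative utilities: minimizing age and power maximizes reward.

    ages_v2v: information ages of the M vehicle user pairs at slot end
    powers:   transmit powers assigned to the M pairs this slot
    ages_v2i: information ages of the N cellular users at slot end
    lam_a + lam_p must equal 1; a_max and p_max normalize both quantities
    to a common range (an assumed normalization, not from the filing).
    """
    assert abs(lam_a + lam_p - 1.0) < 1e-9
    u_age_m = np.asarray(ages_v2v) / a_max        # normalized age utilities (V2V)
    u_pow_m = np.asarray(powers) / p_max          # normalized power utilities
    u_age_n = np.asarray(ages_v2i) / a_max        # normalized age utilities (V2I)
    r_m = -(lam_a * u_age_m + lam_p * u_pow_m)    # per-pair rewards
    r_n = -lam_a * u_age_n                        # per-cellular-user rewards
    return r_m.sum() + r_n.sum()
```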
(2) state
The base station can observe and collect four kinds of state information: the channel states of all vehicle-to-vehicle and vehicle-to-infrastructure links in the current time slot; the interference power experienced by all vehicle-to-vehicle links under the different channel selections in the previous time slot; the remaining loads of all cellular users and vehicle user pairs; and the information ages of all cellular users and vehicle user pairs.
First, the state information associated with all vehicle user pairs at time slot j can be expressed as:

$$S_V^j = \left\{ S_{V,1}^j, S_{V,2}^j, \ldots, S_{V,M}^j \right\}$$

where $S_{V,m}^j$, the state information of the m-th vehicle user pair at time slot j, comprises: the channel state vector $\mathbf{h}_m^j$ of the m-th vehicle user pair over the L channels at time slot j; the interference power vector $\mathbf{I}_m^{j-1}$ received by the receiver of the m-th vehicle user pair over the L channel selections at time slot j-1; the remaining load $B_m^{j-1}$ of the m-th vehicle user pair at the end of time slot j-1; and the information age $A_m^{j-1}$ of the m-th vehicle user pair at the end of time slot j-1.
Second, the state information associated with all cellular users at time slot j can be expressed as:

$$S_C^j = \left\{ S_{C,1}^j, S_{C,2}^j, \ldots, S_{C,N}^j \right\}$$

where $S_{C,n}^j$, the state information of the n-th cellular user at time slot j, comprises: the channel state vector $\mathbf{h}_n^j$ of the n-th cellular user over the L channels at time slot j; the remaining load $B_n^{j-1}$ of the n-th cellular user at the end of time slot j-1; and the information age $A_n^{j-1}$ of the n-th cellular user at the end of time slot j-1.
The state at time slot j can then be expressed as:

$$s^j = \left\{ S_V^j, S_C^j \right\}$$
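Assembling the observations into the flat state vector $s^j$ fed to the networks might look like the following sketch (array shapes are assumptions: M vehicle user pairs, N cellular users, L channels):

```python
import numpy as np

def build_state(h_v2v, interf, load_v2v, age_v2v, h_v2i, load_v2i, age_v2i):
    """Flatten per-link observations into the state s^j.

    h_v2v:   (M, L) channel states of the V2V links at slot j
    interf:  (M, L) interference powers seen at slot j-1 per channel choice
    load_v2v, age_v2v: (M,) remaining load and AoI at the end of slot j-1
    h_v2i:   (N, L) channel states of the V2I links at slot j
    load_v2i, age_v2i: (N,) remaining load and AoI at the end of slot j-1
    """
    s_v = np.concatenate([h_v2v.ravel(), interf.ravel(), load_v2v, age_v2v])
    s_c = np.concatenate([h_v2i.ravel(), load_v2i, age_v2i])
    return np.concatenate([s_v, s_c])  # s^j = {S_V^j, S_C^j}
```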
further, in step 5, when the number of samples is sufficient, the parameters of the operator network and the critic network are updated according to an iterative formula in the confidence domain policy optimization algorithm, and the buffer pool is emptied after the update is completed, specifically:
(1) actor network
In the confidence domain strategy optimization algorithm, an operator network can fit strategies and output high-dimensional action and environment interaction as a strategy function;
the Actor network can ensure that the cumulative prize value obtainable by the new strategy is higher than the cumulative prize value obtainable by the old strategy by maximizing the difference between the cumulative prize value obtained by using the new strategy and the cumulative prize value obtained by using the old strategy, i.e. the new strategy is better than the old strategy, thereby achieving the aim that the strategies are always improved monotonously; meanwhile, in order to ensure the stability of the strategy updating process, KL divergence constraint is introduced into the confidence domain strategy optimization algorithm to prevent the strategy from changing greatly; thus, the optimization problem for an actor network can be expressed as:
$$\max_{\theta} \; \mathbb{E}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\pi_{\theta_{\mathrm{old}}}}(s, a) \right]$$

$$\mathrm{s.t.} \quad \mathbb{E}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\middle\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta$$

where θ is the policy parameter vector, a and s are the action vector and the state vector respectively, $\mathbb{E}[\cdot]$ denotes expectation over trajectories, $D_{\mathrm{KL}}(\cdot \| \cdot)$ is the KL divergence between two distributions, $\pi_{\theta_{\mathrm{old}}}$ is the not-yet-updated policy, $\pi_{\theta}$ is the updated policy, δ is the threshold on the expected KL divergence that defines the trust region, and $A_{\pi_{\theta_{\mathrm{old}}}}(s, a)$ is the advantage function, i.e. the advantage of action a over the average action in state s;
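For illustration, the surrogate objective and the KL term can be estimated from sampled data as in the following PyTorch sketch for a discrete action distribution; the variable names are illustrative, not from the filing.

```python
import torch

def surrogate_and_kl(new_logp, old_logp, old_probs, new_probs, advantages):
    """Monte-Carlo estimates of the TRPO surrogate objective and mean KL.

    new_logp/old_logp: log pi_theta(a|s) and log pi_theta_old(a|s) at sampled (s, a)
    old_probs/new_probs: full action distributions at the sampled states
    advantages: advantage estimates A(s, a) under the old policy
    """
    ratio = torch.exp(new_logp - old_logp)     # pi_theta / pi_theta_old
    surrogate = (ratio * advantages).mean()    # objective to maximize
    kl = (old_probs * (old_probs.log() - new_probs.log())).sum(-1).mean()  # D_KL(old || new)
    return surrogate, kl
```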
For this optimization problem, the natural policy gradient method is first used to simplify it, the conjugate gradient method is then used to avoid inverting the Fisher information matrix, and finally a backtracking line search is introduced, yielding the iterative equation of θ:

$$\theta_{k+1} = \theta_k + \alpha^{i} \sqrt{\frac{2\delta}{x^{\mathsf{T}} F(\theta_k)\, x}}\; x$$

where x is the solution of the system of linear equations $F(\theta)\, x = g(\theta)$, $F(\theta)$ is the Fisher information matrix, $g(\theta)$ is the gradient, $\alpha \in (0,1)$ is the step length of the actor network, and i is the first non-negative integer that simultaneously satisfies the KL divergence constraint and improves the policy;
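A sketch of this update: conjugate gradient solves $Fx = g$ using only Fisher-vector products, and a backtracking line search shrinks the step by powers of the step length until the trust region constraint and policy improvement both hold. The `fvp` and `accept` callables are assumptions standing in for the full implementation.

```python
import torch

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x = torch.zeros_like(g)
    r, p = g.clone(), g.clone()   # residual r = g - F*0 = g
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step(theta, fvp, g, delta, accept, alpha=0.5, max_backtracks=10):
    """theta_{k+1} = theta_k + alpha^i * sqrt(2*delta / (x^T F x)) * x."""
    x = conjugate_gradient(fvp, g)
    full_step = torch.sqrt(2 * delta / (x @ fvp(x))) * x
    for i in range(max_backtracks):
        candidate = theta + (alpha ** i) * full_step
        if accept(candidate):  # KL within delta and surrogate improved
            return candidate
    return theta  # no acceptable step found; keep old parameters
```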
(2) critic network
In the trust region policy optimization algorithm, the critic network fits the state value function: acting as the value function, it evaluates high-dimensional state inputs and guides the actor network;
the critic network improves the accuracy of its value prediction by minimizing the following loss function:

$$\mathrm{Loss}(w) = \left( r_t + \gamma V(s_{t+1}, w') - V(s_t, w) \right)^2$$

where w is the parameter vector of the critic evaluation network, w' is the parameter vector of the critic target network, the discount factor $\gamma \in [0,1)$ reflects the influence of future rewards on the cumulative reward, $V(s_t, w)$ is the state value given by the critic evaluation network at time t, and $V(s_{t+1}, w')$ is the state value given by the critic target network at time t+1;
this unconstrained nonlinear programming problem is solved with the L-BFGS method, yielding the iterative formula of w:

$$w_{k+1} = w_k - \rho^{\iota}\, D_k\, g_k$$

where $g_k$ is the gradient, ρ is the step size of the critic network, $D_k$ is an approximation of the inverse of the Hessian matrix, $w_1$ is set to a random initial point, and ι is the first non-negative integer that ensures a smooth update of the critic network parameter vector.
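The critic update can be sketched as below; delegating the quasi-Newton iteration to `torch.optim.LBFGS` is a convenience substitution for the hand-derived L-BFGS formula above, and the function and argument names are illustrative.

```python
import torch

def fit_critic(critic, target_critic, states, rewards, next_states,
               gamma=0.99, lbfgs_iters=25):
    """Minimize (r_t + gamma * V(s_{t+1}, w') - V(s_t, w))^2 over w."""
    optimizer = torch.optim.LBFGS(critic.parameters(), max_iter=lbfgs_iters)
    with torch.no_grad():
        # Targets come from the frozen critic target network (parameters w').
        targets = rewards + gamma * target_critic(next_states).squeeze(-1)

    def closure():
        optimizer.zero_grad()
        loss = ((targets - critic(states).squeeze(-1)) ** 2).mean()
        loss.backward()
        return loss

    optimizer.step(closure)
```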
Compared with the prior art, the invention has the following notable advantages: (1) by means of the actor-critic framework, the actor network fits the policy and can directly generate the optimal strategy after convergence; (2) the policy is guaranteed to improve monotonically, so the convergence speed is high; (3) the KL divergence constraint limits the policy update magnitude in each iteration, thus ensuring stability.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a schematic model diagram of an Internet of Vehicles system according to an embodiment of the present invention.
FIG. 3 is a graph of the average cumulative reward versus the number of training rounds in an embodiment of the present invention.
Fig. 4 is a graph of the average cumulative reward versus the initial load of the vehicle-to-vehicle links in an embodiment of the present invention.
Fig. 5 is a graph of the average cumulative reward versus the initial load of the vehicle-to-infrastructure links in an embodiment of the present invention.
Detailed Description
With reference to fig. 1 and fig. 2, the information age-aware resource allocation method based on deep reinforcement learning provided by the invention specifically comprises the following steps:
step 1, inputting the Internet of Vehicles environment, and the base station initializing the parameters of its actor network and critic network;
step 2, in the current time slot, the base station first allocating channels and transmit power to all vehicle user pairs in the environment;
step 3, after the vehicle users and cellular users complete communication, updating the remaining loads and information ages of all links;
step 4, after the base station obtains the reward fed back by the environment, the base station sensing and collecting the current state information of the environment, while a buffer pool stores the sample data generated in this time slot;
step 5, when enough samples have accumulated, updating the parameters of the actor network and the critic network according to the iterative formulas of the trust region policy optimization algorithm, and emptying the buffer pool after the update completes;
and step 6, when the maximum number of steps of the training round is reached, ending the current round and starting the next round: inputting the Internet of Vehicles environment again and repeating steps 2 to 5.
In the invention, the discretized time slots are represented by the index j, where the duration of each time slot is τ; it is assumed that there are M vehicle user pairs and N cellular users in the system, corresponding to M vehicle-to-vehicle links and N vehicle-to-infrastructure links, respectively.
Further, in step 1 the Internet of Vehicles environment is input and the base station initializes the parameters of its actor network and critic network, wherein the Internet of Vehicles environment comprises:
(1) network model: the base station is located at the center of the map, has a sufficiently large coverage range, and can communicate with vehicles at any position on the map; each vehicle in the system is both a cellular user and a vehicle user: as a vehicle user, it needs to establish a vehicle-to-vehicle link with its corresponding target vehicle user to transmit traffic information; as a cellular user, it establishes a vehicle-to-infrastructure link with the base station to transmit its own user data;
(2) channel model: a finite number of sub-bands exist in the system, and both the channel power gain and the interference power gain consist of slow fading and fast fading, where the slow fading comprises path loss and shadow fading with both line-of-sight and non-line-of-sight cases, and the fast fading is Rayleigh fading;
(3) wireless transmission model: different vehicle-to-infrastructure links occupy fixed, mutually distinct channels, and the transmit power of the cellular users remains unchanged; define $c_m^j$ and $p_m^j$ as the channel and the transmit power, respectively, allocated by the base station to the m-th vehicle user pair at the beginning of time slot j; the maximum value of $p_m^j$ is denoted $P_{\max}$ and the minimum value $P_{\min}$.
Further, in step 2, in the current time slot, the base station first allocates channels and transmit power to all vehicle user pairs in the environment, specifically:
the base station acts as the agent and is responsible for allocating transmission channels and transmit power to all vehicle user pairs; the action at time slot j can therefore be expressed as:

$$a^j = \left\{ a_1^j, a_2^j, \ldots, a_M^j \right\}$$

where $a_m^j = \left( c_m^j, p_m^j \right)$ is the action of the m-th vehicle user pair at time slot j.
Further, in step 3, after the vehicle users and cellular users complete communication, the remaining loads and information ages of all links are updated, specifically:
(1) remaining load update: if the remaining load is fully and successfully transmitted in the current time slot, the remaining load is reset to the initial load; otherwise, the successfully transmitted load is subtracted from the remaining load;
(2) information age update: if the remaining load in the current time slot has been reset to the initial load, the information age is reset to its initial value τ; otherwise, τ is added to the information age.
Further, in step 4, after the base station obtains the reward fed back by the environment, it senses and collects the current state information of the environment, while the buffer pool stores the sample data generated in this time slot, specifically:
(1) reward
Define $\mathbf{c}^j = \left[ c_1^j, \ldots, c_M^j \right]$ as the transmission channel vector allocated by the base station to all vehicle user pairs at the beginning of time slot j, and $\mathbf{p}^j = \left[ p_1^j, \ldots, p_M^j \right]$ as the transmit power vector allocated by the base station to all vehicle user pairs at the beginning of time slot j; let $A_n^j$ and $A_m^j$ be the information ages of the n-th cellular user and of the m-th vehicle user pair at the end of time slot j, respectively.
To achieve the optimization goal of minimizing the sum of the information ages of all links and the power of the vehicle user pairs, define $U_m^A(j)$ as the information age utility function of the m-th vehicle user pair, $U_m^P(j)$ as the power utility function of the m-th vehicle user pair, and $U_n^A(j)$ as the information age utility function of the n-th cellular user.
Considering the different orders of magnitude of information age and power, the information age and the power consumption are each normalized within their utility functions so that they fall in the same value range.
The reward at time slot j can then be expressed as:

$$r^j = \sum_{m=1}^{M} r_m^j + \sum_{n=1}^{N} r_n^j$$

where $r_m^j$, the reward of the m-th vehicle user pair at time slot j, combines $U_m^A(j)$ and $U_m^P(j)$, and $r_n^j$, the reward of the n-th cellular user at time slot j, is built from $U_n^A(j)$; $\lambda_a$ and $\lambda_p$ are the weight coefficients of the information age part and the power part, respectively, and satisfy $\lambda_a + \lambda_p = 1$.
(2) state
The base station can observe and collect four kinds of state information: the channel states of all vehicle-to-vehicle and vehicle-to-infrastructure links in the current time slot; the interference power experienced by all vehicle-to-vehicle links under the different channel selections in the previous time slot; the remaining loads of all cellular users and vehicle user pairs; and the information ages of all cellular users and vehicle user pairs.
First, the state information associated with all vehicle user pairs at time slot j can be expressed as:

$$S_V^j = \left\{ S_{V,1}^j, S_{V,2}^j, \ldots, S_{V,M}^j \right\}$$

where $S_{V,m}^j$, the state information of the m-th vehicle user pair at time slot j, comprises: the channel state vector $\mathbf{h}_m^j$ of the m-th vehicle user pair over the L channels at time slot j; the interference power vector $\mathbf{I}_m^{j-1}$ received by the receiver of the m-th vehicle user pair over the L channel selections at time slot j-1; the remaining load $B_m^{j-1}$ of the m-th vehicle user pair at the end of time slot j-1; and the information age $A_m^{j-1}$ of the m-th vehicle user pair at the end of time slot j-1.
Second, the state information associated with all cellular users at time slot j can be expressed as:

$$S_C^j = \left\{ S_{C,1}^j, S_{C,2}^j, \ldots, S_{C,N}^j \right\}$$

where $S_{C,n}^j$, the state information of the n-th cellular user at time slot j, comprises: the channel state vector $\mathbf{h}_n^j$ of the n-th cellular user over the L channels at time slot j; the remaining load $B_n^{j-1}$ of the n-th cellular user at the end of time slot j-1; and the information age $A_n^{j-1}$ of the n-th cellular user at the end of time slot j-1.
The state at time slot j can then be expressed as:

$$s^j = \left\{ S_V^j, S_C^j \right\}$$
further, in step 5, when the number of samples is sufficient, the parameters of the operator network and the critical network are updated according to an iterative formula in the confidence domain policy optimization algorithm, and the buffer pool is emptied after the update is completed, specifically:
(1) actor network
In the confidence domain strategy optimization algorithm, an actor network can fit strategies and output high-dimensional actions and environment interaction as a strategy function;
the Actor network can ensure that the jackpot value available for a new policy is higher than the jackpot value available for an old policy by maximising the difference between the jackpot value obtained using the new policy and the jackpot value obtained using the old policy, i.e. the new policy is better than the old policy, thereby achieving the goal that the policies are always improving monotonically; meanwhile, in order to ensure the stability of the strategy updating process, the confidence domain strategy optimization algorithm introduces KL divergence constraint to prevent the strategy from changing greatly; thus, the optimization problem for an actor network can be expressed as:
$$\max_{\theta} \; \mathbb{E}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\pi_{\theta_{\mathrm{old}}}}(s, a) \right]$$

$$\mathrm{s.t.} \quad \mathbb{E}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\middle\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta$$

where θ is the policy parameter vector, a and s are the action vector and the state vector respectively, $\mathbb{E}[\cdot]$ denotes expectation over trajectories, $D_{\mathrm{KL}}(\cdot \| \cdot)$ is the KL divergence between two distributions, $\pi_{\theta_{\mathrm{old}}}$ is the not-yet-updated policy, $\pi_{\theta}$ is the updated policy, δ is the threshold on the expected KL divergence that defines the trust region, and $A_{\pi_{\theta_{\mathrm{old}}}}(s, a)$ is the advantage function, i.e. the advantage of action a over the average action in state s;
For this optimization problem, the natural policy gradient method is first used to simplify it, the conjugate gradient method is then used to avoid inverting the Fisher information matrix, and finally a backtracking line search is introduced, yielding the iterative equation of θ:

$$\theta_{k+1} = \theta_k + \alpha^{i} \sqrt{\frac{2\delta}{x^{\mathsf{T}} F(\theta_k)\, x}}\; x$$

where x is the solution of the system of linear equations $F(\theta)\, x = g(\theta)$, $F(\theta)$ is the Fisher information matrix, $g(\theta)$ is the gradient, $\alpha \in (0,1)$ is the step length of the actor network, and i is the first non-negative integer that simultaneously satisfies the KL divergence constraint and improves the policy;
(2) critic network
In the trust region policy optimization algorithm, the critic network fits the state value function: acting as the value function, it evaluates high-dimensional state inputs and guides the actor network;
the critic network improves the accuracy of its value prediction by minimizing the following loss function:

$$\mathrm{Loss}(w) = \left( r_t + \gamma V(s_{t+1}, w') - V(s_t, w) \right)^2$$

where w is the parameter vector of the critic evaluation network, w' is the parameter vector of the critic target network, the discount factor $\gamma \in [0,1)$ reflects the influence of future rewards on the cumulative reward, $V(s_t, w)$ is the state value given by the critic evaluation network at time t, and $V(s_{t+1}, w')$ is the state value given by the critic target network at time t+1;
this unconstrained nonlinear programming problem is solved with the L-BFGS method, yielding the iterative formula of w:

$$w_{k+1} = w_k - \rho^{\iota}\, D_k\, g_k$$

where $g_k$ is the gradient, ρ is the step size of the critic network, $D_k$ is an approximation of the inverse of the Hessian matrix, $w_1$ is set to a random initial point, and ι is the first non-negative integer that ensures a smooth update of the critic network parameter vector.
The invention is described in further detail below with reference to the figures and the embodiments.
Examples
One embodiment of the invention is described in detail below; the simulation uses Python programming, and the parameter settings do not affect generality. The proposed method is compared against: (1) a random dynamic resource allocation method; (2) an information age-aware dynamic resource allocation method based on a deep Q network.
TABLE 1 Primary simulation parameter values
(The main simulation parameter values appear as a table image in the original document.)
As shown in fig. 2, the Internet of Vehicles system model follows the Manhattan grid model in the 3GPP TR 36.885 standard. Table 1 lists the main simulation parameter values. During training, the number of rounds is set to 300 and the maximum number of steps per round to 50000. Before each round begins, each vehicle selects the vehicle closest to it as its target communication vehicle, thereby generating the vehicle user pairs. The initial positions of the vehicles are randomly distributed over the map, and the vehicle speeds lie in the interval [10 m/s, 15 m/s]. Further, when a vehicle reaches an intersection, it changes its driving direction with a certain probability (set to 0.64 here). In the trust region policy optimization algorithm, the actor network and the critic network are both fully connected neural networks comprising an input layer, an output layer, and three hidden layers; the numbers of neurons in the three hidden layers are 500, 250, and 120. In addition, the step length α of the actor network is set to 0.5, and the step size ρ of the critic network is also set to 0.5. It is worth noting that information freshness is crucial for road traffic safety; the simulation therefore sets the weight coefficient of the information age part larger than that of the power part, and both weight values are kept unchanged in the later performance comparisons.
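A sketch of these fully connected networks in PyTorch follows; the input and output dimensions are placeholders to be derived from the state and action spaces, and the softmax head for the actor is an assumption for a discrete channel/power action set.

```python
import torch.nn as nn

def mlp(in_dim, out_dim):
    """Fully connected network with the three hidden layers of the embodiment."""
    return nn.Sequential(
        nn.Linear(in_dim, 500), nn.ReLU(),
        nn.Linear(500, 250), nn.ReLU(),
        nn.Linear(250, 120), nn.ReLU(),
        nn.Linear(120, out_dim),
    )

# Hypothetical sizes: state_dim from s^j, action_dim from the channel/power choices.
state_dim, action_dim = 64, 8
actor = nn.Sequential(mlp(state_dim, action_dim), nn.Softmax(dim=-1))  # policy over actions
critic = mlp(state_dim, 1)                                             # state value V(s)
```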
As shown in fig. 3, compared with the baseline methods, the proposed method stably and monotonically improves the policy at each iteration of the actor network and the critic network, so its converged average cumulative reward is higher, it converges faster, and its performance is more stable.
As shown in fig. 4 and fig. 5, the performance of the proposed method remains significantly better than the baseline methods as the initial loads of both the vehicle-to-vehicle links and the vehicle-to-infrastructure links vary. The reason is that the method learns quickly and stably thanks to its actor-critic framework and its choice of gradient update step sizes.
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are given by way of illustration of the principles of the present invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications are within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (7)

1. An information age-aware resource allocation method based on deep reinforcement learning, characterized by comprising the following steps:
step 1, inputting the Internet of Vehicles environment, and the base station initializing the parameters of its actor network and critic network;
step 2, in the current time slot, the base station first allocating channels and transmit power to all vehicle user pairs in the environment;
step 3, after the vehicle users and cellular users complete communication, updating the remaining loads and information ages of all links;
step 4, after the base station obtains the reward fed back by the environment, the base station sensing and collecting the current state information of the environment, while a buffer pool stores the sample data generated in this time slot;
step 5, when enough samples have accumulated, updating the parameters of the actor network and the critic network according to the iterative formulas of the trust region policy optimization algorithm, and emptying the buffer pool after the update completes;
and step 6, when the maximum number of steps of the training round is reached, ending the current round and starting the next round: inputting the Internet of Vehicles environment again and repeating steps 2 to 5.
2. The method of claim 1, wherein the discretized time slots are represented by the index j, where the duration of each time slot is τ; it is assumed that there are M vehicle user pairs and N cellular users in the system, corresponding to M vehicle-to-vehicle links and N vehicle-to-infrastructure links, respectively.
3. The method according to claim 2, wherein in step 1 the Internet of Vehicles environment is input and the base station initializes the parameters of its own actor network and critic network, the Internet of Vehicles environment comprising:
(1) network model: the base station is located at the center of the map, has a sufficiently large coverage range, and can communicate with vehicles at any position on the map; each vehicle in the system is both a cellular user and a vehicle user: as a vehicle user, it needs to establish a vehicle-to-vehicle link with its corresponding target vehicle user to transmit traffic information; as a cellular user, it establishes a vehicle-to-infrastructure link with the base station to transmit its own user data;
(2) channel model: a finite number of sub-bands exist in the system, and both the channel power gain and the interference power gain consist of slow fading and fast fading, where the slow fading comprises path loss and shadow fading with both line-of-sight and non-line-of-sight cases, and the fast fading is Rayleigh fading;
(3) wireless transmission model: different vehicle-to-infrastructure links occupy fixed, mutually distinct channels, and the transmit power of the cellular users remains unchanged; define $c_m^j$ and $p_m^j$ as the channel and the transmit power, respectively, allocated by the base station to the m-th vehicle user pair at the beginning of time slot j; the maximum value of $p_m^j$ is denoted $P_{\max}$ and the minimum value $P_{\min}$.
4. The method as claimed in claim 2 or 3, wherein in the current time slot the base station first allocates channels and transmit power to all vehicle user pairs in the environment, specifically:
the base station acts as the agent and is responsible for allocating transmission channels and transmit power to all vehicle user pairs; the action at time slot j can therefore be expressed as:

$$a^j = \left\{ a_1^j, a_2^j, \ldots, a_M^j \right\}$$

where $a_m^j = \left( c_m^j, p_m^j \right)$ is the action of the m-th vehicle user pair at time slot j.
5. The information age-aware resource allocation method based on deep reinforcement learning according to claim 4, wherein in step 3, after the vehicle users and cellular users complete communication, the remaining loads and information ages of all links are updated, specifically:
(1) remaining load update: if the remaining load is fully and successfully transmitted in the current time slot, the remaining load is reset to the initial load; otherwise, the successfully transmitted load is subtracted from the remaining load;
(2) information age update: if the remaining load in the current time slot has been reset to the initial load, the information age is reset to its initial value τ; otherwise, τ is added to the information age.
6. The method according to claim 5, wherein in step 4, after the base station obtains the reward fed back by the environment, it senses and collects the current state information of the environment, and the buffer pool stores the sample data generated in this time slot, specifically:
(1) reward
Define $\mathbf{c}^j = \left[ c_1^j, \ldots, c_M^j \right]$ as the transmission channel vector allocated by the base station to all vehicle user pairs at the beginning of time slot j, and $\mathbf{p}^j = \left[ p_1^j, \ldots, p_M^j \right]$ as the transmit power vector allocated by the base station to all vehicle user pairs at the beginning of time slot j; let $A_n^j$ and $A_m^j$ be the information ages of the n-th cellular user and of the m-th vehicle user pair at the end of time slot j, respectively.
To achieve the optimization goal of minimizing the sum of the information ages of all links and the power of the vehicle user pairs, define $U_m^A(j)$ as the information age utility function of the m-th vehicle user pair, $U_m^P(j)$ as the power utility function of the m-th vehicle user pair, and $U_n^A(j)$ as the information age utility function of the n-th cellular user.
Considering the different orders of magnitude of information age and power, the information age and the power consumption are each normalized within their utility functions so that they fall in the same value range.
The reward at time slot j can then be expressed as:

$$r^j = \sum_{m=1}^{M} r_m^j + \sum_{n=1}^{N} r_n^j$$

where $r_m^j$, the reward of the m-th vehicle user pair at time slot j, combines $U_m^A(j)$ and $U_m^P(j)$, and $r_n^j$, the reward of the n-th cellular user at time slot j, is built from $U_n^A(j)$; $\lambda_a$ and $\lambda_p$ are the weight coefficients of the information age part and the power part, respectively, and satisfy $\lambda_a + \lambda_p = 1$;
(2) state
the base station can observe and collect four kinds of state information: the channel states of all vehicle-to-vehicle and vehicle-to-infrastructure links in the current time slot; the interference power experienced by all vehicle-to-vehicle links under the different channel selections in the previous time slot; the remaining loads of all cellular users and vehicle user pairs; and the information ages of all cellular users and vehicle user pairs;
first, the state information associated with all vehicle user pairs at time slot j can be expressed as:

$$S_V^j = \left\{ S_{V,1}^j, S_{V,2}^j, \ldots, S_{V,M}^j \right\}$$

where $S_{V,m}^j$, the state information of the m-th vehicle user pair at time slot j, comprises: the channel state vector $\mathbf{h}_m^j$ of the m-th vehicle user pair over the L channels at time slot j; the interference power vector $\mathbf{I}_m^{j-1}$ received by the receiver of the m-th vehicle user pair over the L channel selections at time slot j-1; the remaining load $B_m^{j-1}$ of the m-th vehicle user pair at the end of time slot j-1; and the information age $A_m^{j-1}$ of the m-th vehicle user pair at the end of time slot j-1;
second, the state information associated with all cellular users at time slot j can be expressed as:

$$S_C^j = \left\{ S_{C,1}^j, S_{C,2}^j, \ldots, S_{C,N}^j \right\}$$

where $S_{C,n}^j$, the state information of the n-th cellular user at time slot j, comprises: the channel state vector $\mathbf{h}_n^j$ of the n-th cellular user over the L channels at time slot j; the remaining load $B_n^{j-1}$ of the n-th cellular user at the end of time slot j-1; and the information age $A_n^{j-1}$ of the n-th cellular user at the end of time slot j-1;
the state at time slot j can then be expressed as:

$$s^j = \left\{ S_V^j, S_C^j \right\}$$
7. The method as claimed in claim 1, wherein in step 5, when enough samples have accumulated, the parameters of the actor network and the critic network are updated according to the iterative formulas of the trust region policy optimization algorithm, and the buffer pool is emptied after the update completes, specifically:
(1) actor network
in the trust region policy optimization algorithm, the actor network fits the policy: acting as the policy function, it takes high-dimensional input and outputs the actions that interact with the environment;
the actor network maximizes the difference between the cumulative reward obtained with the new policy and that obtained with the old policy, which guarantees that the cumulative reward attainable by the new policy is higher than that of the old policy, i.e. the new policy is better than the old policy, so that the policy always improves monotonically; meanwhile, to keep the policy update process stable, the trust region policy optimization algorithm introduces a KL divergence constraint that prevents large policy changes; thus, the optimization problem of the actor network can be expressed as:
$$\max_{\theta} \; \mathbb{E}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\pi_{\theta_{\mathrm{old}}}}(s, a) \right]$$

$$\mathrm{s.t.} \quad \mathbb{E}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\middle\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta$$

where θ is the policy parameter vector, a and s are the action vector and the state vector respectively, $\mathbb{E}[\cdot]$ denotes expectation over trajectories, $D_{\mathrm{KL}}(\cdot \| \cdot)$ is the KL divergence between two distributions, $\pi_{\theta_{\mathrm{old}}}$ is the not-yet-updated policy, $\pi_{\theta}$ is the updated policy, δ is the threshold on the expected KL divergence that defines the trust region, and $A_{\pi_{\theta_{\mathrm{old}}}}(s, a)$ is the advantage function, i.e. the advantage of action a over the average action in state s;
for this optimization problem, the natural policy gradient method is first used to simplify it, the conjugate gradient method is then used to avoid inverting the Fisher information matrix, and finally a backtracking line search is introduced, yielding the iterative equation of θ:

$$\theta_{k+1} = \theta_k + \alpha^{i} \sqrt{\frac{2\delta}{x^{\mathsf{T}} F(\theta_k)\, x}}\; x$$

where x is the solution of the system of linear equations $F(\theta)\, x = g(\theta)$, $F(\theta)$ is the Fisher information matrix, $g(\theta)$ is the gradient, $\alpha \in (0,1)$ is the step length of the actor network, and i is the first non-negative integer that simultaneously satisfies the KL divergence constraint and improves the policy;
(2) critic network
in the trust region policy optimization algorithm, the critic network fits the state value function: acting as the value function, it evaluates high-dimensional state inputs and guides the actor network;
the critic network improves the accuracy of its value prediction by minimizing the following loss function:

$$\mathrm{Loss}(w) = \left( r_t + \gamma V(s_{t+1}, w') - V(s_t, w) \right)^2$$

where w is the parameter vector of the critic evaluation network, w' is the parameter vector of the critic target network, the discount factor $\gamma \in [0,1)$ reflects the influence of future rewards on the cumulative reward, $V(s_t, w)$ is the state value given by the critic evaluation network at time t, and $V(s_{t+1}, w')$ is the state value given by the critic target network at time t+1;
this unconstrained nonlinear programming problem is solved with the L-BFGS method, and the obtained iterative formula of w is:

$$w_{k+1} = w_k - \rho^{\iota}\, D_k\, g_k$$

where $g_k$ is the gradient, ρ is the step size of the critic network, $D_k$ is an approximation of the inverse of the Hessian matrix, $w_1$ is set to a random initial point, and ι is the first non-negative integer that ensures a smooth update of the critic network parameter vector.
CN202210228341.9A 2022-03-08 2022-03-08 Information age perceivable resource allocation method based on deep reinforcement learning Active CN114630299B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210228341.9A CN114630299B (en) 2022-03-08 2022-03-08 Information age perceivable resource allocation method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210228341.9A CN114630299B (en) 2022-03-08 2022-03-08 Information age perceivable resource allocation method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114630299A true CN114630299A (en) 2022-06-14
CN114630299B CN114630299B (en) 2024-04-23

Family

ID=81899620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210228341.9A Active CN114630299B (en) 2022-03-08 2022-03-08 Information age perceivable resource allocation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114630299B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115001002A (en) * 2022-08-01 2022-09-02 广东电网有限责任公司肇庆供电局 Optimal scheduling method and system for solving energy storage participation peak clipping and valley filling
CN117896679A (en) * 2024-01-18 2024-04-16 西南交通大学 Message propagation method and device for self-adaptive key vehicle node selection

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109862610A (en) * 2019-01-08 2019-06-07 华中科技大学 D2D user resource allocation method based on the deep reinforcement learning DDPG algorithm
US20200398859A1 (en) * 2019-06-20 2020-12-24 Cummins Inc. Reinforcement learning control of vehicle systems
US20210123757A1 (en) * 2019-10-24 2021-04-29 Lg Electronics Inc. Method and apparatus for managing vehicle's resource in autonomous driving system
WO2021135554A1 (en) * 2019-12-31 2021-07-08 歌尔股份有限公司 Method and device for planning global path of unmanned vehicle
CN113438315A (en) * 2021-07-02 2021-09-24 中山大学 Internet of things information freshness optimization method based on dual-network deep reinforcement learning
US20220015068A1 (en) * 2020-07-09 2022-01-13 Qualcomm Incorporated Enhancements for improved cv2x scheduling and performance

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109862610A (en) * 2019-01-08 2019-06-07 华中科技大学 D2D user resource allocation method based on the deep reinforcement learning DDPG algorithm
US20200398859A1 (en) * 2019-06-20 2020-12-24 Cummins Inc. Reinforcement learning control of vehicle systems
US20210123757A1 (en) * 2019-10-24 2021-04-29 Lg Electronics Inc. Method and apparatus for managing vehicle's resource in autonomous driving system
WO2021135554A1 (en) * 2019-12-31 2021-07-08 歌尔股份有限公司 Method and device for planning global path of unmanned vehicle
US20220015068A1 (en) * 2020-07-09 2022-01-13 Qualcomm Incorporated Enhancements for improved cv2x scheduling and performance
CN113438315A (en) * 2021-07-02 2021-09-24 中山大学 Internet of things information freshness optimization method based on dual-network deep reinforcement learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MOHAMMAD PARVINI et al.: "AoI-Aware Resource Allocation for Platoon-Based C-V2X Networks via Multi-Agent Multi-Task Reinforcement Learning", IEEE, 10 May 2021 (2021-05-10) *
PENG Nuoheng (彭诺蘅): "Research on dynamic resource allocation in wireless ad hoc networks based on reinforcement learning" (基于强化学习的无线自组织网络动态资源分配研究), Master's thesis, Nanjing University of Science and Technology, 22 August 2023 (2023-08-22) *
LI Ziheng; MENG Chao: "Wireless network resource allocation algorithm based on deep reinforcement learning" (基于深度强化学习的无线网络资源分配算法), Communications Technology, no. 08, 10 August 2020 (2020-08-10) *
XIONG Ke; HU Huimin; AI Bo; ZHANG Yu; PEI Li: "Information-freshness-first wireless network design for the 6G era" (6G时代信息新鲜度优先的无线网络设计), Chinese Journal on Internet of Things, no. 01, 30 March 2020 (2020-03-30) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115001002A (en) * 2022-08-01 2022-09-02 广东电网有限责任公司肇庆供电局 Optimal scheduling method and system for solving energy storage participation peak clipping and valley filling
CN115001002B (en) * 2022-08-01 2022-12-30 广东电网有限责任公司肇庆供电局 Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling
CN117896679A (en) * 2024-01-18 2024-04-16 西南交通大学 Message propagation method and device for self-adaptive key vehicle node selection
CN117896679B (en) * 2024-01-18 2024-07-16 西南交通大学 Message propagation method and device for self-adaptive key vehicle node selection

Also Published As

Publication number Publication date
CN114630299B (en) 2024-04-23

Similar Documents

Publication Publication Date Title
CN114389678B (en) Multi-beam satellite resource allocation method based on decision performance evaluation
Luo et al. Dynamic resource allocations based on Q-learning for D2D communication in cellular networks
CN114630299A (en) Information age-perceptible resource allocation method based on deep reinforcement learning
Wu et al. Load balance guaranteed vehicle-to-vehicle computation offloading for min-max fairness in VANETs
Vu et al. Multi-agent reinforcement learning for channel assignment and power allocation in platoon-based C-V2X systems
CN109819422B (en) Stackelberg game-based heterogeneous Internet of vehicles multi-mode communication method
CN112512121A (en) Radio frequency spectrum dynamic allocation method and device based on reinforcement learning algorithm
Şahin et al. Reinforcement learning scheduler for vehicle-to-vehicle communications outside coverage
CN115866787A (en) Network resource allocation method integrating terminal direct transmission communication and multi-access edge calculation
Liang et al. Multi-agent reinforcement learning for spectrum sharing in vehicular networks
Ji et al. Multi-agent reinforcement learning resources allocation method using dueling double deep Q-network in vehicular networks
CN113423087B (en) Wireless resource allocation method facing vehicle queue control requirement
Huang et al. Delay-oriented knowledge-driven resource allocation in sagin-based vehicular networks
Yang et al. Task-driven semantic-aware green cooperative transmission strategy for vehicular networks
Ren et al. Joint spectrum allocation and power control in vehicular communications based on dueling double DQN
CN117412391A (en) Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method
CN117354833A (en) Cognitive Internet of things resource allocation method based on multi-agent reinforcement learning algorithm
CN112750298A (en) Truck formation dynamic resource allocation method based on SMDP and DRL
Li et al. A Lightweight Transmission Parameter Selection Scheme Using Reinforcement Learning for LoRaWAN
Zhao et al. Multi-agent deep reinforcement learning based resource management in heterogeneous V2X networks
Lyu et al. Service-driven resource management in vehicular networks based on deep reinforcement learning
CN115551065A (en) Internet of vehicles resource allocation method based on multi-agent deep reinforcement learning
Chen et al. Caching in narrow-band burst-error channels via meta self-supervision learning
CN114916087A (en) Dynamic spectrum access method based on India buffet process in VANET system
CN113890653A (en) Multi-agent reinforcement learning power distribution method for multi-user benefits

Legal Events

Date Code Title Description
PB01 - Publication
SE01 - Entry into force of request for substantive examination
GR01 - Patent grant