CN113365312A - Mobile load balancing method combining reinforcement learning and supervised learning - Google Patents

Mobile load balancing method combining reinforcement learning and supervised learning Download PDF

Info

Publication number
CN113365312A
Authority
CN
China
Prior art keywords
network
base station
action
value
state
Prior art date
Legal status
Granted
Application number
CN202110689823.XA
Other languages
Chinese (zh)
Other versions
CN113365312B (en)
Inventor
潘志文
姚猛
刘楠
尤肖虎
Current Assignee
Southeast University
Network Communication and Security Zijinshan Laboratory
Original Assignee
Southeast University
Network Communication and Security Zijinshan Laboratory
Priority date
Filing date
Publication date
Application filed by Southeast University, Network Communication and Security Zijinshan Laboratory filed Critical Southeast University
Priority to CN202110689823.XA priority Critical patent/CN113365312B/en
Publication of CN113365312A publication Critical patent/CN113365312A/en
Application granted granted Critical
Publication of CN113365312B publication Critical patent/CN113365312B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W28/00Network traffic management; Network resource management
    • H04W28/02Traffic management, e.g. flow control or congestion control
    • H04W28/08Load balancing or load distribution
    • H04W28/086Load balancing or load distribution among access entities
    • H04W28/0861Load balancing or load distribution among access entities between base stations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W28/00Network traffic management; Network resource management
    • H04W28/02Traffic management, e.g. flow control or congestion control
    • H04W28/08Load balancing or load distribution
    • H04W28/09Management thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a mobile load balancing method combining reinforcement learning and supervised learning, which comprises the following steps: first, parameters are initialized; then the round loop starts: the initial state of the system is obtained, the inner step loop of each round runs the reinforcement learning process, and when the step loop ends the method moves on to the next round until the set maximum number of rounds is reached. After the loop ends, the data pool is updated and sampled and the parameters of each actual execution network are updated. Finally, the state of each base station in the system is used as the input of its actual execution network, the output of the network, namely the cell offset value of each base station, is obtained and applied to each base station in the system, and users are handed over according to the A3 event, thereby redistributing users among base stations, reducing the load of overloaded cells and achieving load balance of the system. The invention has higher stability and load balancing capability, as well as better generalization and migration capability.

Description

Mobile load balancing method combining reinforcement learning and supervised learning
Technical Field
The invention belongs to the technical field of wireless networks in mobile communication, relates to a mobile load balancing optimization method, and particularly relates to a mobile load balancing method combining reinforcement learning and supervised learning.
Background
The mobile load balancing technology is an important technology for realizing load balancing among wireless network cells. It transfers users between base stations by adjusting the cell individual offset parameters, thereby achieving load balancing. Reinforcement learning has been applied to the mobile load balancing problem: mobile load balancing based on single-agent or multi-agent reinforcement learning can balance load by adjusting appropriate cell individual offset parameters. However, the single-agent approach incurs substantial signaling overhead from load information exchange, while the multi-agent approach achieves distributed execution but at a high training time cost. Therefore, a suitable mobile load balancing method is needed that avoids the disadvantages of both approaches.
Disclosure of Invention
The purpose of the invention is as follows: to solve the above problems, the invention provides a mobile load balancing method combining reinforcement learning and supervised learning, which is divided into a reinforcement learning stage and a supervised learning stage. In the reinforcement learning stage, the agent is trained with a prioritized experience pool, a normalized reward function and a reward function prediction network; in the supervised learning stage, the action network trained in the reinforcement learning stage is used to train a plurality of actual execution networks through supervised learning, and the cell offset values of all base stations are jointly adjusted to realize load balancing.
The technical scheme is as follows: in order to achieve the above object, the mobile load balancing method combining reinforcement learning and supervised learning of the present invention comprises the following steps:
step 1: initialize parameters, including the learning rate α, the discount factor γ, the soft update factor τ, the mini-batch size K, the prioritized experience pool capacity M, the handover hysteresis parameter Hyst, the number of physical resource blocks N_PRB, the output action value range [O_cmin, O_cmax] of the action network, where O_cmin and O_cmax are the lower and upper bounds of the cell offset, respectively, and the carrier frequency f_c;
step 2: the system initial state acquisition and reinforcement learning process comprises the following steps:
step 2.1, acquiring an initial state from the initialized system environment;
step 2.2, after the initial state is obtained, starting the reinforcement learning process;
step 3: determine whether the current step count has reached the set maximum number of steps; if not, go to step 2.2; otherwise, go to step 4;
step 4: determine whether the current round count has reached the set maximum number of rounds; if not, go to step 2; otherwise, go to step 5;
step 5: use the trained action network as a label generation network, and input all stored states into the label generation network to obtain the corresponding action labels;
step 6: store each state vector and its corresponding action label in a new data pool;
step 7: randomly shuffle the storage order of all state-action label pairs in the new data pool;
step 8: sequentially sample K samples from the new data pool; when fewer than K samples remain, sample all of the remaining data;
step 9: train each actual execution network with a supervised learning method; for each actual execution network, obtain the corresponding local action from the sampled local state and compute the mean square error against the action label, which is used to update the parameters of that actual execution network;
step 10: if the mean square error of each actual execution network has not yet fallen within a suitable error range, the error range being determined by the performance requirement, go to step 8; otherwise, go to step 11;
step 11: use the state of each base station in the system as the input of its corresponding actual execution network to obtain the output of that network, namely the cell offset value of each base station, apply the offsets to the base stations in the system, and hand over users according to the A3 event defined by the radio resource control protocol, thereby redistributing users among base stations, reducing the load of overloaded cells and achieving load balance of the system; a sketch of the overall two-stage flow is given below.
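For orientation, the two-stage procedure of steps 1-11 can be summarized in Python-style pseudocode. The sketch below is illustrative only: the environment, agent and execution-network interfaces (env, agent, exec_nets, net.fit) are hypothetical placeholders, not part of the patent.

```python
import random

def mobile_load_balancing_training(env, agent, exec_nets, max_rounds, max_steps, K):
    """Illustrative outline of the two-stage method; all interfaces are placeholders."""
    # Stage 1: reinforcement learning (steps 2-4)
    visited_states = []
    for round_idx in range(max_rounds):                     # step 4: round loop
        state = env.reset()                                 # step 2.1: initial system state
        for step_idx in range(max_steps):                   # step 3: step loop within a round
            action = agent.select_action(state)             # step 2.2.1
            next_state, reward = env.step(action)           # step 2.2.2: apply cell offsets
            agent.learn(state, action, reward, next_state)  # steps 2.2.3-2.2.13
            visited_states.append(state)
            state = next_state

    # Stage 2: supervised learning (steps 5-10)
    labels = [agent.actor(s) for s in visited_states]       # step 5: label generation network
    pool = list(zip(visited_states, labels))                # step 6: new data pool
    random.shuffle(pool)                                    # step 7: shuffle state-label pairs
    for net in exec_nets:                                   # steps 8-10: per-base-station nets
        net.fit(pool, batch_size=K, loss="mse")

    # Step 11: each base station feeds its own state to its execution network and
    # applies the resulting cell offset; users are handed over via the A3 event.
    return exec_nets
```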
Wherein,
step 2.1, obtaining an initial state from the initialized system environment, including the following processes:
step 2.1.1, calculating the signal-to-interference-and-noise ratio of the user; signal-to-interference-and-noise ratio of user u serving base station b at time t
is

$$\mathrm{SINR}_{u,b}^{t} = \frac{P_b\, h_{u,b}^{t}}{\sum_{x\in\mathcal{I}} P_x\, h_{u,x}^{t} + N_0}$$

where $P_b$ and $P_x$ are the transmit powers of the serving base station b and the interfering base station x, respectively, $\mathcal{I}$ is the set of interfering base stations, $h_{u,b}^{t}$ and $h_{u,x}^{t}$ are the channel gains from user u to the serving base station b and to the interfering base station x at time t, respectively, and $N_0$ is the noise power;
step 2.1.2, calculating the maximum data transmission rate of each user on a physical resource block PRB, wherein the maximum data transmission rate of the user u at the moment t on the physical resource block
is defined as

$$C_{u}^{t} = B_{\mathrm{PRB}}\log_2\!\left(1+\mathrm{SINR}_{u,b}^{t}\right)$$

where $B_{\mathrm{PRB}}$ is the bandwidth of one physical resource block;
step 2.1.3, calculating the number of physical resource blocks required by each user; the number of physical resource blocks required by user u at time t
is

$$N_{u}^{t} = \left\lceil \frac{d_{u}^{t}}{C_{u}^{t}} \right\rceil$$

where $d_{u}^{t}$ denotes the desired rate of user u at time t and $\lceil\cdot\rceil$ denotes the ceiling operation;
step 2.1.4, calculating the load of each base station; load of serving base station b at time t
is defined as

$$\rho_{b}^{t} = \frac{1}{N_b}\sum_{u\in\mathcal{U}_b^{t}} N_{u}^{t}$$

where $N_b$ is the total number of physical resource blocks of the serving base station b and $\mathcal{U}_b^{t}$ is the set of users of serving base station b at time t;
step 2.1.5, obtaining the initial state from the state space $\mathcal{S}$: the state space consists of the de-meaned resource utilization of each base station and the edge user proportion of all base stations, and the state $s_t \in \mathcal{S}$ is defined as

$$s_t = \left[\tilde{\rho}_1^{t},\ \tilde{\rho}_2^{t},\ \ldots,\ \tilde{\rho}_N^{t},\ e^{t}\right]$$

where $\tilde{\rho}_j^{t} = \rho_j^{t} - \bar{\rho}^{t}$ is the de-meaned resource utilization of the jth base station at time t, $\bar{\rho}^{t} = \frac{1}{N}\sum_{j=1}^{N}\rho_j^{t}$ is the average resource utilization of all base stations at time t, $\rho_j^{t}$ is the load of the jth base station at time t, j = 1, …, N, N is the total number of base stations in the system, and $e^{t}$ is the edge user proportion of all base stations at time t, defined by the A4 event specified in the radio resource control protocol RRC.
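To make steps 2.1.1 to 2.1.5 concrete, a minimal numpy sketch of the load and state computation is given below. Linear-scale powers and channel gains, the array shapes, the default PRB bandwidth and count (taken from the embodiment), and the use of a single scalar edge-user proportion are assumptions for illustration, not specifications from the patent.

```python
import numpy as np

def base_station_loads(P_tx, gains, serving, demand_bps, N0, B_prb=180e3, N_prb=111):
    """Compute per-base-station loads rho_b^t (steps 2.1.1-2.1.4).

    P_tx:       (num_bs,) transmit powers in watts
    gains:      (num_users, num_bs) channel gains h_{u,b}^t (linear scale)
    serving:    (num_users,) index of each user's serving base station
    demand_bps: (num_users,) desired rate d_u^t of each user
    """
    num_users, num_bs = gains.shape
    rx = gains * P_tx[None, :]                        # received power from every BS
    sig = rx[np.arange(num_users), serving]           # serving-cell signal
    interf = rx.sum(axis=1) - sig                     # all other cells interfere
    sinr = sig / (interf + N0)                        # step 2.1.1
    rate_per_prb = B_prb * np.log2(1.0 + sinr)        # step 2.1.2
    prbs_needed = np.ceil(demand_bps / rate_per_prb)  # step 2.1.3
    loads = np.array([prbs_needed[serving == b].sum() / N_prb
                      for b in range(num_bs)])        # step 2.1.4
    return loads

def state_vector(loads, edge_user_fraction):
    """Step 2.1.5: de-meaned loads of all base stations plus the edge-user proportion."""
    return np.concatenate([loads - loads.mean(), [edge_user_fraction]])
```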
After the initial state is obtained in the step 2.2, the reinforcement learning process is started, and the method comprises the following processes:
step 2.2.1, selecting an action;
the action $a_t$ is defined as the cell offsets of the base stations:

$$a_t = \mu\!\left(s_t \mid \theta_a\right),\qquad a_t \in \mathcal{A}$$

where $s_t \in \mathcal{S}$ is the state of the system at time t, μ(·) is the deterministic policy of the estimated action network, and $\theta_a$ are the parameters of the estimated action network; the action space $\mathcal{A}$ of the agent is defined as

$$\mathcal{A} = \left\{\left(O_{c1}, O_{c2}, \ldots, O_{cN}\right)\right\}$$

where $O_{cj}$ is the cell offset value of the jth base station, $O_{cj} \in [O_{cmin}, O_{cmax}]$, and $O_{cmin}$ and $O_{cmax}$ are the lower and upper bounds of the cell offset, respectively; the cell offset values of the zth base station and the jth base station satisfy $O_{cz} - O_{cj} = O_{zj} = -O_{jz} = -(O_{cj} - O_{cz})$;
Step 2.2.2, the action $a_t$ interacts with the environment to obtain a reward value $r_t$, and the state $s_{t+1}$ of the system at the next time t+1 is observed; the reward $r_t$ is computed by a normalized reward function of the loads of all base stations, where $\rho_b^{t}$ is the load of serving base station b at time t and BS is the set of all base stations in the system;
step 2.2.3, calculate the time difference error $\delta_t$ of the state-action pair at time t:

$$\delta_t = r_t + \gamma\, Q'\!\left(s_{t+1}, \mu'\!\left(s_{t+1}\mid\theta_a'\right)\mid\theta_c'\right) - Q\!\left(s_t, a_t\mid\theta_c\right)$$

where Q(·) is the evaluation value output by the estimated evaluation network and $\theta_c$ are the parameters of the estimated evaluation network; Q'(·) is the evaluation value output by the target evaluation network, $s_{t+1}$ is the state of the system at time t+1, μ'(·) is the deterministic policy of the target action network, $\theta_a'$ and $\theta_c'$ are the parameters of the target action network and the target evaluation network, respectively, and γ is the discount factor;
step 2.2.4, store the sample $(s_t, a_t, r_t, s_{t+1})$ together with its initial priority in the prioritized experience pool, which is constructed as a binary tree; the initial sample priority is uniformly set to 1, and as the number of iterations increases, the initial priority of a newly stored sample is set to the maximum value obtained by comparison with $\delta_t$; then the state $s_t$ of the current system at time t transitions to the state $s_{t+1}$ at the next time t+1, and all experienced states are stored; if the number of samples in the prioritized experience pool has reached the mini-batch size K, go to step 2.2.5, otherwise go to step 2.2.1; a sketch of such a binary-tree experience pool is given below;
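The binary-tree experience pool of step 2.2.4 is commonly realized as a sum tree, where each leaf stores one sample priority and each internal node stores the sum of its children, so that sampling proportional to priority takes O(log M). The layout below is an assumption for illustration; the patent only specifies that a binary tree is used.

```python
import numpy as np

class SumTree:
    """Minimal binary sum-tree for the prioritized experience pool (step 2.2.4)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes + leaves
        self.data = [None] * capacity            # stored samples (s, a, r, s_next)
        self.write = 0
        self.size = 0

    def add(self, priority, sample):
        idx = self.write + self.capacity - 1     # leaf index for this slot
        self.data[self.write] = sample
        self.update(idx, priority)
        self.write = (self.write + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def update(self, idx, priority):
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx != 0:                          # propagate the change up to the root
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def sample(self, value):
        """Walk down from the root; 'value' is uniform in [0, total priority)."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    @property
    def total(self):
        return self.tree[0]                      # sum of all priorities
```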
2.2.5, sampling K samples from the priority experience pool according to the sample priority, wherein the samples with higher priorities have higher sampling probability;
step 2.2.6, for the sampled K samples, the probability that the ith sample is sampled is calculated according to the following formula:
$$P(i) = \frac{p_i^{\alpha'}}{\sum_{m} p_m^{\alpha'}}$$

where m denotes the sample index, $p_m$ is the priority of the mth sample, α' is the sampling priority impact factor, $p_i = |\delta_i| + \epsilon$ is the priority of the ith sample, $\delta_i$ is the time difference error of the ith sample, and ε is a small positive constant;
step 2.2.7, calculate the normalized importance sampling weight of the ith sample: $w_i = \left(P(i)/P_{\min}\right)^{-\beta}$, where $P_{\min}$ is the smallest sampling probability among all samples and β is a gradually changing convergence factor;
step 2.2.8, calculate the corrected reward value of the ith sample: $r_i' = r_i + \eta\, R\!\left(s_i, a_i \mid \theta_r\right)$, where η is a bias factor, $r_i$ is the reward value contained in the ith sample, R(·) is the mapping of the estimated reward function prediction network, $s_i$ is the state contained in the ith sample, $a_i$ is the action contained in the ith sample, and $\theta_r$ are the parameters of the estimated reward function prediction network;
step 2.2.9, use the corrected reward value $r_i'$ to update $\delta_i$ and $p_i$, and update the priorities of the K sampled samples according to $p_i$; a sketch of steps 2.2.5 to 2.2.9 is given below;
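Steps 2.2.5 to 2.2.9 can then be sketched on top of such a sum tree. The stratified sampling strategy, the default parameter values (mirroring the embodiment below) and the reward_net callable standing in for R(s, a | θ_r) are assumptions for illustration; the leaves are assumed to already store $p_i^{\alpha'}$ so that sampling is proportional to that quantity.

```python
import numpy as np

def sample_minibatch(tree, K=64, beta=0.01, eta=1e-4, reward_net=None):
    """Steps 2.2.5-2.2.8: priority-proportional sampling, importance-sampling
    weights and reward correction. Samples stored in the tree are (s, a, r, s_next)."""
    segment = tree.total / K
    idxs, priorities, batch = [], [], []
    for k in range(K):                              # stratified sampling (assumption)
        v = np.random.uniform(k * segment, (k + 1) * segment)
        idx, p, sample = tree.sample(v)
        idxs.append(idx); priorities.append(p); batch.append(sample)

    probs = np.array(priorities) / tree.total       # step 2.2.6: P(i)
    # step 2.2.7: (P(i)/P_min)^(-beta), with P_min approximated within the mini-batch
    weights = (probs / probs.min()) ** (-beta)

    corrected = []                                  # step 2.2.8: r'_i = r_i + eta * R(s_i, a_i)
    for (s, a, r, s_next) in batch:
        r_hat = reward_net(s, a) if reward_net is not None else 0.0
        corrected.append(r + eta * r_hat)
    return idxs, batch, weights, np.array(corrected)

def update_priorities(tree, idxs, td_errors, eps=1e-5, alpha_p=0.9):
    """Step 2.2.9: write p_i = (|delta_i| + eps)^alpha' back into the tree,
    so that subsequent sampling is proportional to p_i^alpha'."""
    for idx, delta in zip(idxs, td_errors):
        tree.update(idx, (abs(delta) + eps) ** alpha_p)
```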
step 2.2.10, update the parameters $\theta_c$ of the estimated evaluation network; the process is as follows: using the mini-batch gradient descent method, the loss function of the estimated evaluation network is

$$L(\theta_c) = \frac{1}{K}\sum_{i=1}^{K} w_i\left(r_i' + \gamma\, Q'\!\left(s_i', \mu'\!\left(s_i'\mid\theta_a'\right)\mid\theta_c'\right) - Q\!\left(s_i, a_i\mid\theta_c\right)\right)^2$$

where $s_i'$ is the next-time state contained in the ith sample, μ'(·) is the deterministic policy of the target action network, Q'(·) is the evaluation value output by the target evaluation network, and Q(·) is the evaluation value output by the estimated evaluation network; the gradient $\nabla_{\theta_c} L(\theta_c)$ is back-propagated to update the parameters $\theta_c$ of the estimated evaluation network; after the update is completed, go to step 2.2.11;
step 2.2.11, update the parameters $\theta_a$ of the estimated action network; the process is as follows: the gradient of the loss function of the estimated action network is

$$\nabla_{\theta_a} J \approx \frac{1}{K}\sum_{i=1}^{K} \nabla_{a} Q\!\left(s, a\mid\theta_c\right)\Big|_{s=s_i,\,a=\mu(s_i\mid\theta_a)}\; \nabla_{\theta_a}\mu\!\left(s\mid\theta_a\right)\Big|_{s=s_i}$$

where μ(·) is the deterministic policy of the estimated action network; the gradient $\nabla_{\theta_a} J$ is back-propagated to update the parameters $\theta_a$ of the estimated action network; after the update is completed, go to step 2.2.12;
step 2.2.12, update the parameters $\theta_r$ of the estimated reward function prediction network; the process is as follows: using the mini-batch gradient descent method, the loss function of the estimated reward function prediction network is

$$L(\theta_r) = \frac{1}{K}\sum_{i=1}^{K} \left(r_i + \gamma\, R'\!\left(s_i', \mu'\!\left(s_i'\mid\theta_a'\right)\mid\theta_r'\right) - R\!\left(s_i, a_i\mid\theta_r\right)\right)^2$$

where R(·) is the mapping of the estimated reward function prediction network, R'(·) is the mapping of the target reward function prediction network, μ'(·) is the deterministic policy of the target action network, and $\theta_r'$ are the parameters of the target reward function prediction network; the gradient $\nabla_{\theta_r} L(\theta_r)$ is back-propagated to update the parameters $\theta_r$ of the estimated reward function prediction network; after the update is completed, go to step 2.2.13;
step 2.2.13, soft update the parameters of the target action network, the target evaluation network and the target reward function prediction network:

$$\theta_c' \leftarrow \tau\,\theta_c + (1-\tau)\,\theta_c'$$

$$\theta_a' \leftarrow \tau\,\theta_a + (1-\tau)\,\theta_a'$$

$$\theta_r' \leftarrow \tau\,\theta_r + (1-\tau)\,\theta_r'$$

where $\theta_c'$, $\theta_a'$ and $\theta_r'$ are the parameters of the target evaluation network, the target action network and the target reward function prediction network, respectively, $\theta_c$, $\theta_a$ and $\theta_r$ are the parameters of the estimated evaluation network, the estimated action network and the estimated reward function prediction network, respectively, and τ is the soft update parameter.
In step 2.2.2, the reward value lies in the range [-1, 1].
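Putting steps 2.2.10 to 2.2.13 together, one parameter update can be sketched in PyTorch. Because the exact loss expressions appear only as images in the source, the targets used here follow the reconstruction above and form a hedged sketch, not the authors' reference implementation; the network objects, the optimizer dictionary and the tensor shapes (importance weights and corrected rewards of shape (K, 1)) are placeholders.

```python
import torch
import torch.nn.functional as F

def one_update(actor, critic, reward_net,
               target_actor, target_critic, target_reward_net,
               optimizers, batch, weights, corrected_r,
               gamma=0.99, tau=0.001):
    """One pass over steps 2.2.10-2.2.13; batch, weights and corrected_r are torch tensors."""
    s, a, r, s2 = batch          # states, actions, rewards, next states
    w = weights                  # importance-sampling weights w_i, shape (K, 1)
    rc = corrected_r             # corrected rewards r'_i, shape (K, 1)

    # step 2.2.10: estimated evaluation (critic) network
    with torch.no_grad():
        y = rc + gamma * target_critic(s2, target_actor(s2))
    critic_loss = (w * (y - critic(s, a)) ** 2).mean()
    optimizers["critic"].zero_grad(); critic_loss.backward(); optimizers["critic"].step()

    # step 2.2.11: estimated action (actor) network
    actor_loss = -critic(s, actor(s)).mean()
    optimizers["actor"].zero_grad(); actor_loss.backward(); optimizers["actor"].step()

    # step 2.2.12: estimated reward function prediction network
    with torch.no_grad():
        y_r = r + gamma * target_reward_net(s2, target_actor(s2))
    reward_loss = F.mse_loss(reward_net(s, a), y_r)
    optimizers["reward"].zero_grad(); reward_loss.backward(); optimizers["reward"].step()

    # step 2.2.13: soft update of the three target networks
    for net, target in ((critic, target_critic), (actor, target_actor),
                        (reward_net, target_reward_net)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
```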
Beneficial effects: compared with the prior art, the invention has the following advantages and beneficial effects:
The method requires no prior knowledge of the wireless environment and automatically learns the optimal mobile load balancing (MLB) strategy by exploring the environment. By adopting a normalized reward function, a prioritized experience pool and a reward function prediction network, the invention achieves higher stability and load balancing capability. At the same time, the supervised learning stage realizes the effect of distributed execution and avoids the high training cost of multi-agent reinforcement learning, which is significant in real network scenarios, and the method also has better generalization and migration capability.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following detailed description is only illustrative and not intended to limit the scope of the present invention.
The invention discloses a mobile load balancing method combining reinforcement learning and supervised learning, which comprises the following steps:
step 1: and (5) initializing. The present invention is illustrated with the following parameter settings as examples:
step 1.1, the invention is explained taking the following network initialization parameters as an example: the action network, the evaluation network and the reward function prediction network each contain two hidden layers with 400 and 300 neurons, respectively, and their learning rates α are all 10^-3. The discount factor γ is 0.99 and the soft update factor τ is 0.001. The mini-batch size K is 64, the prioritized experience pool capacity M is 10000, the positive constant ε is 1e-5, the bias factor η is 0.0001, and the sampling priority impact factor α' is 0.9. There are 7 actual execution networks, each containing five hidden layers with 600, 400, 400 and 300 neurons, respectively. The initial learning rate of each actual execution network is 10^-3. The capacity of the new data pool is 60000.
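A possible PyTorch realization of the networks sized in step 1.1 is sketched below. The hidden-layer widths come from the text, while the ReLU/Tanh activations and the output scaling to the cell-offset range are assumptions. The reward function prediction network can reuse the Critic shape, since it also maps a state-action pair to a scalar.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Action network with two hidden layers of 400 and 300 neurons (step 1.1).
    The Tanh output is scaled to the cell-offset bound o_max (assumed activation)."""
    def __init__(self, state_dim, action_dim, o_max=1.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),
        )
        self.o_max = o_max

    def forward(self, state):
        return self.o_max * self.net(state)

class Critic(nn.Module):
    """Evaluation network: same hidden sizes, takes (state, action) and outputs Q."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```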
Step 1.2, the present invention takes the following initialization parameters of the system environment as an exampleThe description is as follows: the system environment comprises 7 base stations distributed in an area of 60 meters by 60 meters. The area contains 200 users, and is in a free walking state, and the speed is between 1.0m/s and 1.2 m/s. It is assumed that each user is a guaranteed bit rate user and the bit rate is 128 kb/s. The transmit power of each base station is 20 dBm. The path loss modeling case is as follows: 38.3 × log (d) +17.3+24.9 × log (f) in the case of non-line of sightc) In the case of visual range, 17.3 × log (d) +32.4+20 × log (f)c) Wherein d is the distance from the user to the base station and is measured in meters; f. ofcFor the carrier frequency, 3.5GHz is chosen. The shadow fading is modeled as a log normal distribution with a mean of 0 and a standard deviation of 8 dB. Switching hysteresis parameter HystSet to 2 dB. Number N of Physical Resource Blocks (PRB) of each base stationPRB111, output action value range [ O ] of action networkcmin,Ocmax]In which O iscmin、 OcmaxRespectively, a cell bias lower bound value and an upper bound value, [ O ]cmin,Ocmax]Initialized to [ -1.5dB,1.5 dB)]。
Step 2: the system initial state acquisition and reinforcement learning process comprises the following steps:
step 2.1, obtaining an initial state from the initialized system environment, and comprising the following processes:
step 2.1.1, calculating the signal-to-interference-and-noise ratio of the user; signal-to-interference-and-noise ratio of user u serving base station b at time t
is

$$\mathrm{SINR}_{u,b}^{t} = \frac{P_b\, h_{u,b}^{t}}{\sum_{x\in\mathcal{I}} P_x\, h_{u,x}^{t} + N_0}$$

where $P_b$ and $P_x$ are the transmit powers of the user's serving base station b and interfering base station x, respectively, both 20 dBm in this example; $\mathcal{I}$ is the set of interfering base stations; $h_{u,b}^{t}$ and $h_{u,x}^{t}$ are the channel gains from user u to the serving base station b and to the interfering base station x at time t, respectively; and $N_0$ is the noise power, 3.9811 × 10^-13 in this example;
Step 2.1.2, calculating the maximum data transmission rate of each user on a physical resource block PRB, wherein the maximum data transmission rate of the user u at the moment t on the physical resource block
is defined as

$$C_{u}^{t} = B_{\mathrm{PRB}}\log_2\!\left(1+\mathrm{SINR}_{u,b}^{t}\right)$$

where $B_{\mathrm{PRB}}$ is the bandwidth of one physical resource block, 180 kHz in this example;
step 2.1.3, calculating the number of physical resource blocks required by each user; the number of physical resource blocks required by user u at time t
is

$$N_{u}^{t} = \left\lceil \frac{d_{u}^{t}}{C_{u}^{t}} \right\rceil$$

where $d_{u}^{t}$ denotes the desired rate of user u at time t, 128 kb/s in this example, and $\lceil\cdot\rceil$ denotes the ceiling operation;
step 2.1.4, calculating the load of each base station; load of serving base station b at time t
is the ratio of the total number of physical resource blocks required by all of its users to the total number of physical resource blocks of the base station, defined as

$$\rho_{b}^{t} = \frac{1}{N_b}\sum_{u\in\mathcal{U}_b^{t}} N_{u}^{t}$$

where $N_b$ is the total number of physical resource blocks of the serving base station b, 111 in this example, and $\mathcal{U}_b^{t}$ is the set of users of serving base station b at time t;
step 2.1.5, obtaining the initial state from the state space $\mathcal{S}$: the state space consists of the de-meaned resource utilization of each base station and the edge user proportion of all base stations, and the state $s_t \in \mathcal{S}$ is defined as

$$s_t = \left[\tilde{\rho}_1^{t},\ \tilde{\rho}_2^{t},\ \ldots,\ \tilde{\rho}_N^{t},\ e^{t}\right]$$

where $\tilde{\rho}_j^{t} = \rho_j^{t} - \bar{\rho}^{t}$ is the de-meaned resource utilization of the jth base station at time t, $\bar{\rho}^{t} = \frac{1}{N}\sum_{j=1}^{N}\rho_j^{t}$ is the average resource utilization of all base stations at time t, $\rho_j^{t}$ is the load of the jth base station at time t, N is the total number of base stations in the system, N = 7 in this example, and $e^{t}$ is the edge user proportion of all base stations at time t, defined by the A4 event specified in the Radio Resource Control (RRC) protocol;
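The edge-user proportion e^t used in the state can be obtained from per-user measurements. The sketch below counts a user as an edge user when the A4 entering condition ("neighbour becomes better than threshold") holds for at least one neighbour cell, i.e. the neighbour measurement minus hysteresis exceeds a threshold. The threshold value and the omission of cell-specific offsets are assumptions, not taken from the patent.

```python
import numpy as np

def edge_user_fraction(rsrp_dbm, serving, a4_threshold_dbm=-100.0, hysteresis_db=2.0):
    """Proportion of edge users over all base stations (used in the state of step 2.1.5).

    rsrp_dbm: (num_users, num_bs) measured RSRP of every cell at every user
    serving:  (num_users,) serving-cell index of each user
    """
    num_users, num_bs = rsrp_dbm.shape
    neighbours = rsrp_dbm.copy()
    neighbours[np.arange(num_users), serving] = -np.inf   # exclude the serving cell
    # simplified A4 entering condition: Mn - Hys > Thresh for the best neighbour
    is_edge = (neighbours.max(axis=1) - hysteresis_db > a4_threshold_dbm)
    return is_edge.mean()
```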
step 2.2, after the initial state is obtained, starting a reinforcement learning process, which comprises the following procedures:
step 2.2.1, selecting an action;
the action $a_t$ is defined as the cell offsets of the base stations:

$$a_t = \mu\!\left(s_t \mid \theta_a\right),\qquad a_t \in \mathcal{A}$$

where $s_t \in \mathcal{S}$ is the state of the system at time t, μ(·) is the deterministic policy of the estimated action network, and $\theta_a$ are the parameters of the estimated action network; the action space $\mathcal{A}$ of the agent is defined as

$$\mathcal{A} = \left\{\left(O_{c1}, O_{c2}, \ldots, O_{cN}\right)\right\}$$

where $O_{cj}$ is the cell offset value of the jth base station and $O_{cj} \in [O_{cmin}, O_{cmax}]$, with $O_{cmin}$ = -1.5 dB and $O_{cmax}$ = 1.5 dB in this example; the cell offset values of the zth base station and the jth base station satisfy $O_{cz} - O_{cj} = O_{zj} = -O_{jz} = -(O_{cj} - O_{cz})$;
Step 2.2.2, the action $a_t$ interacts with the environment to obtain a reward value $r_t$, and the state $s_{t+1}$ of the system at the next time t+1 is observed; the reward $r_t$ is computed by a normalized reward function of the loads of all base stations, where $\rho_b^{t}$ is the load of the serving base station b at time t and BS is the set of all base stations in the system; the reward value is designed to lie in the range [-1, 1], which benefits the convergence and stability of the method;
step 2.2.3, calculate the time difference error $\delta_t$ of the state-action pair at time t:

$$\delta_t = r_t + \gamma\, Q'\!\left(s_{t+1}, \mu'\!\left(s_{t+1}\mid\theta_a'\right)\mid\theta_c'\right) - Q\!\left(s_t, a_t\mid\theta_c\right)$$

where Q(·) is the evaluation value output by the estimated evaluation network and $\theta_c$ are the parameters of the estimated evaluation network, whose initial values are determined by system initialization; Q'(·) is the evaluation value output by the target evaluation network, $s_{t+1}$ is the state of the system at time t+1, μ'(·) is the deterministic policy of the target action network, $\theta_a'$ and $\theta_c'$ are the parameters of the target action network and the target evaluation network, respectively, whose initial values are determined by system initialization, and γ is the discount factor; the larger the absolute value of the time difference error $\delta_t$, the higher the priority;
step 2.2.4, store the sample $(s_t, a_t, r_t, s_{t+1})$ together with its initial priority in the prioritized experience pool, which is constructed as a binary tree; the initial sample priority is uniformly set to 1, and as the number of iterations increases, the initial priority of a newly stored sample is set to the maximum value obtained by comparison with $\delta_t$; then the current state $s_t$ transitions to the state $s_{t+1}$ at the next time t+1, and all experienced states are stored; if the number of samples in the prioritized experience pool has reached the mini-batch size K (K = 64 in this example), go to step 2.2.5, otherwise go to step 2.2.1;
2.2.5, selecting K samples from the priority experience pool according to the priority of the samples, wherein the higher the priority of the samples is, the higher the probability of sampling is, and K is 64 in the example;
step 2.2.6, for the sampled K samples, calculating the probability that the ith sample is sampled:
$$P(i) = \frac{p_i^{\alpha'}}{\sum_{m} p_m^{\alpha'}}$$

where m denotes the sample index, $p_m$ is the priority of the mth sample, α' is the sampling priority impact factor, 0.9 in this example, $p_i = |\delta_i| + \epsilon$ is the priority of the ith sample, $\delta_i$ is the time difference error of the ith sample, and ε is a small positive constant;
step 2.2.7, calculating the normalized importance sampling weight of the ith sample:
$$w_i = \left(\frac{M\,P(i)}{M\,P_{\min}}\right)^{-\beta} = \left(\frac{P(i)}{P_{\min}}\right)^{-\beta}$$

where $P_{\min}$ is the minimum sampling probability among all samples, M is the capacity of the prioritized experience pool, 10000 in this example, and β is a gradually changing convergence factor with initial value 0.01 in this example;
step 2.2.8, calculate the corrected reward value of the ith sample: $r_i' = r_i + \eta\, R\!\left(s_i, a_i \mid \theta_r\right)$, where η is a bias factor, 0.0001 in this example; $r_i$ is the reward value contained in the ith sample, R(·) is the mapping of the estimated reward function prediction network, $s_i$ is the state contained in the ith sample, $a_i$ is the action contained in the ith sample, and $\theta_r$ are the parameters of the estimated reward function prediction network;
step 2.2.9, update delta with the revised prize valueiAnd piAnd according to piUpdating the priorities of the sampled K samples;
step 2.2.10, update the parameters $\theta_c$ of the estimated evaluation network: using the mini-batch gradient descent method, the loss function of the estimated evaluation network is defined as

$$L(\theta_c) = \frac{1}{K}\sum_{i=1}^{K} w_i\left(r_i' + \gamma\, Q'\!\left(s_i', \mu'\!\left(s_i'\mid\theta_a'\right)\mid\theta_c'\right) - Q\!\left(s_i, a_i\mid\theta_c\right)\right)^2$$

where $s_i'$ is the next-time state contained in the ith sample, Q'(·) is the evaluation value output by the target evaluation network, and Q(·) is the evaluation value output by the estimated evaluation network; the gradient $\nabla_{\theta_c} L(\theta_c)$ is estimated with the mini-batch gradient descent method and back-propagated to update the parameters $\theta_c$ of the estimated evaluation network; after the update is completed, go to step 2.2.11;
step 2.2.11, update the parameters $\theta_a$ of the estimated action network: the gradient of the loss function of the estimated action network is

$$\nabla_{\theta_a} J \approx \frac{1}{K}\sum_{i=1}^{K} \nabla_{a} Q\!\left(s, a\mid\theta_c\right)\Big|_{s=s_i,\,a=\mu(s_i\mid\theta_a)}\; \nabla_{\theta_a}\mu\!\left(s\mid\theta_a\right)\Big|_{s=s_i}$$

where μ(·) is the deterministic policy of the estimated action network; the gradient $\nabla_{\theta_a} J$ is back-propagated to update the parameters $\theta_a$ of the estimated action network; after the update is completed, go to step 2.2.12;
step 2.2.12, update the parameters $\theta_r$ of the estimated reward function prediction network: using the mini-batch gradient descent method, the loss function of the estimated reward function prediction network is defined as

$$L(\theta_r) = \frac{1}{K}\sum_{i=1}^{K} \left(r_i + \gamma\, R'\!\left(s_i', \mu'\!\left(s_i'\mid\theta_a'\right)\mid\theta_r'\right) - R\!\left(s_i, a_i\mid\theta_r\right)\right)^2$$

where R(·) is the mapping of the estimated reward function prediction network, R'(·) is the mapping of the target reward function prediction network, μ'(·) is the deterministic policy of the target action network, and $\theta_r'$ are the parameters of the target reward function prediction network; the gradient $\nabla_{\theta_r} L(\theta_r)$ is estimated with the mini-batch gradient descent method and back-propagated to update the parameters $\theta_r$ of the estimated reward function prediction network; after the update is completed, go to step 2.2.13;
step 2.2.13, soft update the parameters of the target action network, the target evaluation network and the target reward function prediction network:

$$\theta_c' \leftarrow \tau\,\theta_c + (1-\tau)\,\theta_c'$$

$$\theta_a' \leftarrow \tau\,\theta_a + (1-\tau)\,\theta_a'$$

$$\theta_r' \leftarrow \tau\,\theta_r + (1-\tau)\,\theta_r'$$

where $\theta_c'$, $\theta_a'$ and $\theta_r'$ are the parameters of the target evaluation network, the target action network and the target reward function prediction network, respectively, $\theta_c$, $\theta_a$ and $\theta_r$ are the parameters of the estimated evaluation network, the estimated action network and the estimated reward function prediction network, respectively, and τ is the soft update parameter, 0.001 in this example;
step 3: determine whether the current step count has reached the set maximum number of steps; if not, go to step 2.2; otherwise, go to step 4; the maximum number of steps in this example is 100;
step 4: determine whether the current round count has reached the set maximum number of rounds; if not, go to step 2; otherwise, go to step 5; the maximum number of rounds in this example is 600;
step 5: use the trained action network as a label generation network, and input all stored states into the label generation network to obtain the corresponding action labels;
step 6: store each state vector and its corresponding action label in a new data pool; the new data pool capacity is 60000 in this example;
step 7: randomly shuffle the storage order of all state-action label pairs in the new data pool;
step 8: sequentially sample K samples from the new data pool; when fewer than K samples remain, sample all of the remaining data;
step 9: train each actual execution network with a supervised learning method; for each actual execution network, obtain the corresponding local action from the sampled local state and compute the mean square error against the action label, which is used to update the parameters of that actual execution network;
step 10: if the mean square error of each actual execution network has not yet fallen within a suitable error range, the error range being determined by the performance requirement, go to step 8; otherwise, go to step 11; a sketch of this supervised learning stage is given below;
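A compact PyTorch sketch of steps 5 to 10 follows. The local_slice callable (which should extract base station j's local view of the state), the Adam optimizer, the stopping threshold and the epoch cap are illustrative assumptions; the label network is the action network trained in the reinforcement learning stage.

```python
import random
import torch
import torch.nn as nn

def train_execution_networks(exec_nets, states, label_net, K=64, lr=1e-3,
                             target_mse=1e-3, max_epochs=200,
                             local_slice=lambda s, j: s):
    """Steps 5-10: supervised training of the per-base-station execution networks.
    local_slice(s, j) should return base station j's local state; the identity
    default is only a placeholder."""
    with torch.no_grad():
        labels = label_net(states)                       # step 5: action labels
    pool = list(zip(states, labels))                     # step 6: new data pool
    random.shuffle(pool)                                 # step 7: shuffle pairs

    loss_fn = nn.MSELoss()
    for j, net in enumerate(exec_nets):                  # one network per base station
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for epoch in range(max_epochs):                  # step 10: repeat until MSE is small
            epoch_loss = 0.0
            for start in range(0, len(pool), K):         # step 8: mini-batches of size K
                batch = pool[start:start + K]            # the last batch may be smaller
                s = torch.stack([local_slice(x, j) for x, _ in batch])
                y = torch.stack([lbl[j:j + 1] for _, lbl in batch])
                loss = loss_fn(net(s), y)                # step 9: MSE against the labels
                opt.zero_grad(); loss.backward(); opt.step()
                epoch_loss += loss.item() * len(batch)
            if epoch_loss / len(pool) < target_mse:
                break
    return exec_nets
```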
step 11: use the state of each base station in the system as the input of its corresponding actual execution network to obtain the output of that network, namely the cell offset value of each base station, apply the offsets to the base stations in the system, and hand over users according to the A3 event defined by the RRC protocol, thereby redistributing users among base stations, reducing the load of overloaded cells and achieving load balance of the system.
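Finally, the user re-association of step 11 can be expressed with the A3 entering condition ("neighbour becomes offset better than serving"). The simplified condition below omits cell-specific frequency offsets and the a3-offset; it is a sketch of the handover rule under these assumptions, not the full RRC procedure.

```python
import numpy as np

def a3_handover_targets(rsrp_dbm, serving, offsets_db, hyst_db=2.0):
    """Step 11 sketch: a user switches when a neighbour's biased measurement exceeds
    the serving cell's biased measurement by more than the hysteresis.

    rsrp_dbm:   (num_users, num_bs) measured RSRP of every cell at every user
    serving:    (num_users,) current serving-cell index of each user
    offsets_db: (num_bs,) cell offsets O_cj produced by the execution networks
    """
    num_users, _ = rsrp_dbm.shape
    biased = rsrp_dbm + offsets_db[None, :]               # apply each cell's offset
    serving_val = biased[np.arange(num_users), serving]
    best = biased.argmax(axis=1)
    handover = biased.max(axis=1) > serving_val + hyst_db
    return np.where(handover, best, serving)              # new serving cell per user
```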
The method can effectively realize load balancing of the network, improve the robustness and stability of the network and reduce load fluctuation, while achieving the effect of distributed execution and avoiding the high training cost of multi-agent reinforcement learning.
It will be understood by those skilled in the art that, unless otherwise defined, all terms including technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The technical means disclosed in the invention are not limited to those disclosed in the above embodiments, but also include technical schemes formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims (4)

1. A mobile load balancing method combining reinforcement learning and supervised learning, characterized by comprising the following steps:
step 1: initialize parameters, including the learning rate α, the discount factor γ, the soft update factor τ, the mini-batch size K, the prioritized experience pool capacity M, the handover hysteresis parameter Hyst, the number of physical resource blocks N_PRB, the output action value range [O_cmin, O_cmax] of the action network, where O_cmin and O_cmax are the lower and upper bounds of the cell offset, respectively, and the carrier frequency f_c;
step 2: the system initial state acquisition and reinforcement learning process comprises the following steps:
step 2.1, acquiring an initial state from the initialized system environment;
step 2.2, after the initial state is obtained, starting the reinforcement learning process;
step 3: determine whether the current step count has reached the set maximum number of steps; if not, go to step 2.2; otherwise, go to step 4;
step 4: determine whether the current round count has reached the set maximum number of rounds; if not, go to step 2; otherwise, go to step 5;
step 5: use the trained action network as a label generation network, and input all stored states into the label generation network to obtain the corresponding action labels;
step 6: store each state vector and its corresponding action label in a new data pool;
step 7: randomly shuffle the storage order of all state-action label pairs in the new data pool;
step 8: sequentially sample K samples from the new data pool; when fewer than K samples remain, sample all of the remaining data;
step 9: train each actual execution network with a supervised learning method; for each actual execution network, obtain the corresponding local action from the sampled local state and compute the mean square error against the action label, which is used to update the parameters of that actual execution network;
step 10: if the mean square error of each actual execution network has not yet fallen within a suitable error range, the error range being determined by the performance requirement, go to step 8; otherwise, go to step 11;
step 11: use the state of each base station in the system as the input of its corresponding actual execution network to obtain the output of that network, namely the cell offset value of each base station, apply the offsets to the base stations in the system, and hand over users according to the A3 event defined by the radio resource control protocol, thereby redistributing users among base stations, reducing the load of overloaded cells and achieving load balance of the system.
2. The method for mobile load balancing with reinforcement learning and supervised learning combined according to claim 1, wherein the step 2.1 of obtaining the initial state from the initialized system environment includes the following steps:
step 2.1.1, calculating the signal-to-interference-and-noise ratio of each user; the signal-to-interference-and-noise ratio of user u served by base station b at time t is

$$\mathrm{SINR}_{u,b}^{t} = \frac{P_b\, h_{u,b}^{t}}{\sum_{x\in\mathcal{I}} P_x\, h_{u,x}^{t} + N_0}$$

where $P_b$ and $P_x$ are the transmit powers of the serving base station b and the interfering base station x, respectively, $\mathcal{I}$ is the set of interfering base stations, $h_{u,b}^{t}$ and $h_{u,x}^{t}$ are the channel gains from user u to the serving base station b and to the interfering base station x at time t, respectively, and $N_0$ is the noise power;
step 2.1.2, calculating the maximum data transmission rate of each user on a physical resource block PRB, wherein the maximum data transmission rate of the user u at the moment t on the physical resource block
is defined as

$$C_{u}^{t} = B_{\mathrm{PRB}}\log_2\!\left(1+\mathrm{SINR}_{u,b}^{t}\right)$$

where $B_{\mathrm{PRB}}$ is the bandwidth of one physical resource block;
step 2.1.3, calculating the number of physical resource blocks required by each user; the number of physical resource blocks required by user u at time t
is

$$N_{u}^{t} = \left\lceil \frac{d_{u}^{t}}{C_{u}^{t}} \right\rceil$$

where $d_{u}^{t}$ denotes the desired rate of user u at time t and $\lceil\cdot\rceil$ denotes the ceiling operation;
step 2.1.4, calculating the load of each base station; load of serving base station b at time t
is defined as

$$\rho_{b}^{t} = \frac{1}{N_b}\sum_{u\in\mathcal{U}_b^{t}} N_{u}^{t}$$

where $N_b$ is the total number of physical resource blocks of the serving base station b and $\mathcal{U}_b^{t}$ is the set of users of serving base station b at time t;
step 2.1.5, obtaining the initial state from the state space $\mathcal{S}$: the state space consists of the de-meaned resource utilization of each base station and the edge user proportion of all base stations, and the state $s_t \in \mathcal{S}$ is defined as

$$s_t = \left[\tilde{\rho}_1^{t},\ \tilde{\rho}_2^{t},\ \ldots,\ \tilde{\rho}_N^{t},\ e^{t}\right]$$

where $\tilde{\rho}_j^{t} = \rho_j^{t} - \bar{\rho}^{t}$ is the de-meaned resource utilization of the jth base station at time t, $\bar{\rho}^{t} = \frac{1}{N}\sum_{j=1}^{N}\rho_j^{t}$ is the average resource utilization of all base stations at time t, $\rho_j^{t}$ is the load of the jth base station at time t, j = 1, …, N, N is the total number of base stations in the system, and $e^{t}$ is the edge user proportion of all base stations at time t, defined by the A4 event specified in the radio resource control protocol RRC.
3. The method for balancing mobile load combining reinforcement learning and supervised learning according to claim 1, wherein after the initial state is obtained in step 2.2, the reinforcement learning process is started, and the method includes the following steps:
step 2.2.1, selecting an action;
the action $a_t$ is defined as the cell offsets of the base stations:

$$a_t = \mu\!\left(s_t \mid \theta_a\right),\qquad a_t \in \mathcal{A}$$

where $s_t \in \mathcal{S}$ is the state of the system at time t, μ(·) is the deterministic policy of the estimated action network, and $\theta_a$ are the parameters of the estimated action network; the action space $\mathcal{A}$ of the agent is defined as

$$\mathcal{A} = \left\{\left(O_{c1}, O_{c2}, \ldots, O_{cN}\right)\right\}$$

where $O_{cj}$ is the cell offset value of the jth base station, $O_{cj} \in [O_{cmin}, O_{cmax}]$, and $O_{cmin}$ and $O_{cmax}$ are the lower and upper bounds of the cell offset, respectively; the cell offset values of the zth base station and the jth base station satisfy $O_{cz} - O_{cj} = O_{zj} = -O_{jz} = -(O_{cj} - O_{cz})$;
Step 2.2.2, the action $a_t$ interacts with the environment to obtain a reward value $r_t$, and the state $s_{t+1}$ of the system at the next time t+1 is observed; the reward $r_t$ is computed by a normalized reward function of the loads of all base stations, where $\rho_b^{t}$ is the load of serving base station b at time t and BS is the set of all base stations in the system;
step 2.2.3, calculate the time difference error $\delta_t$ of the state-action pair at time t:

$$\delta_t = r_t + \gamma\, Q'\!\left(s_{t+1}, \mu'\!\left(s_{t+1}\mid\theta_a'\right)\mid\theta_c'\right) - Q\!\left(s_t, a_t\mid\theta_c\right)$$

where Q(·) is the evaluation value output by the estimated evaluation network and $\theta_c$ are the parameters of the estimated evaluation network; Q'(·) is the evaluation value output by the target evaluation network, $s_{t+1}$ is the state of the system at time t+1, μ'(·) is the deterministic policy of the target action network, $\theta_a'$ and $\theta_c'$ are the parameters of the target action network and the target evaluation network, respectively, and γ is the discount factor;
step 2.2.4, store the sample $(s_t, a_t, r_t, s_{t+1})$ together with its initial priority in the prioritized experience pool, which is constructed as a binary tree; the initial sample priority is uniformly set to 1, and as the number of iterations increases, the initial priority of a newly stored sample is set to the maximum value obtained by comparison with $\delta_t$; then the state $s_t$ of the current system at time t transitions to the state $s_{t+1}$ at the next time t+1, and all experienced states are stored; if the number of samples in the prioritized experience pool has reached the mini-batch size K, go to step 2.2.5, otherwise go to step 2.2.1;
2.2.5, sampling K samples from the priority experience pool according to the sample priority, wherein the samples with higher priorities have higher sampling probability;
step 2.2.6, for the sampled K samples, the probability that the ith sample is sampled is calculated according to the following formula:
$$P(i) = \frac{p_i^{\alpha'}}{\sum_{m} p_m^{\alpha'}}$$

where m denotes the sample index, $p_m$ is the priority of the mth sample, α' is the sampling priority impact factor, $p_i = |\delta_i| + \epsilon$ is the priority of the ith sample, $\delta_i$ is the time difference error of the ith sample, and ε is a small positive constant;
step 2.2.7, calculate the normalized importance sampling weight of the ith sample: $w_i = \left(P(i)/P_{\min}\right)^{-\beta}$, where $P_{\min}$ is the smallest sampling probability among all samples and β is a gradually changing convergence factor;
step 2.2.8, calculate the corrected reward value of the ith sample: $r_i' = r_i + \eta\, R\!\left(s_i, a_i \mid \theta_r\right)$, where η is a bias factor, $r_i$ is the reward value contained in the ith sample, R(·) is the mapping of the estimated reward function prediction network, $s_i$ is the state contained in the ith sample, $a_i$ is the action contained in the ith sample, and $\theta_r$ are the parameters of the estimated reward function prediction network;
step 2.2.9, use the corrected reward value $r_i'$ to update $\delta_i$ and $p_i$, and update the priorities of the K sampled samples according to $p_i$;
step 2.2.10, update the parameters $\theta_c$ of the estimated evaluation network; the process is as follows: using the mini-batch gradient descent method, the loss function of the estimated evaluation network is

$$L(\theta_c) = \frac{1}{K}\sum_{i=1}^{K} w_i\left(r_i' + \gamma\, Q'\!\left(s_i', \mu'\!\left(s_i'\mid\theta_a'\right)\mid\theta_c'\right) - Q\!\left(s_i, a_i\mid\theta_c\right)\right)^2$$

where $s_i'$ is the next-time state contained in the ith sample, μ'(·) is the deterministic policy of the target action network, Q'(·) is the evaluation value output by the target evaluation network, and Q(·) is the evaluation value output by the estimated evaluation network; the gradient $\nabla_{\theta_c} L(\theta_c)$ is back-propagated to update the parameters $\theta_c$ of the estimated evaluation network; after the update is completed, go to step 2.2.11;
step 2.2.11, update the parameters $\theta_a$ of the estimated action network; the process is as follows: the gradient of the loss function of the estimated action network is

$$\nabla_{\theta_a} J \approx \frac{1}{K}\sum_{i=1}^{K} \nabla_{a} Q\!\left(s, a\mid\theta_c\right)\Big|_{s=s_i,\,a=\mu(s_i\mid\theta_a)}\; \nabla_{\theta_a}\mu\!\left(s\mid\theta_a\right)\Big|_{s=s_i}$$

where μ(·) is the deterministic policy of the estimated action network; the gradient $\nabla_{\theta_a} J$ is back-propagated to update the parameters $\theta_a$ of the estimated action network; after the update is completed, go to step 2.2.12;
step 2.2.12, update the parameters $\theta_r$ of the estimated reward function prediction network; the process is as follows: using the mini-batch gradient descent method, the loss function of the estimated reward function prediction network is

$$L(\theta_r) = \frac{1}{K}\sum_{i=1}^{K} \left(r_i + \gamma\, R'\!\left(s_i', \mu'\!\left(s_i'\mid\theta_a'\right)\mid\theta_r'\right) - R\!\left(s_i, a_i\mid\theta_r\right)\right)^2$$

where R(·) is the mapping of the estimated reward function prediction network, R'(·) is the mapping of the target reward function prediction network, μ'(·) is the deterministic policy of the target action network, and $\theta_r'$ are the parameters of the target reward function prediction network; the gradient $\nabla_{\theta_r} L(\theta_r)$ is back-propagated to update the parameters $\theta_r$ of the estimated reward function prediction network; after the update is completed, go to step 2.2.13;
step 2.2.13, soft update the parameters of the target action network, the target evaluation network and the target reward function prediction network:

$$\theta_c' \leftarrow \tau\,\theta_c + (1-\tau)\,\theta_c'$$

$$\theta_a' \leftarrow \tau\,\theta_a + (1-\tau)\,\theta_a'$$

$$\theta_r' \leftarrow \tau\,\theta_r + (1-\tau)\,\theta_r'$$

where $\theta_c'$, $\theta_a'$ and $\theta_r'$ are the parameters of the target evaluation network, the target action network and the target reward function prediction network, respectively, $\theta_c$, $\theta_a$ and $\theta_r$ are the parameters of the estimated evaluation network, the estimated action network and the estimated reward function prediction network, respectively, and τ is the soft update parameter.
4. The method for mobile load balancing with reinforcement learning and supervised learning combined according to claim 3, wherein in step 2.2.2, the reward value is in the range of [ -1,1 ].
CN202110689823.XA 2021-06-22 2021-06-22 Mobile load balancing method combining reinforcement learning and supervised learning Active CN113365312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110689823.XA CN113365312B (en) 2021-06-22 2021-06-22 Mobile load balancing method combining reinforcement learning and supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110689823.XA CN113365312B (en) 2021-06-22 2021-06-22 Mobile load balancing method combining reinforcement learning and supervised learning

Publications (2)

Publication Number Publication Date
CN113365312A true CN113365312A (en) 2021-09-07
CN113365312B CN113365312B (en) 2022-10-14

Family

ID=77535530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110689823.XA Active CN113365312B (en) 2021-06-22 2021-06-22 Mobile load balancing method combining reinforcement learning and supervised learning

Country Status (1)

Country Link
CN (1) CN113365312B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114598655A (en) * 2022-03-10 2022-06-07 东南大学 Mobility load balancing method based on reinforcement learning
CN114666840A (en) * 2022-03-28 2022-06-24 东南大学 Load balancing method based on multi-agent reinforcement learning
CN114675977A (en) * 2022-05-30 2022-06-28 山东科华电力技术有限公司 Distributed monitoring and transportation and management system based on power internet of things
CN115314379A (en) * 2022-08-10 2022-11-08 汉桑(南京)科技股份有限公司 Method, system, device and medium for configuring equipment parameters
CN115395993A (en) * 2022-04-21 2022-11-25 东南大学 Reconfigurable intelligent surface enhanced MISO-OFDM transmission method
CN115514614A (en) * 2022-11-15 2022-12-23 阿里云计算有限公司 Cloud network anomaly detection model training method based on reinforcement learning and storage medium
WO2023059106A1 (en) * 2021-10-06 2023-04-13 Samsung Electronics Co., Ltd. Method and system for multi-batch reinforcement learning via multi-imitation learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110493826A (en) * 2019-08-28 2019-11-22 重庆邮电大学 A kind of isomery cloud radio access network resources distribution method based on deeply study
CN111181618A (en) * 2020-01-03 2020-05-19 东南大学 Intelligent reflection surface phase optimization method based on deep reinforcement learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110493826A (en) * 2019-08-28 2019-11-22 重庆邮电大学 A kind of isomery cloud radio access network resources distribution method based on deeply study
CN111181618A (en) * 2020-01-03 2020-05-19 东南大学 Intelligent reflection surface phase optimization method based on deep reinforcement learning

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023059106A1 (en) * 2021-10-06 2023-04-13 Samsung Electronics Co., Ltd. Method and system for multi-batch reinforcement learning via multi-imitation learning
CN114598655A (en) * 2022-03-10 2022-06-07 东南大学 Mobility load balancing method based on reinforcement learning
CN114598655B (en) * 2022-03-10 2024-02-02 东南大学 Reinforcement learning-based mobility load balancing method
CN114666840A (en) * 2022-03-28 2022-06-24 东南大学 Load balancing method based on multi-agent reinforcement learning
CN115395993A (en) * 2022-04-21 2022-11-25 东南大学 Reconfigurable intelligent surface enhanced MISO-OFDM transmission method
CN114675977A (en) * 2022-05-30 2022-06-28 山东科华电力技术有限公司 Distributed monitoring and transportation and management system based on power internet of things
CN114675977B (en) * 2022-05-30 2022-08-23 山东科华电力技术有限公司 Distributed monitoring and transportation and management system based on power internet of things
CN115314379A (en) * 2022-08-10 2022-11-08 汉桑(南京)科技股份有限公司 Method, system, device and medium for configuring equipment parameters
CN115314379B (en) * 2022-08-10 2023-11-07 汉桑(南京)科技股份有限公司 Configuration method, system, device and medium for equipment parameters
CN115514614A (en) * 2022-11-15 2022-12-23 阿里云计算有限公司 Cloud network anomaly detection model training method based on reinforcement learning and storage medium
CN115514614B (en) * 2022-11-15 2023-02-24 阿里云计算有限公司 Cloud network anomaly detection model training method based on reinforcement learning and storage medium

Also Published As

Publication number Publication date
CN113365312B (en) 2022-10-14

Similar Documents

Publication Publication Date Title
CN113365312B (en) Mobile load balancing method combining reinforcement learning and supervised learning
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN110809306B (en) Terminal access selection method based on deep reinforcement learning
CN109947545B (en) Task unloading and migration decision method based on user mobility
CN111050330B (en) Mobile network self-optimization method, system, terminal and computer readable storage medium
CN111800828B (en) Mobile edge computing resource allocation method for ultra-dense network
CN113162679A (en) DDPG algorithm-based IRS (inter-Range instrumentation System) auxiliary unmanned aerial vehicle communication joint optimization method
CN109862610A (en) A kind of D2D subscriber resource distribution method based on deeply study DDPG algorithm
CN112887999B (en) Intelligent access control and resource allocation method based on distributed A-C
CN116456493A (en) D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm
WO2021083230A1 (en) Power adjusting method and access network device
CN113795050B (en) Sum Tree sampling-based deep double-Q network dynamic power control method
Attiah et al. Load balancing in cellular networks: A reinforcement learning approach
Elsayed et al. Deep reinforcement learning for reducing latency in mission critical services
CN109714786B (en) Q-learning-based femtocell power control method
CN110492955A (en) Spectrum prediction switching method based on transfer learning strategy
CN113395723B (en) 5G NR downlink scheduling delay optimization system based on reinforcement learning
CN114867030A (en) Double-time-scale intelligent wireless access network slicing method
Xu et al. Deep reinforcement learning based mobility load balancing under multiple behavior policies
CN110267274A (en) A kind of frequency spectrum sharing method according to credit worthiness selection sensing user social between user
CN114501667A (en) Multi-channel access modeling and distributed implementation method considering service priority
CN114598655A (en) Mobility load balancing method based on reinforcement learning
CN115412134A (en) Off-line reinforcement learning-based user-centered non-cellular large-scale MIMO power distribution method
CN114423070A (en) D2D-based heterogeneous wireless network power distribution method and system
CN111935777A (en) 5G mobile load balancing method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant