CN113365312A - Mobile load balancing method combining reinforcement learning and supervised learning - Google Patents
Mobile load balancing method combining reinforcement learning and supervised learning
- Publication number
- CN113365312A CN113365312A CN202110689823.XA CN202110689823A CN113365312A CN 113365312 A CN113365312 A CN 113365312A CN 202110689823 A CN202110689823 A CN 202110689823A CN 113365312 A CN113365312 A CN 113365312A
- Authority
- CN
- China
- Prior art keywords
- network
- base station
- action
- value
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/02—Traffic management, e.g. flow control or congestion control
- H04W28/08—Load balancing or load distribution
- H04W28/086—Load balancing or load distribution among access entities
- H04W28/0861—Load balancing or load distribution among access entities between base stations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/02—Traffic management, e.g. flow control or congestion control
- H04W28/08—Load balancing or load distribution
- H04W28/09—Management thereof
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mobile Radio Communication Systems (AREA)
Abstract
The invention discloses a mobile load balancing method combining reinforcement learning and supervised learning, which comprises the following steps: first, parameters are initialized; then the round loop is started, the initial state of the system is obtained, and an inner loop of steps within each round carries out the reinforcement learning process; when the inner loop ends, the next round begins, until the set maximum number of rounds is reached; after the loops end, the data pool is updated and sampled, and the parameters of each actual execution network are updated; finally, the state of each base station in the system is taken as the input of the corresponding actual execution network, and the output value of the network, namely the cell offset value of each base station, is applied to each base station in the system. Users are handed over according to the A3 event, so that users are redistributed among the base stations, the load of overloaded cells is reduced, and load balancing of the system is achieved. The invention has higher stability and load balancing capability, as well as better generalization and migration capability.
Description
Technical Field
The invention belongs to the technical field of wireless networks in mobile communication, relates to a mobile load balancing optimization method, and particularly relates to a mobile load balancing method combining reinforcement learning and supervised learning.
Background
Mobile load balancing is an important technology for balancing load among wireless network cells. It transfers users between base stations by adjusting cell individual offset parameters, thereby achieving load balancing. Reinforcement learning has been applied to the mobile load balancing problem: methods based on single-agent or multi-agent reinforcement learning can achieve mobile load balancing by adjusting appropriate cell individual offset parameters. However, mobile load balancing based on single-agent reinforcement learning incurs heavy signaling overhead from the exchange of load information, while mobile load balancing based on multi-agent reinforcement learning achieves distributed execution but at a high training time cost. A suitable mobile load balancing method is therefore needed that avoids the disadvantages of both approaches.
Disclosure of Invention
Purpose of the invention: to solve the above problems, the invention provides a mobile load balancing method combining reinforcement learning and supervised learning, which is divided into a reinforcement learning stage and a supervised learning stage. In the reinforcement learning stage, a prioritized experience pool, a normalized reward function and a reward-function prediction network are adopted in an ultra-dense network scenario; in the supervised learning stage, the action network trained in the reinforcement learning stage is used to train a plurality of actual execution networks through supervised learning, and the cell offset values of all base stations are jointly adjusted to realize load balancing.
The technical scheme is as follows: in order to achieve the above object, the mobile load balancing method combining reinforcement learning and supervised learning of the present invention comprises the following steps:
Step 1: the initialization parameters include the learning rate α, the discount factor γ, the soft update factor τ, the mini-batch data set size K, the priority experience pool capacity M, the handover hysteresis parameter Hyst, the number of physical resource blocks N_PRB, the output action value range [O_cmin, O_cmax] of the action network, where O_cmin and O_cmax are the cell offset lower and upper bound values respectively, and the carrier frequency f_c;
Step 2: the system initial state acquisition and reinforcement learning process comprises the following steps:
step 2.1, acquiring an initial state from the initialized system environment;
step 2.2, after the initial state is obtained, starting a reinforcement learning process;
Step 3: judge whether the current number of steps within the round has reached the set maximum number of steps; if not, go to step 2.2; otherwise, go to step 4;
Step 4: judge whether the current number of rounds has reached the set maximum number of rounds; if not, go to step 2; otherwise, go to step 5;
Step 5: use the trained action network as a label generation network, and input all stored states into the label generation network to obtain the corresponding action labels;
Step 6: store each state vector and its corresponding action label into a new data pool;
Step 7: randomly shuffle the storage order of all state-action label pairs in the new data pool;
Step 8: sample K samples sequentially from the new data pool; when fewer than K samples remain, sample all of the remaining data;
Step 9: train each actual execution network by supervised learning; for each actual execution network, obtain the corresponding local action from the sampled local state, and compute the mean square error with respect to the action label to update the parameters of that actual execution network;
Step 10: if the mean square error of each actual execution network has not decreased to a suitable error range, the error range being determined by the performance requirement, go to step 8; otherwise, go to step 11;
Step 11: take the state of each base station in the system as the input of the corresponding actual execution network, obtain the output value of the actual execution network, namely the cell offset value of each base station, and apply it to each base station in the system; users are handed over according to the A3 event defined by the radio resource control protocol, so that users are redistributed among the base stations, the load of overloaded cells is reduced, and load balancing of the system is achieved.
Wherein,
step 2.1, obtaining an initial state from the initialized system environment, including the following processes:
Step 2.1.1: calculate the signal-to-interference-plus-noise ratio of each user; the SINR of user u served by base station b at time t is

$$\mathrm{SINR}_{u,b}^{t}=\frac{P_{b}\,g_{u,b}^{t}}{\sum_{x\in\mathcal{I}}P_{x}\,g_{u,x}^{t}+N_{0}}$$

where P_b and P_x are the transmit powers of the serving base station b and of an interfering base station x respectively, \mathcal{I} is the set of interfering base stations, g_{u,b}^t and g_{u,x}^t are the channel gains from user u to the serving base station b and to the interfering base station x at time t respectively, and N_0 is the noise power;
Step 2.1.2: calculate the maximum data transmission rate of each user on one physical resource block (PRB); the maximum rate of user u on one PRB at time t is defined as

$$R_{u}^{t,\mathrm{PRB}}=B_{\mathrm{PRB}}\log_{2}\!\left(1+\mathrm{SINR}_{u,b}^{t}\right)$$

where B_PRB is the bandwidth of one physical resource block;
Step 2.1.3: calculate the number of physical resource blocks required by each user; the number of PRBs required by user u at time t is

$$N_{u}^{t}=\left\lceil\frac{R_{u}^{t,\mathrm{req}}}{R_{u}^{t,\mathrm{PRB}}}\right\rceil$$

where R_u^{t,req} is the required rate of user u at time t and ⌈·⌉ denotes the ceiling operation;
Step 2.1.4: calculate the load of each base station; the load of serving base station b at time t is defined as

$$\rho_{b}^{t}=\frac{1}{N_{b}}\sum_{u\in\mathcal{U}_{b}^{t}}N_{u}^{t}$$

where N_b is the total number of physical resource blocks of serving base station b and \mathcal{U}_b^t is the set of users of serving base station b at time t;
Step 2.1.5: obtain the initial state from the state space; the state space consists of the de-meaned resource utilization of each base station and the edge-user proportion of all base stations, and the state at time t is defined as

$$s_{t}=\left(\tilde{\rho}_{1}^{t},\ldots,\tilde{\rho}_{N}^{t},\,e^{t}\right),\qquad\tilde{\rho}_{j}^{t}=\rho_{j}^{t}-\bar{\rho}^{t},\qquad\bar{\rho}^{t}=\frac{1}{N}\sum_{j=1}^{N}\rho_{j}^{t}$$

where \tilde{\rho}_j^t is the de-meaned resource utilization of the j-th base station at time t, \bar{\rho}^t is the average resource utilization of all base stations at time t, \rho_j^t is the load of the j-th base station at time t, j = 1, ..., N, N is the total number of base stations in the system, and e^t is the edge-user proportion of all base stations at time t, defined by the A4 event specified in the radio resource control protocol RRC.
In step 2.2, after the initial state is obtained, the reinforcement learning process is started, which comprises the following steps:
step 2.2.1, selecting an action;
The action a_t is defined as the cell offsets of the base stations and is obtained as

$$a_{t}=\mu\!\left(s_{t}\mid\theta_{a}\right),\qquad a_{t}\in\mathcal{A}$$

where s_t is the state of the system at time t, μ(·) is the deterministic policy of the estimated action network, and θ_a denotes the parameters of the estimated action network; the action space \mathcal{A} of the agent is defined as

$$\mathcal{A}=\left\{\left(O_{c1},\ldots,O_{cN}\right)\mid O_{cj}\in\left[O_{c\min},O_{c\max}\right],\ j=1,\ldots,N\right\}$$

where O_cj is the cell offset value of the j-th base station, and O_cmin, O_cmax are the lower and upper bounds of the cell offset respectively; the cell offset values of the z-th and j-th base stations satisfy O_cz - O_cj = O_zj = -O_jz = -(O_cj - O_cz);
Step 2.2.2: the action a_t interacts with the environment to obtain a reward value r_t, and the state s_{t+1} of the system at the next time t+1 is observed; the reward r_t is a normalized function of the loads ρ_b^t of all base stations, where ρ_b^t is the load of serving base station b at time t and BS is the set of all base stations in the system;
Step 2.2.3: calculate the time-difference error δ_t of the state-action pair at time t:

$$\delta_{t}=r_{t}+\gamma\,Q'\!\left(s_{t+1},\mu'\!\left(s_{t+1}\mid\theta_{a}'\right)\mid\theta_{c}'\right)-Q\!\left(s_{t},a_{t}\mid\theta_{c}\right)$$

where Q(·) is the evaluation value output by the estimated evaluation network and θ_c denotes its parameters; Q'(·) is the evaluation value output by the target evaluation network; s_{t+1} is the state of the system at time t+1; μ'(·) is the deterministic policy of the target action network; θ_a' and θ_c' are the parameters of the target action network and the target evaluation network respectively; and γ is the discount factor;
Step 2.2.4: store the sample information (s_t, a_t, r_t, s_{t+1}) together with an initial sample priority into the priority experience pool; the priority experience pool is constructed as a binary tree, the initial sample priority is uniformly set to 1, and as the number of iterations increases the priority of a newly stored sample is subsequently set to the maximum of the initial sample priority and the time-difference error δ_t; then the state s_t of the system at time t transitions to the state s_{t+1} at the next time t+1, and all experienced states are stored; if the number of samples in the priority experience pool has reached the mini-batch size K, go to step 2.2.5, otherwise go to step 2.2.1;
2.2.5, sampling K samples from the priority experience pool according to the sample priority, wherein the samples with higher priorities have higher sampling probability;
Step 2.2.6: for the K sampled samples, the probability that the i-th sample is sampled is calculated as

$$P(i)=\frac{p_{i}^{\alpha'}}{\sum_{m}p_{m}^{\alpha'}}$$

where m indexes the samples, p_m is the priority of the m-th sample, α' is the sampling-priority impact factor, p_i = |δ_i| + ε is the priority of the i-th sample, δ_i is the time-difference error of the i-th sample, and ε is a positive constant;
Step 2.2.7: calculate the normalized importance-sampling weight of the i-th sample: w_i = (P(i)/P_min)^{-β}, where P_min is the smallest sampling probability among all samples and β is a gradually changing convergence factor;
Step 2.2.8: calculate the corrected reward value of the i-th sample: r'_i = r_i + η·R(s_i, a_i | θ_r), where η is a bias factor, r_i is the reward value contained in the i-th sample, R(·) is the mapping of the estimated reward-function prediction network, s_i is the state contained in the i-th sample, a_i is the action contained in the i-th sample, and θ_r denotes the parameters of the estimated reward-function prediction network;
Step 2.2.9: use the corrected reward value r'_i to update δ_i and p_i, and update the priorities of the K sampled samples according to p_i;
Step 2.2.10: update the parameters θ_c of the estimated evaluation network as follows:
compute the gradient of the loss function of the estimated evaluation network using the mini-batch gradient descent method, the loss being

$$L(\theta_{c})=\frac{1}{K}\sum_{i=1}^{K}w_{i}\left(r'_{i}+\gamma\,Q'\!\left(s_{i}',\mu'\!\left(s_{i}'\mid\theta_{a}'\right)\mid\theta_{c}'\right)-Q\!\left(s_{i},a_{i}\mid\theta_{c}\right)\right)^{2}$$

where s_i' is the next-time state contained in the i-th sample, μ'(·) is the deterministic policy of the target action network, Q'(·) is the evaluation value output by the target evaluation network, and Q(·) is the evaluation value output by the estimated evaluation network;
the gradient of the loss is back-propagated to update the parameters θ_c of the estimated evaluation network; after the update is finished, go to step 2.2.11;
Step 2.2.11: update the parameters θ_a of the estimated action network as follows:
compute the policy gradient

$$\nabla_{\theta_{a}}J\approx\frac{1}{K}\sum_{i=1}^{K}\nabla_{a}Q\!\left(s,a\mid\theta_{c}\right)\Big|_{s=s_{i},\,a=\mu(s_{i}\mid\theta_{a})}\,\nabla_{\theta_{a}}\mu\!\left(s\mid\theta_{a}\right)\Big|_{s=s_{i}}$$

where μ(·) is the deterministic policy of the estimated action network;
this gradient is back-propagated to update the parameters θ_a of the estimated action network; after the update is finished, go to step 2.2.12;
Step 2.2.12: update the parameters θ_r of the estimated reward-function prediction network as follows:
compute the gradient of the loss function of the estimated reward-function prediction network using the mini-batch gradient descent method, where R(·) is the mapping of the estimated reward-function prediction network, R'(·) is the mapping of the target reward-function prediction network, μ'(·) is the deterministic policy of the target action network, and θ_r' denotes the parameters of the target reward-function prediction network;
the gradient of the loss is back-propagated to update the parameters θ_r of the estimated reward-function prediction network; after the update is finished, go to step 2.2.13;
Step 2.2.13: soft-update the parameters of the target action network, the target evaluation network and the target reward-function prediction network:

$$\theta_{a}'\leftarrow\tau\theta_{a}+(1-\tau)\theta_{a}',\qquad\theta_{c}'\leftarrow\tau\theta_{c}+(1-\tau)\theta_{c}',\qquad\theta_{r}'\leftarrow\tau\theta_{r}+(1-\tau)\theta_{r}'$$

where θ_c', θ_a' and θ_r' are the parameters of the target evaluation network, the target action network and the target reward-function prediction network respectively, θ_c, θ_a and θ_r are the parameters of the estimated evaluation network, the estimated action network and the estimated reward-function prediction network respectively, and τ is the soft-update parameter.
In step 2.2.2, the reward value lies in the range [-1, 1].
Has the advantages that: compared with the prior art, the invention has the following advantages and beneficial effects:
the method does not need any prior knowledge of the wireless environment, and can automatically learn the optimal Mobile Load Balancing (MLB) strategy through the exploration environment; the invention adopts the normalized reward function, the experience pool with priority and the reward function prediction network, thereby having higher stability and load balancing capability, and simultaneously adopts the supervised learning method to realize the distributed execution effect, thereby avoiding higher training cost when the multi-agent reinforcement learning is carried out, which has important significance in the real network scene and simultaneously has better generalization and migration capability.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following detailed description is only illustrative and not intended to limit the scope of the present invention.
The invention discloses a mobile load balancing method combining reinforcement learning and supervised learning, which comprises the following steps:
step 1: and (5) initializing. The present invention is illustrated with the following parameter settings as examples:
step 1.1, the present invention is explained by taking the following network initialization parameters as an example: the action network, the evaluation network and the reward function prediction network all comprise two hidden layers respectively comprising 400 and 300 neurons, and the learning rates alpha of the hidden layers are all 10-3. The discounting factor gamma is 0.99 and the soft update factor tau is 0.001. The small batch size K is 64, the priority empirical pool capacity M is 10000, the normal number ε is 1e-5, the bias factor η is 0.0001, and the sampling priority impact factor α' is 0.9. There are 7 actual execution networks, each containing 5 hidden layers, 600, 400, 400, and 300 neurons, respectively. The initial learning rate of each actually executed network is 10-3. The capacity of the new data pool is 60000.
Step 1.2: the invention is explained taking the following initialization parameters of the system environment as an example: the system environment comprises 7 base stations distributed over a 60 m by 60 m area. The area contains 200 users in a free-walking state with speeds between 1.0 m/s and 1.2 m/s. Each user is assumed to be a guaranteed-bit-rate user with a bit rate of 128 kb/s. The transmit power of each base station is 20 dBm. The path loss is modeled as 38.3·log(d) + 17.3 + 24.9·log(f_c) in the non-line-of-sight case and 17.3·log(d) + 32.4 + 20·log(f_c) in the line-of-sight case, where d is the distance from the user to the base station in meters and f_c is the carrier frequency, chosen as 3.5 GHz. Shadow fading is modeled as a log-normal distribution with mean 0 and standard deviation 8 dB. The handover hysteresis parameter Hyst is set to 2 dB. The number of physical resource blocks (PRBs) of each base station is N_PRB = 111, and the output action value range [O_cmin, O_cmax] of the action network, where O_cmin and O_cmax are the lower and upper bounds of the cell offset, is initialized to [-1.5 dB, 1.5 dB].
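The path-loss and shadowing model of this example scenario can be sketched as follows. Base-10 logarithms and the carrier frequency expressed in GHz are assumptions, since the patent lists only the coefficient values.

```python
import numpy as np

def path_loss_db(d, fc=3.5, line_of_sight=False):
    """Path loss in dB for the example scenario (d in metres, fc in GHz)."""
    if line_of_sight:
        return 17.3 * np.log10(d) + 32.4 + 20.0 * np.log10(fc)
    return 38.3 * np.log10(d) + 17.3 + 24.9 * np.log10(fc)

def channel_gain(d, shadow_std_db=8.0, line_of_sight=False):
    """Linear channel gain including log-normal shadowing (mean 0, std 8 dB)."""
    loss_db = path_loss_db(d, line_of_sight=line_of_sight) + np.random.normal(0.0, shadow_std_db)
    return 10.0 ** (-loss_db / 10.0)
```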
Step 2: the system initial state acquisition and reinforcement learning process comprises the following steps:
step 2.1, obtaining an initial state from the initialized system environment, and comprising the following processes:
Step 2.1.1: calculate the signal-to-interference-plus-noise ratio of each user; the SINR of user u served by base station b at time t is

$$\mathrm{SINR}_{u,b}^{t}=\frac{P_{b}\,g_{u,b}^{t}}{\sum_{x\in\mathcal{I}}P_{x}\,g_{u,x}^{t}+N_{0}}$$

where P_b and P_x are the transmit powers of the user's serving base station b and of an interfering base station x respectively, both 20 dBm in this example; \mathcal{I} is the set of interfering base stations; g_{u,b}^t and g_{u,x}^t are the channel gains from user u to the serving base station b and to the interfering base station x at time t respectively; and N_0 is the noise power, 3.9811 × 10^-13 in this example;
Step 2.1.2: calculate the maximum data transmission rate of each user on one physical resource block (PRB); the maximum rate of user u on one PRB at time t is defined as

$$R_{u}^{t,\mathrm{PRB}}=B_{\mathrm{PRB}}\log_{2}\!\left(1+\mathrm{SINR}_{u,b}^{t}\right)$$

where B_PRB is the bandwidth of one physical resource block, 180 kHz in this example;
Step 2.1.3: calculate the number of physical resource blocks required by each user; the number of PRBs required by user u at time t is

$$N_{u}^{t}=\left\lceil\frac{R_{u}^{t,\mathrm{req}}}{R_{u}^{t,\mathrm{PRB}}}\right\rceil$$

where R_u^{t,req} denotes the required rate of user u at time t, 128 kb/s in this example, and ⌈·⌉ denotes the ceiling operation;
Step 2.1.4: calculate the load of each base station; the load of serving base station b at time t, defined as the ratio of the total number of physical resource blocks required by all of its users to the total number of physical resource blocks of the base station, is

$$\rho_{b}^{t}=\frac{1}{N_{b}}\sum_{u\in\mathcal{U}_{b}^{t}}N_{u}^{t}$$

where N_b is the total number of physical resource blocks of serving base station b, 111 in this example, and \mathcal{U}_b^t is the set of users of serving base station b at time t;
Step 2.1.5: obtain the initial state from the state space; the state space consists of the de-meaned resource utilization of each base station and the edge-user proportion of all base stations, and the state at time t is

$$s_{t}=\left(\tilde{\rho}_{1}^{t},\ldots,\tilde{\rho}_{N}^{t},\,e^{t}\right),\qquad\tilde{\rho}_{j}^{t}=\rho_{j}^{t}-\bar{\rho}^{t},\qquad\bar{\rho}^{t}=\frac{1}{N}\sum_{j=1}^{N}\rho_{j}^{t}$$

where \tilde{\rho}_j^t is the de-meaned resource utilization of the j-th base station at time t, \bar{\rho}^t is the average resource utilization of all base stations at time t, \rho_j^t is the load of the j-th base station at time t, N is the total number of base stations in the system (N = 7 in this example), and e^t is the edge-user proportion of all base stations at time t, defined by the A4 event specified in the Radio Resource Control (RRC) protocol;
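A NumPy sketch of steps 2.1.1 to 2.1.5 is shown below. The array shapes, the serving-cell rule (strongest received power) and the treatment of the edge-user proportion as a single scalar supplied by the RRC layer are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np

def build_state(P_tx, gain, N0, B_PRB, rate_req, N_PRB, edge_ratio):
    """Steps 2.1.1-2.1.5: per-user SINR, PRB demand, per-BS load, global state.

    P_tx:       (N,) transmit power of each base station (linear units)
    gain:       (U, N) channel gain of each user to each base station
    N0:         noise power (linear units)
    B_PRB:      bandwidth of one physical resource block in Hz
    rate_req:   (U,) required bit rate of each user in bit/s
    N_PRB:      total physical resource blocks per base station
    edge_ratio: edge-user proportion (A4 event), assumed given by the RRC layer
    """
    U, N = gain.shape
    rx_all = gain * P_tx                               # received power from every BS
    serving = np.argmax(rx_all, axis=1)                # assumed serving-cell rule
    rx = rx_all[np.arange(U), serving]
    interference = rx_all.sum(axis=1) - rx             # all non-serving BSs interfere
    sinr = rx / (interference + N0)                    # step 2.1.1

    r_prb = B_PRB * np.log2(1.0 + sinr)                # step 2.1.2: rate on one PRB
    n_prb = np.ceil(rate_req / r_prb)                  # step 2.1.3: PRBs per user

    load = np.array([n_prb[serving == b].sum() / N_PRB for b in range(N)])  # step 2.1.4

    # step 2.1.5: de-meaned resource utilisation plus the edge-user proportion
    state = np.concatenate([load - load.mean(), [edge_ratio]])
    return state, load
```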
step 2.2, after the initial state is obtained, starting a reinforcement learning process, which comprises the following procedures:
step 2.2.1, selecting an action;
The action a_t is defined as the cell offsets of the base stations and is obtained as a_t = μ(s_t | θ_a) with a_t ∈ \mathcal{A}, where s_t is the state of the system at time t, μ(·) is the deterministic policy of the estimated action network, and θ_a denotes the parameters of the estimated action network; the action space \mathcal{A} of the agent is defined as

$$\mathcal{A}=\left\{\left(O_{c1},\ldots,O_{cN}\right)\mid O_{cj}\in\left[O_{c\min},O_{c\max}\right],\ j=1,\ldots,N\right\}$$

where O_cj is the cell offset value of the j-th base station and O_cj ∈ [O_cmin, O_cmax], with O_cmin = -1.5 dB and O_cmax = 1.5 dB in this example; the cell offset values of the z-th and j-th base stations satisfy O_cz - O_cj = O_zj = -O_jz = -(O_cj - O_cz);
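Continuing the PyTorch sketch, action selection in step 2.2.1 might look like the following. The Gaussian exploration noise is an assumption; the patent only defines a_t = μ(s_t | θ_a) with each component restricted to [O_cmin, O_cmax].

```python
import torch

def select_action(actor, state, o_cmin=-1.5, o_cmax=1.5, noise_std=0.1):
    """Step 2.2.1: query the estimated action network and clip to the offset range."""
    with torch.no_grad():
        a = actor(torch.as_tensor(state, dtype=torch.float32))
    a = a + noise_std * torch.randn_like(a)        # exploration noise (assumed)
    return a.clamp(o_cmin, o_cmax).numpy()         # one cell offset per base station, in dB
```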
Step 2.2.2: the action a_t interacts with the environment to obtain a reward value r_t, and the state s_{t+1} of the system at the next time t+1 is observed; the reward r_t is a normalized function of the loads ρ_b^t of all base stations, where ρ_b^t is the load of serving base station b at time t and BS is the set of all base stations in the system;
the reward value is designed to lie in the range [-1, 1], which benefits the convergence and stability of the method;
Step 2.2.3: calculate the time-difference error δ_t of the state-action pair at time t:

$$\delta_{t}=r_{t}+\gamma\,Q'\!\left(s_{t+1},\mu'\!\left(s_{t+1}\mid\theta_{a}'\right)\mid\theta_{c}'\right)-Q\!\left(s_{t},a_{t}\mid\theta_{c}\right)$$

where Q(·) is the evaluation value output by the estimated evaluation network and θ_c denotes its parameters, whose initial values are determined by system initialization; Q'(·) is the evaluation value output by the target evaluation network; s_{t+1} is the state of the system at time t+1; μ'(·) is the deterministic policy of the target action network; θ_a' and θ_c' are the parameters of the target action network and the target evaluation network respectively, whose initial values are determined by system initialization; and γ is the discount factor; the larger the absolute value of the time-difference error δ_t, the higher the priority;
Step 2.2.4: store the sample information (s_t, a_t, r_t, s_{t+1}) together with an initial sample priority into the priority experience pool; the priority experience pool is constructed as a binary tree, the initial sample priority is uniformly set to 1, and as the number of iterations increases the priority of a newly stored sample is subsequently set to the maximum of the initial sample priority and the time-difference error δ_t; then the current state s_t transitions to the state s_{t+1} at the next time t+1, and all experienced states are stored; if the number of samples in the priority experience pool has reached the mini-batch size K (K = 64 in this example), go to step 2.2.5, otherwise go to step 2.2.1;
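The binary-tree (sum-tree) priority experience pool of step 2.2.4 can be sketched as below. The pool capacity is assumed to be a power of two for brevity, and new samples are assumed to receive the maximum of the initial priority and |δ_t|, as described above.

```python
import numpy as np

class PriorityPool:
    """Sum-tree priority experience pool (step 2.2.4); a compact sketch only.

    Leaves hold sample priorities; every internal node stores the sum of its
    two children, so the root holds the total priority mass.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity)     # index 0 unused, root at index 1
        self.data = [None] * capacity
        self.write = 0
        self.size = 0

    def update(self, leaf, priority):
        idx = leaf + self.capacity
        self.tree[idx] = priority
        idx //= 2
        while idx >= 1:                        # propagate the new sum to the root
            self.tree[idx] = self.tree[2 * idx] + self.tree[2 * idx + 1]
            idx //= 2

    def add(self, transition, delta=None):
        """New samples get priority 1, later max(1, |delta_t|) as iterations grow."""
        priority = 1.0 if delta is None else max(1.0, abs(delta))
        self.data[self.write] = transition
        self.update(self.write, priority)
        self.write = (self.write + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample_one(self):
        """Draw one transition with probability proportional to its priority."""
        u = np.random.uniform(0.0, self.tree[1])
        idx = 1
        while idx < self.capacity:             # descend from the root to a leaf
            left = 2 * idx
            if u <= self.tree[left]:
                idx = left
            else:
                u -= self.tree[left]
                idx = left + 1
        leaf = idx - self.capacity
        return leaf, self.data[leaf], self.tree[idx]
```

Sampling the K = 64 transitions of step 2.2.5 then amounts to calling sample_one K times and remembering the returned leaf indices so their priorities can be refreshed in step 2.2.9.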
2.2.5, selecting K samples from the priority experience pool according to the priority of the samples, wherein the higher the priority of the samples is, the higher the probability of sampling is, and K is 64 in the example;
Step 2.2.6: for the K sampled samples, calculate the probability that the i-th sample is sampled:

$$P(i)=\frac{p_{i}^{\alpha'}}{\sum_{m}p_{m}^{\alpha'}}$$

where m indexes the samples, p_m is the priority of the m-th sample, α' is the sampling-priority impact factor (0.9 in this example), p_i = |δ_i| + ε is the priority of the i-th sample, δ_i is the time-difference error of the i-th sample, and ε is a positive constant;
Step 2.2.7: calculate the normalized importance-sampling weight of the i-th sample:

$$w_{i}=\frac{\left(M\cdot P(i)\right)^{-\beta}}{\left(M\cdot P_{\min}\right)^{-\beta}}=\left(\frac{P(i)}{P_{\min}}\right)^{-\beta}$$

where P_min is the smallest sampling probability among all samples, M is the capacity of the priority experience pool (10000 in this example), and β is a gradually changing convergence factor with an initial value of 0.01 in this example;
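Steps 2.2.6 and 2.2.7 reduce to a few array operations. In this sketch P_min is taken over the sampled batch rather than over the whole pool, which is an approximation.

```python
import numpy as np

def sampling_statistics(deltas, alpha_prime=0.9, beta=0.01, eps=1e-5):
    """Steps 2.2.6-2.2.7: sampling probabilities P(i) and normalised IS weights w_i."""
    p = np.abs(deltas) + eps                 # p_i = |delta_i| + eps
    p_a = p ** alpha_prime
    P = p_a / p_a.sum()                      # P(i) = p_i^alpha' / sum_m p_m^alpha'
    w = (P / P.min()) ** (-beta)             # w_i = (P(i) / P_min)^(-beta)
    return P, w
```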
Step 2.2.8: calculate the corrected reward value of the i-th sample: r'_i = r_i + η·R(s_i, a_i | θ_r), where η is the bias factor, 0.0001 in this example; r_i is the reward value contained in the i-th sample, R(·) is the mapping of the estimated reward-function prediction network, s_i is the state contained in the i-th sample, a_i is the action contained in the i-th sample, and θ_r denotes the parameters of the estimated reward-function prediction network;
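Step 2.2.8 can be expressed as a one-line correction. Feeding the reward-function prediction network the concatenated state-action pair is an assumption about its input format.

```python
import torch

def corrected_reward(r, s, a, reward_net, eta=1e-4):
    """Step 2.2.8: r'_i = r_i + eta * R(s_i, a_i | theta_r)."""
    with torch.no_grad():
        r_hat = reward_net(torch.cat([s, a], dim=-1)).squeeze(-1)
    return r + eta * r_hat
```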
Step 2.2.9: use the corrected reward value r'_i to update δ_i and p_i, and update the priorities of the K sampled samples according to p_i;
Step 2.2.10: update the parameters θ_c of the estimated evaluation network:
first compute the loss function of the estimated evaluation network, which, using the mini-batch gradient descent method, is defined as

$$L(\theta_{c})=\frac{1}{K}\sum_{i=1}^{K}w_{i}\left(r'_{i}+\gamma\,Q'\!\left(s_{i}',\mu'\!\left(s_{i}'\mid\theta_{a}'\right)\mid\theta_{c}'\right)-Q\!\left(s_{i},a_{i}\mid\theta_{c}\right)\right)^{2}$$

where s_i' is the next-time state contained in the i-th sample, Q'(·) is the evaluation value output by the target evaluation network, and Q(·) is the evaluation value output by the estimated evaluation network;
the gradient of the loss is back-propagated to update the parameters θ_c of the estimated evaluation network; after the update is finished, go to step 2.2.11;
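A PyTorch sketch of the evaluation-network update in step 2.2.10 follows, assuming the importance-sampling weights w_i and corrected rewards r'_i enter the loss as written above and that the networks take a concatenated state-action input.

```python
import torch

def update_critic(critic, critic_target, actor_target, optimizer, batch, w, gamma=0.99):
    """Step 2.2.10: importance-weighted TD loss on the estimated evaluation network."""
    s, a, r_corr, s_next = batch                       # mini-batch tensors from the pool
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r_corr + gamma * critic_target(torch.cat([s_next, a_next], dim=-1)).squeeze(-1)
    q = critic(torch.cat([s, a], dim=-1)).squeeze(-1)
    loss = (w * (y - q) ** 2).mean()                   # weighted mean-square TD error
    optimizer.zero_grad()
    loss.backward()                                    # back-propagate the gradient
    optimizer.step()
    return (y - q).detach()                            # fresh TD errors for the priorities
```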
Step 2.2.11: update the parameters θ_a of the estimated action network by computing the policy gradient

$$\nabla_{\theta_{a}}J\approx\frac{1}{K}\sum_{i=1}^{K}\nabla_{a}Q\!\left(s,a\mid\theta_{c}\right)\Big|_{s=s_{i},\,a=\mu(s_{i}\mid\theta_{a})}\,\nabla_{\theta_{a}}\mu\!\left(s\mid\theta_{a}\right)\Big|_{s=s_{i}}$$

where μ(·) is the deterministic policy of the estimated action network;
this gradient is back-propagated to update the parameters θ_a of the estimated action network; after the update is finished, go to step 2.2.12;
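The action-network update in step 2.2.11 follows the deterministic policy gradient; a sketch under the same concatenated-input assumption is below, where autograd through the evaluation network realises the chain rule written above.

```python
import torch

def update_actor(actor, critic, optimizer, s):
    """Step 2.2.11: ascend the critic's evaluation of mu(s | theta_a)."""
    loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()                # gradients flow through mu into theta_a
    optimizer.step()
```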
Step 2.2.12: update the parameters θ_r of the estimated reward-function prediction network:
compute the gradient of the loss function of the estimated reward-function prediction network using the mini-batch gradient descent method, where R(·) is the mapping of the estimated reward-function prediction network and R'(·) is the mapping of the target reward-function prediction network;
the gradient of the loss is back-propagated to update the parameters θ_r of the estimated reward-function prediction network; after the update is finished, go to step 2.2.13;
Step 2.2.13: soft-update the parameters of the target action network, the target evaluation network and the target reward-function prediction network:

$$\theta_{a}'\leftarrow\tau\theta_{a}+(1-\tau)\theta_{a}',\qquad\theta_{c}'\leftarrow\tau\theta_{c}+(1-\tau)\theta_{c}',\qquad\theta_{r}'\leftarrow\tau\theta_{r}+(1-\tau)\theta_{r}'$$

where θ_c', θ_a' and θ_r' are the parameters of the target evaluation network, the target action network and the target reward-function prediction network respectively, θ_c, θ_a and θ_r are the parameters of the estimated evaluation network, the estimated action network and the estimated reward-function prediction network respectively, and τ is the soft-update parameter, 0.001 in this example;
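The soft update of step 2.2.13 is the usual Polyak averaging with τ = 0.001, applied to each target network in turn; a sketch:

```python
import torch

def soft_update(target_net, online_net, tau=0.001):
    """Step 2.2.13: theta' <- tau * theta + (1 - tau) * theta'."""
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), online_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```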
Step 3: judge whether the current number of steps within the round has reached the set maximum number of steps; if not, go to step 2.2; otherwise, go to step 4; the maximum number of steps is 100 in this example;
Step 4: judge whether the current number of rounds has reached the set maximum number of rounds; if not, go to step 2; otherwise, go to step 5; the maximum number of rounds is 600 in this example;
Step 5: use the trained action network as a label generation network, and input all stored states into the label generation network to obtain the corresponding action labels;
Step 6: store each state vector and its corresponding action label into a new data pool; the new data pool capacity is 60000 in this example;
Step 7: randomly shuffle the storage order of all state-action label pairs in the new data pool;
Step 8: sample K samples sequentially from the new data pool; when fewer than K samples remain, sample all of the remaining data;
Step 9: train each actual execution network by supervised learning; for each actual execution network, obtain the corresponding local action from the sampled local state, and compute the mean square error with respect to the action label to update the parameters of that actual execution network;
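Steps 5 to 9 amount to distilling the trained action network into the per-base-station actual execution networks. The sketch below assumes each actual execution network sees only its own base station's de-meaned load together with the edge-user proportion, which is one possible reading of the "local state"; the fixed epoch count stands in for the error-threshold check of step 10.

```python
import torch
import torch.nn.functional as F

def distill_executors(label_net, executors, optimizers, states, batch_size=64, epochs=50):
    """Steps 5-9: supervised training of the actual execution networks (a sketch)."""
    with torch.no_grad():
        labels = label_net(states)                       # step 5: action labels
    perm = torch.randperm(states.shape[0])               # step 7: shuffle the data pool
    states, labels = states[perm], labels[perm]

    for _ in range(epochs):
        for start in range(0, states.shape[0], batch_size):   # step 8: sequential batches
            s_b = states[start:start + batch_size]
            y_b = labels[start:start + batch_size]
            for j, (net, opt) in enumerate(zip(executors, optimizers)):
                local_s = s_b[:, [j, -1]]                # assumed local view of BS j
                pred = net(local_s).squeeze(-1)
                loss = F.mse_loss(pred, y_b[:, j])       # step 9: MSE against the label
                opt.zero_grad()
                loss.backward()
                opt.step()
```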
Step 10: if the mean square error of each actual execution network has not decreased to a suitable error range, the error range being determined by the performance requirement, go to step 8; otherwise, go to step 11;
Step 11: take the state of each base station in the system as the input of the corresponding actual execution network, obtain the output value of the network, namely the cell offset value of each base station, and apply it to each base station in the system; users are handed over according to the A3 event defined by the RRC protocol, so that users are redistributed among the base stations, the load of overloaded cells is reduced, and load balancing of the system is achieved.
The method can effectively realize the load balance of the network, simultaneously improve the robustness and stability of the network, reduce the load fluctuation of the network, simultaneously realize the effect of distributed execution and avoid higher training cost of multi-agent reinforcement learning.
It will be understood by those skilled in the art that, unless otherwise defined, all terms including technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.
Claims (4)
1. A mobile load balancing method combining reinforcement learning and supervised learning is characterized by comprising the following steps:
Step 1: the initialization parameters comprise the learning rate α, the discount factor γ, the soft update factor τ, the mini-batch data set size K, the priority experience pool capacity M, the handover hysteresis parameter Hyst, the number of physical resource blocks N_PRB, the output action value range [O_cmin, O_cmax] of the action network, where O_cmin and O_cmax are the cell offset lower and upper bound values respectively, and the carrier frequency f_c;
Step 2: the system initial state acquisition and reinforcement learning process comprises the following steps:
step 2.1, acquiring an initial state from the initialized system environment;
step 2.2, after the initial state is obtained, starting a reinforcement learning process;
Step 3: judge whether the current number of steps within the round has reached the set maximum number of steps; if not, go to step 2.2; otherwise, go to step 4;
Step 4: judge whether the current number of rounds has reached the set maximum number of rounds; if not, go to step 2; otherwise, go to step 5;
Step 5: use the trained action network as a label generation network, and input all stored states into the label generation network to obtain the corresponding action labels;
Step 6: store each state vector and its corresponding action label into a new data pool;
Step 7: randomly shuffle the storage order of all state-action label pairs in the new data pool;
Step 8: sample K samples sequentially from the new data pool; when fewer than K samples remain, sample all of the remaining data;
Step 9: train each actual execution network by supervised learning; for each actual execution network, obtain the corresponding local action from the sampled local state, and compute the mean square error with respect to the action label to update the parameters of that actual execution network;
Step 10: if the mean square error of each actual execution network has not decreased to a suitable error range, the error range being determined by the performance requirement, go to step 8; otherwise, go to step 11;
Step 11: take the state of each base station in the system as the input of the corresponding actual execution network, obtain the output value of the actual execution network, namely the cell offset value of each base station, and apply it to each base station in the system; users are handed over according to the A3 event defined by the radio resource control protocol, so that users are redistributed among the base stations, the load of overloaded cells is reduced, and load balancing of the system is achieved.
2. The method for mobile load balancing with reinforcement learning and supervised learning combined according to claim 1, wherein the step 2.1 of obtaining the initial state from the initialized system environment includes the following steps:
Step 2.1.1: calculate the signal-to-interference-plus-noise ratio of each user; the SINR of user u served by base station b at time t is

$$\mathrm{SINR}_{u,b}^{t}=\frac{P_{b}\,g_{u,b}^{t}}{\sum_{x\in\mathcal{I}}P_{x}\,g_{u,x}^{t}+N_{0}}$$

where P_b and P_x are the transmit powers of the serving base station b and of an interfering base station x respectively, \mathcal{I} is the set of interfering base stations, g_{u,b}^t and g_{u,x}^t are the channel gains from user u to the serving base station b and to the interfering base station x at time t respectively, and N_0 is the noise power;
Step 2.1.2: calculate the maximum data transmission rate of each user on one physical resource block (PRB); the maximum rate of user u on one PRB at time t is defined as

$$R_{u}^{t,\mathrm{PRB}}=B_{\mathrm{PRB}}\log_{2}\!\left(1+\mathrm{SINR}_{u,b}^{t}\right)$$

where B_PRB is the bandwidth of one physical resource block;
Step 2.1.3: calculate the number of physical resource blocks required by each user; the number of PRBs required by user u at time t is

$$N_{u}^{t}=\left\lceil\frac{R_{u}^{t,\mathrm{req}}}{R_{u}^{t,\mathrm{PRB}}}\right\rceil$$

where R_u^{t,req} is the required rate of user u at time t and ⌈·⌉ denotes the ceiling operation;
Step 2.1.4: calculate the load of each base station; the load of serving base station b at time t is defined as

$$\rho_{b}^{t}=\frac{1}{N_{b}}\sum_{u\in\mathcal{U}_{b}^{t}}N_{u}^{t}$$

where N_b is the total number of physical resource blocks of serving base station b and \mathcal{U}_b^t is the set of users of serving base station b at time t;
Step 2.1.5: obtain the initial state from the state space; the state space consists of the de-meaned resource utilization of each base station and the edge-user proportion of all base stations, and the state at time t is defined as

$$s_{t}=\left(\tilde{\rho}_{1}^{t},\ldots,\tilde{\rho}_{N}^{t},\,e^{t}\right),\qquad\tilde{\rho}_{j}^{t}=\rho_{j}^{t}-\bar{\rho}^{t},\qquad\bar{\rho}^{t}=\frac{1}{N}\sum_{j=1}^{N}\rho_{j}^{t}$$

where \tilde{\rho}_j^t is the de-meaned resource utilization of the j-th base station at time t, \bar{\rho}^t is the average resource utilization of all base stations at time t, \rho_j^t is the load of the j-th base station at time t, j = 1, ..., N, N is the total number of base stations in the system, and e^t is the edge-user proportion of all base stations at time t, defined by the A4 event specified in the radio resource control protocol RRC.
3. The method for balancing mobile load combining reinforcement learning and supervised learning according to claim 1, wherein after the initial state is obtained in step 2.2, the reinforcement learning process is started, and the method includes the following steps:
step 2.2.1, selecting an action;
The action a_t is defined as the cell offsets of the base stations and is obtained as

$$a_{t}=\mu\!\left(s_{t}\mid\theta_{a}\right),\qquad a_{t}\in\mathcal{A}$$

where s_t is the state of the system at time t, μ(·) is the deterministic policy of the estimated action network, and θ_a denotes the parameters of the estimated action network; the action space \mathcal{A} of the agent is defined as

$$\mathcal{A}=\left\{\left(O_{c1},\ldots,O_{cN}\right)\mid O_{cj}\in\left[O_{c\min},O_{c\max}\right],\ j=1,\ldots,N\right\}$$

where O_cj is the cell offset value of the j-th base station, and O_cmin, O_cmax are the lower and upper bounds of the cell offset respectively; the cell offset values of the z-th and j-th base stations satisfy O_cz - O_cj = O_zj = -O_jz = -(O_cj - O_cz);
Step 2.2.2: the action a_t interacts with the environment to obtain a reward value r_t, and the state s_{t+1} of the system at the next time t+1 is observed; the reward r_t is a normalized function of the loads ρ_b^t of all base stations, where ρ_b^t is the load of serving base station b at time t and BS is the set of all base stations in the system;
Step 2.2.3: calculate the time-difference error δ_t of the state-action pair at time t:

$$\delta_{t}=r_{t}+\gamma\,Q'\!\left(s_{t+1},\mu'\!\left(s_{t+1}\mid\theta_{a}'\right)\mid\theta_{c}'\right)-Q\!\left(s_{t},a_{t}\mid\theta_{c}\right)$$

where Q(·) is the evaluation value output by the estimated evaluation network and θ_c denotes its parameters; Q'(·) is the evaluation value output by the target evaluation network; s_{t+1} is the state of the system at time t+1; μ'(·) is the deterministic policy of the target action network; θ_a' and θ_c' are the parameters of the target action network and the target evaluation network respectively; and γ is the discount factor;
Step 2.2.4: store the sample information (s_t, a_t, r_t, s_{t+1}) together with an initial sample priority into the priority experience pool; the priority experience pool is constructed as a binary tree, the initial sample priority is uniformly set to 1, and as the number of iterations increases the priority of a newly stored sample is subsequently set to the maximum of the initial sample priority and the time-difference error δ_t; then the state s_t of the current system at time t transitions to the state s_{t+1} at the next time t+1, and all experienced states are stored; if the number of samples in the priority experience pool has reached the mini-batch size K, go to step 2.2.5, otherwise go to step 2.2.1;
2.2.5, sampling K samples from the priority experience pool according to the sample priority, wherein the samples with higher priorities have higher sampling probability;
Step 2.2.6: for the K sampled samples, the probability that the i-th sample is sampled is calculated as

$$P(i)=\frac{p_{i}^{\alpha'}}{\sum_{m}p_{m}^{\alpha'}}$$

where m indexes the samples, p_m is the priority of the m-th sample, α' is the sampling-priority impact factor, p_i = |δ_i| + ε is the priority of the i-th sample, δ_i is the time-difference error of the i-th sample, and ε is a positive constant;
Step 2.2.7: calculate the normalized importance-sampling weight of the i-th sample: w_i = (P(i)/P_min)^{-β}, where P_min is the smallest sampling probability among all samples and β is a gradually changing convergence factor;
Step 2.2.8: calculate the corrected reward value of the i-th sample: r'_i = r_i + η·R(s_i, a_i | θ_r), where η is a bias factor, r_i is the reward value contained in the i-th sample, R(·) is the mapping of the estimated reward-function prediction network, s_i is the state contained in the i-th sample, a_i is the action contained in the i-th sample, and θ_r denotes the parameters of the estimated reward-function prediction network;
Step 2.2.9: use the corrected reward value r'_i to update δ_i and p_i, and update the priorities of the K sampled samples according to p_i;
Step 2.2.10: update the parameters θ_c of the estimated evaluation network as follows:
compute the gradient of the loss function of the estimated evaluation network using the mini-batch gradient descent method, the loss being

$$L(\theta_{c})=\frac{1}{K}\sum_{i=1}^{K}w_{i}\left(r'_{i}+\gamma\,Q'\!\left(s_{i}',\mu'\!\left(s_{i}'\mid\theta_{a}'\right)\mid\theta_{c}'\right)-Q\!\left(s_{i},a_{i}\mid\theta_{c}\right)\right)^{2}$$

where s_i' is the next-time state contained in the i-th sample, μ'(·) is the deterministic policy of the target action network, Q'(·) is the evaluation value output by the target evaluation network, and Q(·) is the evaluation value output by the estimated evaluation network;
the gradient of the loss is back-propagated to update the parameters θ_c of the estimated evaluation network; after the update is finished, go to step 2.2.11;
Step 2.2.11: update the parameters θ_a of the estimated action network as follows:
compute the policy gradient

$$\nabla_{\theta_{a}}J\approx\frac{1}{K}\sum_{i=1}^{K}\nabla_{a}Q\!\left(s,a\mid\theta_{c}\right)\Big|_{s=s_{i},\,a=\mu(s_{i}\mid\theta_{a})}\,\nabla_{\theta_{a}}\mu\!\left(s\mid\theta_{a}\right)\Big|_{s=s_{i}}$$

where μ(·) is the deterministic policy of the estimated action network;
this gradient is back-propagated to update the parameters θ_a of the estimated action network; after the update is finished, go to step 2.2.12;
Step 2.2.12: update the parameters θ_r of the estimated reward-function prediction network as follows:
compute the gradient of the loss function of the estimated reward-function prediction network using the mini-batch gradient descent method, where R(·) is the mapping of the estimated reward-function prediction network, R'(·) is the mapping of the target reward-function prediction network, μ'(·) is the deterministic policy of the target action network, and θ_r' denotes the parameters of the target reward-function prediction network;
the gradient of the loss is back-propagated to update the parameters θ_r of the estimated reward-function prediction network; after the update is finished, go to step 2.2.13;
Step 2.2.13: soft-update the parameters of the target action network, the target evaluation network and the target reward-function prediction network:

$$\theta_{a}'\leftarrow\tau\theta_{a}+(1-\tau)\theta_{a}',\qquad\theta_{c}'\leftarrow\tau\theta_{c}+(1-\tau)\theta_{c}',\qquad\theta_{r}'\leftarrow\tau\theta_{r}+(1-\tau)\theta_{r}'$$

where θ_c', θ_a' and θ_r' are the parameters of the target evaluation network, the target action network and the target reward-function prediction network respectively, θ_c, θ_a and θ_r are the parameters of the estimated evaluation network, the estimated action network and the estimated reward-function prediction network respectively, and τ is the soft-update parameter.
4. The method for mobile load balancing with reinforcement learning and supervised learning combined according to claim 3, wherein in step 2.2.2, the reward value is in the range of [ -1,1 ].
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110689823.XA CN113365312B (en) | 2021-06-22 | 2021-06-22 | Mobile load balancing method combining reinforcement learning and supervised learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110689823.XA CN113365312B (en) | 2021-06-22 | 2021-06-22 | Mobile load balancing method combining reinforcement learning and supervised learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113365312A true CN113365312A (en) | 2021-09-07 |
CN113365312B CN113365312B (en) | 2022-10-14 |
Family
ID=77535530
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110689823.XA Active CN113365312B (en) | 2021-06-22 | 2021-06-22 | Mobile load balancing method combining reinforcement learning and supervised learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113365312B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114598655A (en) * | 2022-03-10 | 2022-06-07 | 东南大学 | Mobility load balancing method based on reinforcement learning |
CN114666840A (en) * | 2022-03-28 | 2022-06-24 | 东南大学 | Load balancing method based on multi-agent reinforcement learning |
CN114675977A (en) * | 2022-05-30 | 2022-06-28 | 山东科华电力技术有限公司 | Distributed monitoring and transportation and management system based on power internet of things |
CN115314379A (en) * | 2022-08-10 | 2022-11-08 | 汉桑(南京)科技股份有限公司 | Method, system, device and medium for configuring equipment parameters |
CN115395993A (en) * | 2022-04-21 | 2022-11-25 | 东南大学 | Reconfigurable intelligent surface enhanced MISO-OFDM transmission method |
CN115514614A (en) * | 2022-11-15 | 2022-12-23 | 阿里云计算有限公司 | Cloud network anomaly detection model training method based on reinforcement learning and storage medium |
WO2023059106A1 (en) * | 2021-10-06 | 2023-04-13 | Samsung Electronics Co., Ltd. | Method and system for multi-batch reinforcement learning via multi-imitation learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110493826A (en) * | 2019-08-28 | 2019-11-22 | 重庆邮电大学 | A kind of isomery cloud radio access network resources distribution method based on deeply study |
CN111181618A (en) * | 2020-01-03 | 2020-05-19 | 东南大学 | Intelligent reflection surface phase optimization method based on deep reinforcement learning |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110493826A (en) * | 2019-08-28 | 2019-11-22 | 重庆邮电大学 | A kind of isomery cloud radio access network resources distribution method based on deeply study |
CN111181618A (en) * | 2020-01-03 | 2020-05-19 | 东南大学 | Intelligent reflection surface phase optimization method based on deep reinforcement learning |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023059106A1 (en) * | 2021-10-06 | 2023-04-13 | Samsung Electronics Co., Ltd. | Method and system for multi-batch reinforcement learning via multi-imitation learning |
CN114598655A (en) * | 2022-03-10 | 2022-06-07 | 东南大学 | Mobility load balancing method based on reinforcement learning |
CN114598655B (en) * | 2022-03-10 | 2024-02-02 | 东南大学 | Reinforcement learning-based mobility load balancing method |
CN114666840A (en) * | 2022-03-28 | 2022-06-24 | 东南大学 | Load balancing method based on multi-agent reinforcement learning |
CN115395993A (en) * | 2022-04-21 | 2022-11-25 | 东南大学 | Reconfigurable intelligent surface enhanced MISO-OFDM transmission method |
CN114675977A (en) * | 2022-05-30 | 2022-06-28 | 山东科华电力技术有限公司 | Distributed monitoring and transportation and management system based on power internet of things |
CN114675977B (en) * | 2022-05-30 | 2022-08-23 | 山东科华电力技术有限公司 | Distributed monitoring and transportation and management system based on power internet of things |
CN115314379A (en) * | 2022-08-10 | 2022-11-08 | 汉桑(南京)科技股份有限公司 | Method, system, device and medium for configuring equipment parameters |
CN115314379B (en) * | 2022-08-10 | 2023-11-07 | 汉桑(南京)科技股份有限公司 | Configuration method, system, device and medium for equipment parameters |
CN115514614A (en) * | 2022-11-15 | 2022-12-23 | 阿里云计算有限公司 | Cloud network anomaly detection model training method based on reinforcement learning and storage medium |
CN115514614B (en) * | 2022-11-15 | 2023-02-24 | 阿里云计算有限公司 | Cloud network anomaly detection model training method based on reinforcement learning and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113365312B (en) | 2022-10-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113365312B (en) | Mobile load balancing method combining reinforcement learning and supervised learning | |
CN109729528B (en) | D2D resource allocation method based on multi-agent deep reinforcement learning | |
CN110809306B (en) | Terminal access selection method based on deep reinforcement learning | |
CN109947545B (en) | Task unloading and migration decision method based on user mobility | |
CN111050330B (en) | Mobile network self-optimization method, system, terminal and computer readable storage medium | |
CN111800828B (en) | Mobile edge computing resource allocation method for ultra-dense network | |
CN113162679A (en) | DDPG algorithm-based IRS (inter-Range instrumentation System) auxiliary unmanned aerial vehicle communication joint optimization method | |
CN109862610A (en) | A kind of D2D subscriber resource distribution method based on deeply study DDPG algorithm | |
CN112887999B (en) | Intelligent access control and resource allocation method based on distributed A-C | |
CN116456493A (en) | D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm | |
WO2021083230A1 (en) | Power adjusting method and access network device | |
CN113795050B (en) | Sum Tree sampling-based deep double-Q network dynamic power control method | |
Attiah et al. | Load balancing in cellular networks: A reinforcement learning approach | |
Elsayed et al. | Deep reinforcement learning for reducing latency in mission critical services | |
CN109714786B (en) | Q-learning-based femtocell power control method | |
CN110492955A (en) | Spectrum prediction switching method based on transfer learning strategy | |
CN113395723B (en) | 5G NR downlink scheduling delay optimization system based on reinforcement learning | |
CN114867030A (en) | Double-time-scale intelligent wireless access network slicing method | |
Xu et al. | Deep reinforcement learning based mobility load balancing under multiple behavior policies | |
CN110267274A (en) | A kind of frequency spectrum sharing method according to credit worthiness selection sensing user social between user | |
CN114501667A (en) | Multi-channel access modeling and distributed implementation method considering service priority | |
CN114598655A (en) | Mobility load balancing method based on reinforcement learning | |
CN115412134A (en) | Off-line reinforcement learning-based user-centered non-cellular large-scale MIMO power distribution method | |
CN114423070A (en) | D2D-based heterogeneous wireless network power distribution method and system | |
CN111935777A (en) | 5G mobile load balancing method based on deep reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |