CN113365312A - Mobile load balancing method combining reinforcement learning and supervised learning - Google Patents

Mobile load balancing method combining reinforcement learning and supervised learning Download PDF

Info

Publication number
CN113365312A
Authority
CN
China
Prior art keywords
network
base station
action
value
state
Prior art date
Legal status
Granted
Application number
CN202110689823.XA
Other languages
Chinese (zh)
Other versions
CN113365312B (en)
Inventor
潘志文
姚猛
刘楠
尤肖虎
Current Assignee
Southeast University
Network Communication and Security Zijinshan Laboratory
Original Assignee
Southeast University
Network Communication and Security Zijinshan Laboratory
Priority date
Filing date
Publication date
Application filed by Southeast University, Network Communication and Security Zijinshan Laboratory filed Critical Southeast University
Priority to CN202110689823.XA priority Critical patent/CN113365312B/en
Publication of CN113365312A publication Critical patent/CN113365312A/en
Application granted granted Critical
Publication of CN113365312B publication Critical patent/CN113365312B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W28/00Network traffic management; Network resource management
    • H04W28/02Traffic management, e.g. flow control or congestion control
    • H04W28/08Load balancing or load distribution
    • H04W28/086Load balancing or load distribution among access entities
    • H04W28/0861Load balancing or load distribution among access entities between base stations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W28/00Network traffic management; Network resource management
    • H04W28/02Traffic management, e.g. flow control or congestion control
    • H04W28/08Load balancing or load distribution
    • H04W28/09Management thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a mobile load balancing method combining reinforcement learning and supervised learning, which comprises the following steps: first, parameters are initialized; then the round loop starts: the initial state of the system is obtained, the inner step loop of each round runs the reinforcement learning process, and when the step loop ends the method moves on to the next round until the set maximum number of rounds is reached. After the loop ends, the data pool is updated and sampled and the parameters of each actual execution network are updated. Finally, the state of each base station in the system is used as the input of its actual execution network, the output of the network, namely the cell offset value of each base station, is obtained and applied to each base station in the system, and users are handed over according to the A3 event, thereby redistributing users among base stations, reducing the load of overloaded cells and achieving load balance of the system. The invention has higher stability and load balancing capability, as well as better generalization and migration capability.

Description

Mobile load balancing method combining reinforcement learning and supervised learning
Technical Field
The invention belongs to the technical field of wireless networks in mobile communication, relates to a mobile load balancing optimization method, and particularly relates to a mobile load balancing method combining reinforcement learning and supervised learning.
Background
The mobile load balancing technology is an important technology for realizing load balancing among wireless network cells. It transfers users between base stations by adjusting the cell individual offset parameters, thereby achieving load balancing. Reinforcement learning has been applied to the mobile load balancing problem: mobile load balancing based on single-agent or multi-agent reinforcement learning can balance load by adjusting appropriate cell individual offset parameters. However, the single-agent approach incurs substantial signaling overhead from load information exchange, while the multi-agent approach achieves distributed execution but at a high training time cost. Therefore, a suitable mobile load balancing method is needed that avoids the disadvantages of both approaches.
Disclosure of Invention
The purpose of the invention is as follows: to solve the above problems, the invention provides a mobile load balancing method combining reinforcement learning and supervised learning, which is divided into a reinforcement learning stage and a supervised learning stage. In the reinforcement learning stage, the agent is trained with a prioritized experience pool, a normalized reward function and a reward function prediction network; in the supervised learning stage, the action network trained in the reinforcement learning stage is used to train a plurality of actual execution networks through supervised learning, and the cell offset values of all base stations are jointly adjusted to realize load balancing.
The technical scheme is as follows: in order to achieve the above object, the mobile load balancing method combining reinforcement learning and supervised learning of the present invention comprises the following steps:
step 1: initialize parameters, including the learning rate α, the discount factor γ, the soft update factor τ, the mini-batch size K, the prioritized experience pool capacity M, the handover hysteresis parameter Hyst, the number of physical resource blocks N_PRB, the output action value range [O_cmin, O_cmax] of the action network, where O_cmin and O_cmax are the lower and upper bounds of the cell offset, respectively, and the carrier frequency f_c;
step 2: the system initial state acquisition and reinforcement learning process comprises the following steps:
step 2.1, acquiring an initial state from the initialized system environment;
step 2.2, after the initial state is obtained, starting the reinforcement learning process;
step 3: determine whether the current step count has reached the set maximum number of steps; if not, go to step 2.2; otherwise, go to step 4;
step 4: determine whether the current round count has reached the set maximum number of rounds; if not, go to step 2; otherwise, go to step 5;
step 5: use the trained action network as a label generation network, and input all stored states into the label generation network to obtain the corresponding action labels;
step 6: store each state vector and its corresponding action label in a new data pool;
step 7: randomly shuffle the storage order of all state-action label pairs in the new data pool;
step 8: sequentially sample K samples from the new data pool; when fewer than K samples remain, sample all of the remaining data;
step 9: train each actual execution network with a supervised learning method; for each actual execution network, obtain the corresponding local action from the sampled local state and compute the mean square error against the action label, which is used to update the parameters of that actual execution network;
step 10: if the mean square error of each actual execution network has not yet fallen within a suitable error range, the error range being determined by the performance requirement, go to step 8; otherwise, go to step 11;
step 11: use the state of each base station in the system as the input of its corresponding actual execution network to obtain the output of that network, namely the cell offset value of each base station, apply the offsets to the base stations in the system, and hand over users according to the A3 event defined by the radio resource control protocol, thereby redistributing users among base stations, reducing the load of overloaded cells and achieving load balance of the system; a sketch of the overall two-stage flow is given below.
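For orientation, the two-stage procedure of steps 1-11 can be summarized in Python-style pseudocode. The sketch below is illustrative only: the environment, agent and execution-network interfaces (env, agent, exec_nets, net.fit) are hypothetical placeholders, not part of the patent.

```python
import random

def mobile_load_balancing_training(env, agent, exec_nets, max_rounds, max_steps, K):
    """Illustrative outline of the two-stage method; all interfaces are placeholders."""
    # Stage 1: reinforcement learning (steps 2-4)
    visited_states = []
    for round_idx in range(max_rounds):                     # step 4: round loop
        state = env.reset()                                 # step 2.1: initial system state
        for step_idx in range(max_steps):                   # step 3: step loop within a round
            action = agent.select_action(state)             # step 2.2.1
            next_state, reward = env.step(action)           # step 2.2.2: apply cell offsets
            agent.learn(state, action, reward, next_state)  # steps 2.2.3-2.2.13
            visited_states.append(state)
            state = next_state

    # Stage 2: supervised learning (steps 5-10)
    labels = [agent.actor(s) for s in visited_states]       # step 5: label generation network
    pool = list(zip(visited_states, labels))                # step 6: new data pool
    random.shuffle(pool)                                    # step 7: shuffle state-label pairs
    for net in exec_nets:                                   # steps 8-10: per-base-station nets
        net.fit(pool, batch_size=K, loss="mse")

    # Step 11: each base station feeds its own state to its execution network and
    # applies the resulting cell offset; users are handed over via the A3 event.
    return exec_nets
```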
Wherein,
step 2.1, obtaining an initial state from the initialized system environment, including the following processes:
step 2.1.1, calculating the signal-to-interference-and-noise ratio of the user; signal-to-interference-and-noise ratio of user u serving base station b at time t
is

$$\mathrm{SINR}_{u,b}^{t} = \frac{P_b\, h_{u,b}^{t}}{\sum_{x\in\mathcal{I}} P_x\, h_{u,x}^{t} + N_0}$$

where $P_b$ and $P_x$ are the transmit powers of the serving base station b and the interfering base station x, respectively, $\mathcal{I}$ is the set of interfering base stations, $h_{u,b}^{t}$ and $h_{u,x}^{t}$ are the channel gains from user u to the serving base station b and to the interfering base station x at time t, respectively, and $N_0$ is the noise power;
step 2.1.2, calculating the maximum data transmission rate of each user on a physical resource block PRB, wherein the maximum data transmission rate of the user u at the moment t on the physical resource block
is defined as

$$C_{u}^{t} = B_{\mathrm{PRB}}\log_2\!\left(1+\mathrm{SINR}_{u,b}^{t}\right)$$

where $B_{\mathrm{PRB}}$ is the bandwidth of one physical resource block;
step 2.1.3, calculating the number of physical resource blocks required by each user; the number of physical resource blocks required by user u at time t
is

$$N_{u}^{t} = \left\lceil \frac{d_{u}^{t}}{C_{u}^{t}} \right\rceil$$

where $d_{u}^{t}$ denotes the desired rate of user u at time t and $\lceil\cdot\rceil$ denotes the ceiling operation;
step 2.1.4, calculating the load of each base station; load of serving base station b at time t
is defined as

$$\rho_{b}^{t} = \frac{1}{N_b}\sum_{u\in\mathcal{U}_b^{t}} N_{u}^{t}$$

where $N_b$ is the total number of physical resource blocks of the serving base station b and $\mathcal{U}_b^{t}$ is the set of users of serving base station b at time t;
step 2.1.5, obtaining the initial state from the state space $\mathcal{S}$: the state space consists of the de-meaned resource utilization of each base station and the edge user proportion of all base stations, and the state $s_t \in \mathcal{S}$ is defined as

$$s_t = \left[\tilde{\rho}_1^{t},\ \tilde{\rho}_2^{t},\ \ldots,\ \tilde{\rho}_N^{t},\ e^{t}\right]$$

where $\tilde{\rho}_j^{t} = \rho_j^{t} - \bar{\rho}^{t}$ is the de-meaned resource utilization of the jth base station at time t, $\bar{\rho}^{t} = \frac{1}{N}\sum_{j=1}^{N}\rho_j^{t}$ is the average resource utilization of all base stations at time t, $\rho_j^{t}$ is the load of the jth base station at time t, j = 1, …, N, N is the total number of base stations in the system, and $e^{t}$ is the edge user proportion of all base stations at time t, defined by the A4 event specified in the radio resource control protocol RRC.
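To make steps 2.1.1 to 2.1.5 concrete, a minimal numpy sketch of the load and state computation is given below. Linear-scale powers and channel gains, the array shapes, the default PRB bandwidth and count (taken from the embodiment), and the use of a single scalar edge-user proportion are assumptions for illustration, not specifications from the patent.

```python
import numpy as np

def base_station_loads(P_tx, gains, serving, demand_bps, N0, B_prb=180e3, N_prb=111):
    """Compute per-base-station loads rho_b^t (steps 2.1.1-2.1.4).

    P_tx:       (num_bs,) transmit powers in watts
    gains:      (num_users, num_bs) channel gains h_{u,b}^t (linear scale)
    serving:    (num_users,) index of each user's serving base station
    demand_bps: (num_users,) desired rate d_u^t of each user
    """
    num_users, num_bs = gains.shape
    rx = gains * P_tx[None, :]                        # received power from every BS
    sig = rx[np.arange(num_users), serving]           # serving-cell signal
    interf = rx.sum(axis=1) - sig                     # all other cells interfere
    sinr = sig / (interf + N0)                        # step 2.1.1
    rate_per_prb = B_prb * np.log2(1.0 + sinr)        # step 2.1.2
    prbs_needed = np.ceil(demand_bps / rate_per_prb)  # step 2.1.3
    loads = np.array([prbs_needed[serving == b].sum() / N_prb
                      for b in range(num_bs)])        # step 2.1.4
    return loads

def state_vector(loads, edge_user_fraction):
    """Step 2.1.5: de-meaned loads of all base stations plus the edge-user proportion."""
    return np.concatenate([loads - loads.mean(), [edge_user_fraction]])
```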
After the initial state is obtained in the step 2.2, the reinforcement learning process is started, and the method comprises the following processes:
step 2.2.1, selecting an action;
the action $a_t$ is defined as the cell offsets of the base stations:

$$a_t = \mu\!\left(s_t \mid \theta_a\right),\qquad a_t \in \mathcal{A}$$

where $s_t \in \mathcal{S}$ is the state of the system at time t, μ(·) is the deterministic policy of the estimated action network, and $\theta_a$ are the parameters of the estimated action network; the action space $\mathcal{A}$ of the agent is defined as

$$\mathcal{A} = \left\{\left(O_{c1}, O_{c2}, \ldots, O_{cN}\right)\right\}$$

where $O_{cj}$ is the cell offset value of the jth base station, $O_{cj} \in [O_{cmin}, O_{cmax}]$, and $O_{cmin}$ and $O_{cmax}$ are the lower and upper bounds of the cell offset, respectively; the cell offset values of the zth base station and the jth base station satisfy $O_{cz} - O_{cj} = O_{zj} = -O_{jz} = -(O_{cj} - O_{cz})$;
Step 2.2.2, the action $a_t$ interacts with the environment to obtain a reward value $r_t$, and the state $s_{t+1}$ of the system at the next time t+1 is observed; the reward $r_t$ is computed by a normalized reward function of the loads of all base stations, where $\rho_b^{t}$ is the load of serving base station b at time t and BS is the set of all base stations in the system;
step 2.2.3, calculate the time difference error $\delta_t$ of the state-action pair at time t:

$$\delta_t = r_t + \gamma\, Q'\!\left(s_{t+1}, \mu'\!\left(s_{t+1}\mid\theta_a'\right)\mid\theta_c'\right) - Q\!\left(s_t, a_t\mid\theta_c\right)$$

where Q(·) is the evaluation value output by the estimated evaluation network and $\theta_c$ are the parameters of the estimated evaluation network; Q'(·) is the evaluation value output by the target evaluation network, $s_{t+1}$ is the state of the system at time t+1, μ'(·) is the deterministic policy of the target action network, $\theta_a'$ and $\theta_c'$ are the parameters of the target action network and the target evaluation network, respectively, and γ is the discount factor;
step 2.2.4, store the sample $(s_t, a_t, r_t, s_{t+1})$ together with its initial priority in the prioritized experience pool, which is constructed as a binary tree; the initial sample priority is uniformly set to 1, and as the number of iterations increases, the initial priority of a newly stored sample is set to the maximum value obtained by comparison with $\delta_t$; then the state $s_t$ of the current system at time t transitions to the state $s_{t+1}$ at the next time t+1, and all experienced states are stored; if the number of samples in the prioritized experience pool has reached the mini-batch size K, go to step 2.2.5, otherwise go to step 2.2.1; a sketch of such a binary-tree experience pool is given below;
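The binary-tree experience pool of step 2.2.4 is commonly realized as a sum tree, where each leaf stores one sample priority and each internal node stores the sum of its children, so that sampling proportional to priority takes O(log M). The layout below is an assumption for illustration; the patent only specifies that a binary tree is used.

```python
import numpy as np

class SumTree:
    """Minimal binary sum-tree for the prioritized experience pool (step 2.2.4)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes + leaves
        self.data = [None] * capacity            # stored samples (s, a, r, s_next)
        self.write = 0
        self.size = 0

    def add(self, priority, sample):
        idx = self.write + self.capacity - 1     # leaf index for this slot
        self.data[self.write] = sample
        self.update(idx, priority)
        self.write = (self.write + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def update(self, idx, priority):
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx != 0:                          # propagate the change up to the root
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def sample(self, value):
        """Walk down from the root; 'value' is uniform in [0, total priority)."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    @property
    def total(self):
        return self.tree[0]                      # sum of all priorities
```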
2.2.5, sampling K samples from the priority experience pool according to the sample priority, wherein the samples with higher priorities have higher sampling probability;
step 2.2.6, for the sampled K samples, the probability that the ith sample is sampled is calculated according to the following formula:
$$P(i) = \frac{p_i^{\alpha'}}{\sum_{m} p_m^{\alpha'}}$$

where m denotes the sample index, $p_m$ is the priority of the mth sample, α' is the sampling priority impact factor, $p_i = |\delta_i| + \epsilon$ is the priority of the ith sample, $\delta_i$ is the time difference error of the ith sample, and ε is a small positive constant;
step 2.2.7, calculate the normalized importance sampling weight of the ith sample: $w_i = \left(P(i)/P_{\min}\right)^{-\beta}$, where $P_{\min}$ is the smallest sampling probability among all samples and β is a gradually changing convergence factor;
step 2.2.8, calculate the corrected reward value of the ith sample: $r_i' = r_i + \eta\, R\!\left(s_i, a_i \mid \theta_r\right)$, where η is a bias factor, $r_i$ is the reward value contained in the ith sample, R(·) is the mapping of the estimated reward function prediction network, $s_i$ is the state contained in the ith sample, $a_i$ is the action contained in the ith sample, and $\theta_r$ are the parameters of the estimated reward function prediction network;
step 2.2.9, use the corrected reward value $r_i'$ to update $\delta_i$ and $p_i$, and update the priorities of the K sampled samples according to $p_i$; a sketch of steps 2.2.5 to 2.2.9 is given below;
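Steps 2.2.5 to 2.2.9 can then be sketched on top of such a sum tree. The stratified sampling strategy, the default parameter values (mirroring the embodiment below) and the reward_net callable standing in for R(s, a | θ_r) are assumptions for illustration; the leaves are assumed to already store $p_i^{\alpha'}$ so that sampling is proportional to that quantity.

```python
import numpy as np

def sample_minibatch(tree, K=64, beta=0.01, eta=1e-4, reward_net=None):
    """Steps 2.2.5-2.2.8: priority-proportional sampling, importance-sampling
    weights and reward correction. Samples stored in the tree are (s, a, r, s_next)."""
    segment = tree.total / K
    idxs, priorities, batch = [], [], []
    for k in range(K):                              # stratified sampling (assumption)
        v = np.random.uniform(k * segment, (k + 1) * segment)
        idx, p, sample = tree.sample(v)
        idxs.append(idx); priorities.append(p); batch.append(sample)

    probs = np.array(priorities) / tree.total       # step 2.2.6: P(i)
    # step 2.2.7: (P(i)/P_min)^(-beta), with P_min approximated within the mini-batch
    weights = (probs / probs.min()) ** (-beta)

    corrected = []                                  # step 2.2.8: r'_i = r_i + eta * R(s_i, a_i)
    for (s, a, r, s_next) in batch:
        r_hat = reward_net(s, a) if reward_net is not None else 0.0
        corrected.append(r + eta * r_hat)
    return idxs, batch, weights, np.array(corrected)

def update_priorities(tree, idxs, td_errors, eps=1e-5, alpha_p=0.9):
    """Step 2.2.9: write p_i = (|delta_i| + eps)^alpha' back into the tree,
    so that subsequent sampling is proportional to p_i^alpha'."""
    for idx, delta in zip(idxs, td_errors):
        tree.update(idx, (abs(delta) + eps) ** alpha_p)
```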
step 2.2.10, update the parameters $\theta_c$ of the estimated evaluation network; the process is as follows: using the mini-batch gradient descent method, the loss function of the estimated evaluation network is

$$L(\theta_c) = \frac{1}{K}\sum_{i=1}^{K} w_i\left(r_i' + \gamma\, Q'\!\left(s_i', \mu'\!\left(s_i'\mid\theta_a'\right)\mid\theta_c'\right) - Q\!\left(s_i, a_i\mid\theta_c\right)\right)^2$$

where $s_i'$ is the next-time state contained in the ith sample, μ'(·) is the deterministic policy of the target action network, Q'(·) is the evaluation value output by the target evaluation network, and Q(·) is the evaluation value output by the estimated evaluation network; the gradient $\nabla_{\theta_c} L(\theta_c)$ is back-propagated to update the parameters $\theta_c$ of the estimated evaluation network; after the update is completed, go to step 2.2.11;
step 2.2.11, update the parameters $\theta_a$ of the estimated action network; the process is as follows: the gradient of the loss function of the estimated action network is

$$\nabla_{\theta_a} J \approx \frac{1}{K}\sum_{i=1}^{K} \nabla_{a} Q\!\left(s, a\mid\theta_c\right)\Big|_{s=s_i,\,a=\mu(s_i\mid\theta_a)}\; \nabla_{\theta_a}\mu\!\left(s\mid\theta_a\right)\Big|_{s=s_i}$$

where μ(·) is the deterministic policy of the estimated action network; the gradient $\nabla_{\theta_a} J$ is back-propagated to update the parameters $\theta_a$ of the estimated action network; after the update is completed, go to step 2.2.12;
step 2.2.12, update the parameters $\theta_r$ of the estimated reward function prediction network; the process is as follows: using the mini-batch gradient descent method, the loss function of the estimated reward function prediction network is

$$L(\theta_r) = \frac{1}{K}\sum_{i=1}^{K} \left(r_i + \gamma\, R'\!\left(s_i', \mu'\!\left(s_i'\mid\theta_a'\right)\mid\theta_r'\right) - R\!\left(s_i, a_i\mid\theta_r\right)\right)^2$$

where R(·) is the mapping of the estimated reward function prediction network, R'(·) is the mapping of the target reward function prediction network, μ'(·) is the deterministic policy of the target action network, and $\theta_r'$ are the parameters of the target reward function prediction network; the gradient $\nabla_{\theta_r} L(\theta_r)$ is back-propagated to update the parameters $\theta_r$ of the estimated reward function prediction network; after the update is completed, go to step 2.2.13;
step 2.2.13, soft update the parameters of the target action network, the target evaluation network and the target reward function prediction network:

$$\theta_c' \leftarrow \tau\,\theta_c + (1-\tau)\,\theta_c'$$

$$\theta_a' \leftarrow \tau\,\theta_a + (1-\tau)\,\theta_a'$$

$$\theta_r' \leftarrow \tau\,\theta_r + (1-\tau)\,\theta_r'$$

where $\theta_c'$, $\theta_a'$ and $\theta_r'$ are the parameters of the target evaluation network, the target action network and the target reward function prediction network, respectively, $\theta_c$, $\theta_a$ and $\theta_r$ are the parameters of the estimated evaluation network, the estimated action network and the estimated reward function prediction network, respectively, and τ is the soft update parameter.
In step 2.2.2, the reward value lies in the range [-1, 1].
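Putting steps 2.2.10 to 2.2.13 together, one parameter update can be sketched in PyTorch. Because the exact loss expressions appear only as images in the source, the targets used here follow the reconstruction above and form a hedged sketch, not the authors' reference implementation; the network objects, the optimizer dictionary and the tensor shapes (importance weights and corrected rewards of shape (K, 1)) are placeholders.

```python
import torch
import torch.nn.functional as F

def one_update(actor, critic, reward_net,
               target_actor, target_critic, target_reward_net,
               optimizers, batch, weights, corrected_r,
               gamma=0.99, tau=0.001):
    """One pass over steps 2.2.10-2.2.13; batch, weights and corrected_r are torch tensors."""
    s, a, r, s2 = batch          # states, actions, rewards, next states
    w = weights                  # importance-sampling weights w_i, shape (K, 1)
    rc = corrected_r             # corrected rewards r'_i, shape (K, 1)

    # step 2.2.10: estimated evaluation (critic) network
    with torch.no_grad():
        y = rc + gamma * target_critic(s2, target_actor(s2))
    critic_loss = (w * (y - critic(s, a)) ** 2).mean()
    optimizers["critic"].zero_grad(); critic_loss.backward(); optimizers["critic"].step()

    # step 2.2.11: estimated action (actor) network
    actor_loss = -critic(s, actor(s)).mean()
    optimizers["actor"].zero_grad(); actor_loss.backward(); optimizers["actor"].step()

    # step 2.2.12: estimated reward function prediction network
    with torch.no_grad():
        y_r = r + gamma * target_reward_net(s2, target_actor(s2))
    reward_loss = F.mse_loss(reward_net(s, a), y_r)
    optimizers["reward"].zero_grad(); reward_loss.backward(); optimizers["reward"].step()

    # step 2.2.13: soft update of the three target networks
    for net, target in ((critic, target_critic), (actor, target_actor),
                        (reward_net, target_reward_net)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
```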
Beneficial effects: compared with the prior art, the invention has the following advantages and beneficial effects:
The method requires no prior knowledge of the wireless environment and automatically learns the optimal mobile load balancing (MLB) strategy by exploring the environment. By adopting a normalized reward function, a prioritized experience pool and a reward function prediction network, the invention achieves higher stability and load balancing capability. At the same time, the supervised learning stage realizes the effect of distributed execution and avoids the high training cost of multi-agent reinforcement learning, which is significant in real network scenarios, and the method also has better generalization and migration capability.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following detailed description is only illustrative and not intended to limit the scope of the present invention.
The invention discloses a mobile load balancing method combining reinforcement learning and supervised learning, which comprises the following steps:
step 1: and (5) initializing. The present invention is illustrated with the following parameter settings as examples:
step 1.1, the invention is explained taking the following network initialization parameters as an example: the action network, the evaluation network and the reward function prediction network each contain two hidden layers with 400 and 300 neurons, respectively, and their learning rates α are all 10^-3. The discount factor γ is 0.99 and the soft update factor τ is 0.001. The mini-batch size K is 64, the prioritized experience pool capacity M is 10000, the positive constant ε is 1e-5, the bias factor η is 0.0001, and the sampling priority impact factor α' is 0.9. There are 7 actual execution networks, each containing five hidden layers with 600, 400, 400 and 300 neurons, respectively. The initial learning rate of each actual execution network is 10^-3. The capacity of the new data pool is 60000.
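A possible PyTorch realization of the networks sized in step 1.1 is sketched below. The hidden-layer widths come from the text, while the ReLU/Tanh activations and the output scaling to the cell-offset range are assumptions. The reward function prediction network can reuse the Critic shape, since it also maps a state-action pair to a scalar.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Action network with two hidden layers of 400 and 300 neurons (step 1.1).
    The Tanh output is scaled to the cell-offset bound o_max (assumed activation)."""
    def __init__(self, state_dim, action_dim, o_max=1.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),
        )
        self.o_max = o_max

    def forward(self, state):
        return self.o_max * self.net(state)

class Critic(nn.Module):
    """Evaluation network: same hidden sizes, takes (state, action) and outputs Q."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```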
Step 1.2, the present invention takes the following initialization parameters of the system environment as an exampleThe description is as follows: the system environment comprises 7 base stations distributed in an area of 60 meters by 60 meters. The area contains 200 users, and is in a free walking state, and the speed is between 1.0m/s and 1.2 m/s. It is assumed that each user is a guaranteed bit rate user and the bit rate is 128 kb/s. The transmit power of each base station is 20 dBm. The path loss modeling case is as follows: 38.3 × log (d) +17.3+24.9 × log (f) in the case of non-line of sightc) In the case of visual range, 17.3 × log (d) +32.4+20 × log (f)c) Wherein d is the distance from the user to the base station and is measured in meters; f. ofcFor the carrier frequency, 3.5GHz is chosen. The shadow fading is modeled as a log normal distribution with a mean of 0 and a standard deviation of 8 dB. Switching hysteresis parameter HystSet to 2 dB. Number N of Physical Resource Blocks (PRB) of each base stationPRB111, output action value range [ O ] of action networkcmin,Ocmax]In which O iscmin、 OcmaxRespectively, a cell bias lower bound value and an upper bound value, [ O ]cmin,Ocmax]Initialized to [ -1.5dB,1.5 dB)]。
Step 2: the system initial state acquisition and reinforcement learning process comprises the following steps:
step 2.1, obtaining an initial state from the initialized system environment, and comprising the following processes:
step 2.1.1, calculating the signal-to-interference-and-noise ratio of the user; signal-to-interference-and-noise ratio of user u serving base station b at time t
is

$$\mathrm{SINR}_{u,b}^{t} = \frac{P_b\, h_{u,b}^{t}}{\sum_{x\in\mathcal{I}} P_x\, h_{u,x}^{t} + N_0}$$

where $P_b$ and $P_x$ are the transmit powers of the user's serving base station b and interfering base station x, respectively, both 20 dBm in this example; $\mathcal{I}$ is the set of interfering base stations; $h_{u,b}^{t}$ and $h_{u,x}^{t}$ are the channel gains from user u to the serving base station b and to the interfering base station x at time t, respectively; and $N_0$ is the noise power, 3.9811 × 10^-13 in this example;
Step 2.1.2, calculating the maximum data transmission rate of each user on a physical resource block PRB, wherein the maximum data transmission rate of the user u at the moment t on the physical resource block
is defined as

$$C_{u}^{t} = B_{\mathrm{PRB}}\log_2\!\left(1+\mathrm{SINR}_{u,b}^{t}\right)$$

where $B_{\mathrm{PRB}}$ is the bandwidth of one physical resource block, 180 kHz in this example;
step 2.1.3, calculating the number of physical resource blocks required by each user; the number of physical resource blocks required by user u at time t
is

$$N_{u}^{t} = \left\lceil \frac{d_{u}^{t}}{C_{u}^{t}} \right\rceil$$

where $d_{u}^{t}$ denotes the desired rate of user u at time t, 128 kb/s in this example, and $\lceil\cdot\rceil$ denotes the ceiling operation;
step 2.1.4, calculating the load of each base station; load of serving base station b at time t
is the ratio of the total number of physical resource blocks required by all of its users to the total number of physical resource blocks of the base station, defined as

$$\rho_{b}^{t} = \frac{1}{N_b}\sum_{u\in\mathcal{U}_b^{t}} N_{u}^{t}$$

where $N_b$ is the total number of physical resource blocks of the serving base station b, 111 in this example, and $\mathcal{U}_b^{t}$ is the set of users of serving base station b at time t;
step 2.1.5, obtaining the initial state from the state space $\mathcal{S}$: the state space consists of the de-meaned resource utilization of each base station and the edge user proportion of all base stations, and the state $s_t \in \mathcal{S}$ is defined as

$$s_t = \left[\tilde{\rho}_1^{t},\ \tilde{\rho}_2^{t},\ \ldots,\ \tilde{\rho}_N^{t},\ e^{t}\right]$$

where $\tilde{\rho}_j^{t} = \rho_j^{t} - \bar{\rho}^{t}$ is the de-meaned resource utilization of the jth base station at time t, $\bar{\rho}^{t} = \frac{1}{N}\sum_{j=1}^{N}\rho_j^{t}$ is the average resource utilization of all base stations at time t, $\rho_j^{t}$ is the load of the jth base station at time t, N is the total number of base stations in the system, N = 7 in this example, and $e^{t}$ is the edge user proportion of all base stations at time t, defined by the A4 event specified in the Radio Resource Control (RRC) protocol;
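The edge-user proportion e^t used in the state can be obtained from per-user measurements. The sketch below counts a user as an edge user when the A4 entering condition ("neighbour becomes better than threshold") holds for at least one neighbour cell, i.e. the neighbour measurement minus hysteresis exceeds a threshold. The threshold value and the omission of cell-specific offsets are assumptions, not taken from the patent.

```python
import numpy as np

def edge_user_fraction(rsrp_dbm, serving, a4_threshold_dbm=-100.0, hysteresis_db=2.0):
    """Proportion of edge users over all base stations (used in the state of step 2.1.5).

    rsrp_dbm: (num_users, num_bs) measured RSRP of every cell at every user
    serving:  (num_users,) serving-cell index of each user
    """
    num_users, num_bs = rsrp_dbm.shape
    neighbours = rsrp_dbm.copy()
    neighbours[np.arange(num_users), serving] = -np.inf   # exclude the serving cell
    # simplified A4 entering condition: Mn - Hys > Thresh for the best neighbour
    is_edge = (neighbours.max(axis=1) - hysteresis_db > a4_threshold_dbm)
    return is_edge.mean()
```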
step 2.2, after the initial state is obtained, starting a reinforcement learning process, which comprises the following procedures:
step 2.2.1, selecting an action;
the action $a_t$ is defined as the cell offsets of the base stations:

$$a_t = \mu\!\left(s_t \mid \theta_a\right),\qquad a_t \in \mathcal{A}$$

where $s_t \in \mathcal{S}$ is the state of the system at time t, μ(·) is the deterministic policy of the estimated action network, and $\theta_a$ are the parameters of the estimated action network; the action space $\mathcal{A}$ of the agent is defined as

$$\mathcal{A} = \left\{\left(O_{c1}, O_{c2}, \ldots, O_{cN}\right)\right\}$$

where $O_{cj}$ is the cell offset value of the jth base station and $O_{cj} \in [O_{cmin}, O_{cmax}]$, with $O_{cmin}$ = -1.5 dB and $O_{cmax}$ = 1.5 dB in this example; the cell offset values of the zth base station and the jth base station satisfy $O_{cz} - O_{cj} = O_{zj} = -O_{jz} = -(O_{cj} - O_{cz})$;
Step 2.2.2, the action $a_t$ interacts with the environment to obtain a reward value $r_t$, and the state $s_{t+1}$ of the system at the next time t+1 is observed; the reward $r_t$ is computed by a normalized reward function of the loads of all base stations, where $\rho_b^{t}$ is the load of the serving base station b at time t and BS is the set of all base stations in the system; the reward value is designed to lie in the range [-1, 1], which benefits the convergence and stability of the method;
step 2.2.3, calculate the time difference error $\delta_t$ of the state-action pair at time t:

$$\delta_t = r_t + \gamma\, Q'\!\left(s_{t+1}, \mu'\!\left(s_{t+1}\mid\theta_a'\right)\mid\theta_c'\right) - Q\!\left(s_t, a_t\mid\theta_c\right)$$

where Q(·) is the evaluation value output by the estimated evaluation network and $\theta_c$ are the parameters of the estimated evaluation network, whose initial values are determined by system initialization; Q'(·) is the evaluation value output by the target evaluation network, $s_{t+1}$ is the state of the system at time t+1, μ'(·) is the deterministic policy of the target action network, $\theta_a'$ and $\theta_c'$ are the parameters of the target action network and the target evaluation network, respectively, whose initial values are determined by system initialization, and γ is the discount factor; the larger the absolute value of the time difference error $\delta_t$, the higher the priority;
step 2.2.4, store the sample $(s_t, a_t, r_t, s_{t+1})$ together with its initial priority in the prioritized experience pool, which is constructed as a binary tree; the initial sample priority is uniformly set to 1, and as the number of iterations increases, the initial priority of a newly stored sample is set to the maximum value obtained by comparison with $\delta_t$; then the current state $s_t$ transitions to the state $s_{t+1}$ at the next time t+1, and all experienced states are stored; if the number of samples in the prioritized experience pool has reached the mini-batch size K (K = 64 in this example), go to step 2.2.5, otherwise go to step 2.2.1;
2.2.5, selecting K samples from the priority experience pool according to the priority of the samples, wherein the higher the priority of the samples is, the higher the probability of sampling is, and K is 64 in the example;
step 2.2.6, for the sampled K samples, calculating the probability that the ith sample is sampled:
$$P(i) = \frac{p_i^{\alpha'}}{\sum_{m} p_m^{\alpha'}}$$

where m denotes the sample index, $p_m$ is the priority of the mth sample, α' is the sampling priority impact factor, 0.9 in this example, $p_i = |\delta_i| + \epsilon$ is the priority of the ith sample, $\delta_i$ is the time difference error of the ith sample, and ε is a small positive constant;
step 2.2.7, calculating the normalized importance sampling weight of the ith sample:
$$w_i = \left(\frac{M\,P(i)}{M\,P_{\min}}\right)^{-\beta} = \left(\frac{P(i)}{P_{\min}}\right)^{-\beta}$$

where $P_{\min}$ is the minimum sampling probability among all samples, M is the capacity of the prioritized experience pool, 10000 in this example, and β is a gradually changing convergence factor with initial value 0.01 in this example;
step 2.2.8, calculate the corrected reward value of the ith sample: $r_i' = r_i + \eta\, R\!\left(s_i, a_i \mid \theta_r\right)$, where η is a bias factor, 0.0001 in this example; $r_i$ is the reward value contained in the ith sample, R(·) is the mapping of the estimated reward function prediction network, $s_i$ is the state contained in the ith sample, $a_i$ is the action contained in the ith sample, and $\theta_r$ are the parameters of the estimated reward function prediction network;
step 2.2.9, update delta with the revised prize valueiAnd piAnd according to piUpdating the priorities of the sampled K samples;
step 2.2.10, update the parameters $\theta_c$ of the estimated evaluation network: using the mini-batch gradient descent method, the loss function of the estimated evaluation network is defined as

$$L(\theta_c) = \frac{1}{K}\sum_{i=1}^{K} w_i\left(r_i' + \gamma\, Q'\!\left(s_i', \mu'\!\left(s_i'\mid\theta_a'\right)\mid\theta_c'\right) - Q\!\left(s_i, a_i\mid\theta_c\right)\right)^2$$

where $s_i'$ is the next-time state contained in the ith sample, Q'(·) is the evaluation value output by the target evaluation network, and Q(·) is the evaluation value output by the estimated evaluation network; the gradient $\nabla_{\theta_c} L(\theta_c)$ is estimated with the mini-batch gradient descent method and back-propagated to update the parameters $\theta_c$ of the estimated evaluation network; after the update is completed, go to step 2.2.11;
step 2.2.11, update the parameters $\theta_a$ of the estimated action network: the gradient of the loss function of the estimated action network is

$$\nabla_{\theta_a} J \approx \frac{1}{K}\sum_{i=1}^{K} \nabla_{a} Q\!\left(s, a\mid\theta_c\right)\Big|_{s=s_i,\,a=\mu(s_i\mid\theta_a)}\; \nabla_{\theta_a}\mu\!\left(s\mid\theta_a\right)\Big|_{s=s_i}$$

where μ(·) is the deterministic policy of the estimated action network; the gradient $\nabla_{\theta_a} J$ is back-propagated to update the parameters $\theta_a$ of the estimated action network; after the update is completed, go to step 2.2.12;
step 2.2.12, update the parameters $\theta_r$ of the estimated reward function prediction network: using the mini-batch gradient descent method, the loss function of the estimated reward function prediction network is defined as

$$L(\theta_r) = \frac{1}{K}\sum_{i=1}^{K} \left(r_i + \gamma\, R'\!\left(s_i', \mu'\!\left(s_i'\mid\theta_a'\right)\mid\theta_r'\right) - R\!\left(s_i, a_i\mid\theta_r\right)\right)^2$$

where R(·) is the mapping of the estimated reward function prediction network, R'(·) is the mapping of the target reward function prediction network, μ'(·) is the deterministic policy of the target action network, and $\theta_r'$ are the parameters of the target reward function prediction network; the gradient $\nabla_{\theta_r} L(\theta_r)$ is estimated with the mini-batch gradient descent method and back-propagated to update the parameters $\theta_r$ of the estimated reward function prediction network; after the update is completed, go to step 2.2.13;
step 2.2.13, soft update the parameters of the target action network, the target evaluation network and the target reward function prediction network:

$$\theta_c' \leftarrow \tau\,\theta_c + (1-\tau)\,\theta_c'$$

$$\theta_a' \leftarrow \tau\,\theta_a + (1-\tau)\,\theta_a'$$

$$\theta_r' \leftarrow \tau\,\theta_r + (1-\tau)\,\theta_r'$$

where $\theta_c'$, $\theta_a'$ and $\theta_r'$ are the parameters of the target evaluation network, the target action network and the target reward function prediction network, respectively, $\theta_c$, $\theta_a$ and $\theta_r$ are the parameters of the estimated evaluation network, the estimated action network and the estimated reward function prediction network, respectively, and τ is the soft update parameter, 0.001 in this example;
step 3: determine whether the current step count has reached the set maximum number of steps; if not, go to step 2.2; otherwise, go to step 4; the maximum number of steps in this example is 100;
step 4: determine whether the current round count has reached the set maximum number of rounds; if not, go to step 2; otherwise, go to step 5; the maximum number of rounds in this example is 600;
step 5: use the trained action network as a label generation network, and input all stored states into the label generation network to obtain the corresponding action labels;
step 6: store each state vector and its corresponding action label in a new data pool; the new data pool capacity is 60000 in this example;
step 7: randomly shuffle the storage order of all state-action label pairs in the new data pool;
step 8: sequentially sample K samples from the new data pool; when fewer than K samples remain, sample all of the remaining data;
step 9: train each actual execution network with a supervised learning method; for each actual execution network, obtain the corresponding local action from the sampled local state and compute the mean square error against the action label, which is used to update the parameters of that actual execution network;
step 10: if the mean square error of each actual execution network has not yet fallen within a suitable error range, the error range being determined by the performance requirement, go to step 8; otherwise, go to step 11; a sketch of this supervised learning stage is given below;
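A compact PyTorch sketch of steps 5 to 10 follows. The local_slice callable (which should extract base station j's local view of the state), the Adam optimizer, the stopping threshold and the epoch cap are illustrative assumptions; the label network is the action network trained in the reinforcement learning stage.

```python
import random
import torch
import torch.nn as nn

def train_execution_networks(exec_nets, states, label_net, K=64, lr=1e-3,
                             target_mse=1e-3, max_epochs=200,
                             local_slice=lambda s, j: s):
    """Steps 5-10: supervised training of the per-base-station execution networks.
    local_slice(s, j) should return base station j's local state; the identity
    default is only a placeholder."""
    with torch.no_grad():
        labels = label_net(states)                       # step 5: action labels
    pool = list(zip(states, labels))                     # step 6: new data pool
    random.shuffle(pool)                                 # step 7: shuffle pairs

    loss_fn = nn.MSELoss()
    for j, net in enumerate(exec_nets):                  # one network per base station
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for epoch in range(max_epochs):                  # step 10: repeat until MSE is small
            epoch_loss = 0.0
            for start in range(0, len(pool), K):         # step 8: mini-batches of size K
                batch = pool[start:start + K]            # the last batch may be smaller
                s = torch.stack([local_slice(x, j) for x, _ in batch])
                y = torch.stack([lbl[j:j + 1] for _, lbl in batch])
                loss = loss_fn(net(s), y)                # step 9: MSE against the labels
                opt.zero_grad(); loss.backward(); opt.step()
                epoch_loss += loss.item() * len(batch)
            if epoch_loss / len(pool) < target_mse:
                break
    return exec_nets
```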
step 11: use the state of each base station in the system as the input of its corresponding actual execution network to obtain the output of that network, namely the cell offset value of each base station, apply the offsets to the base stations in the system, and hand over users according to the A3 event defined by the RRC protocol, thereby redistributing users among base stations, reducing the load of overloaded cells and achieving load balance of the system.
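Finally, the user re-association of step 11 can be expressed with the A3 entering condition ("neighbour becomes offset better than serving"). The simplified condition below omits cell-specific frequency offsets and the a3-offset; it is a sketch of the handover rule under these assumptions, not the full RRC procedure.

```python
import numpy as np

def a3_handover_targets(rsrp_dbm, serving, offsets_db, hyst_db=2.0):
    """Step 11 sketch: a user switches when a neighbour's biased measurement exceeds
    the serving cell's biased measurement by more than the hysteresis.

    rsrp_dbm:   (num_users, num_bs) measured RSRP of every cell at every user
    serving:    (num_users,) current serving-cell index of each user
    offsets_db: (num_bs,) cell offsets O_cj produced by the execution networks
    """
    num_users, _ = rsrp_dbm.shape
    biased = rsrp_dbm + offsets_db[None, :]               # apply each cell's offset
    serving_val = biased[np.arange(num_users), serving]
    best = biased.argmax(axis=1)
    handover = biased.max(axis=1) > serving_val + hyst_db
    return np.where(handover, best, serving)              # new serving cell per user
```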
The method can effectively realize load balancing of the network, improve the robustness and stability of the network and reduce load fluctuation, while achieving the effect of distributed execution and avoiding the high training cost of multi-agent reinforcement learning.
It will be understood by those skilled in the art that, unless otherwise defined, all terms including technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The technical means disclosed in the invention are not limited to those disclosed in the above embodiments, but also include technical schemes formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims (4)

1. A mobile load balancing method combining reinforcement learning and supervised learning, characterized by comprising the following steps:
step 1: initialize parameters, including the learning rate α, the discount factor γ, the soft update factor τ, the mini-batch size K, the prioritized experience pool capacity M, the handover hysteresis parameter Hyst, the number of physical resource blocks N_PRB, the output action value range [O_cmin, O_cmax] of the action network, where O_cmin and O_cmax are the lower and upper bounds of the cell offset, respectively, and the carrier frequency f_c;
step 2: the system initial state acquisition and reinforcement learning process comprises the following steps:
step 2.1, acquiring an initial state from the initialized system environment;
step 2.2, after the initial state is obtained, starting the reinforcement learning process;
step 3: determine whether the current step count has reached the set maximum number of steps; if not, go to step 2.2; otherwise, go to step 4;
step 4: determine whether the current round count has reached the set maximum number of rounds; if not, go to step 2; otherwise, go to step 5;
step 5: use the trained action network as a label generation network, and input all stored states into the label generation network to obtain the corresponding action labels;
step 6: store each state vector and its corresponding action label in a new data pool;
step 7: randomly shuffle the storage order of all state-action label pairs in the new data pool;
step 8: sequentially sample K samples from the new data pool; when fewer than K samples remain, sample all of the remaining data;
step 9: train each actual execution network with a supervised learning method; for each actual execution network, obtain the corresponding local action from the sampled local state and compute the mean square error against the action label, which is used to update the parameters of that actual execution network;
step 10: if the mean square error of each actual execution network has not yet fallen within a suitable error range, the error range being determined by the performance requirement, go to step 8; otherwise, go to step 11;
step 11: use the state of each base station in the system as the input of its corresponding actual execution network to obtain the output of that network, namely the cell offset value of each base station, apply the offsets to the base stations in the system, and hand over users according to the A3 event defined by the radio resource control protocol, thereby redistributing users among base stations, reducing the load of overloaded cells and achieving load balance of the system.
2. The method for mobile load balancing with reinforcement learning and supervised learning combined according to claim 1, wherein the step 2.1 of obtaining the initial state from the initialized system environment includes the following steps:
step 2.1.1, calculating the signal-to-interference-and-noise ratio of each user; the signal-to-interference-and-noise ratio of user u served by base station b at time t is

$$\mathrm{SINR}_{u,b}^{t} = \frac{P_b\, h_{u,b}^{t}}{\sum_{x\in\mathcal{I}} P_x\, h_{u,x}^{t} + N_0}$$

where $P_b$ and $P_x$ are the transmit powers of the serving base station b and the interfering base station x, respectively, $\mathcal{I}$ is the set of interfering base stations, $h_{u,b}^{t}$ and $h_{u,x}^{t}$ are the channel gains from user u to the serving base station b and to the interfering base station x at time t, respectively, and $N_0$ is the noise power;
step 2.1.2, calculating the maximum data transmission rate of each user on a physical resource block PRB, wherein the maximum data transmission rate of the user u at the moment t on the physical resource block
is defined as

$$C_{u}^{t} = B_{\mathrm{PRB}}\log_2\!\left(1+\mathrm{SINR}_{u,b}^{t}\right)$$

where $B_{\mathrm{PRB}}$ is the bandwidth of one physical resource block;
step 2.1.3, calculating the number of physical resource blocks required by each user; the number of physical resource blocks required by user u at time t
is

$$N_{u}^{t} = \left\lceil \frac{d_{u}^{t}}{C_{u}^{t}} \right\rceil$$

where $d_{u}^{t}$ denotes the desired rate of user u at time t and $\lceil\cdot\rceil$ denotes the ceiling operation;
step 2.1.4, calculating the load of each base station; load of serving base station b at time t
is defined as

$$\rho_{b}^{t} = \frac{1}{N_b}\sum_{u\in\mathcal{U}_b^{t}} N_{u}^{t}$$

where $N_b$ is the total number of physical resource blocks of the serving base station b and $\mathcal{U}_b^{t}$ is the set of users of serving base station b at time t;
step 2.1.5, obtaining the initial state from the state space $\mathcal{S}$: the state space consists of the de-meaned resource utilization of each base station and the edge user proportion of all base stations, and the state $s_t \in \mathcal{S}$ is defined as

$$s_t = \left[\tilde{\rho}_1^{t},\ \tilde{\rho}_2^{t},\ \ldots,\ \tilde{\rho}_N^{t},\ e^{t}\right]$$

where $\tilde{\rho}_j^{t} = \rho_j^{t} - \bar{\rho}^{t}$ is the de-meaned resource utilization of the jth base station at time t, $\bar{\rho}^{t} = \frac{1}{N}\sum_{j=1}^{N}\rho_j^{t}$ is the average resource utilization of all base stations at time t, $\rho_j^{t}$ is the load of the jth base station at time t, j = 1, …, N, N is the total number of base stations in the system, and $e^{t}$ is the edge user proportion of all base stations at time t, defined by the A4 event specified in the radio resource control protocol RRC.
3. The method for balancing mobile load combining reinforcement learning and supervised learning according to claim 1, wherein after the initial state is obtained in step 2.2, the reinforcement learning process is started, and the method includes the following steps:
step 2.2.1, selecting an action;
the action $a_t$ is defined as the cell offsets of the base stations:

$$a_t = \mu\!\left(s_t \mid \theta_a\right),\qquad a_t \in \mathcal{A}$$

where $s_t \in \mathcal{S}$ is the state of the system at time t, μ(·) is the deterministic policy of the estimated action network, and $\theta_a$ are the parameters of the estimated action network; the action space $\mathcal{A}$ of the agent is defined as

$$\mathcal{A} = \left\{\left(O_{c1}, O_{c2}, \ldots, O_{cN}\right)\right\}$$

where $O_{cj}$ is the cell offset value of the jth base station, $O_{cj} \in [O_{cmin}, O_{cmax}]$, and $O_{cmin}$ and $O_{cmax}$ are the lower and upper bounds of the cell offset, respectively; the cell offset values of the zth base station and the jth base station satisfy $O_{cz} - O_{cj} = O_{zj} = -O_{jz} = -(O_{cj} - O_{cz})$;
Step 2.2.2, the action $a_t$ interacts with the environment to obtain a reward value $r_t$, and the state $s_{t+1}$ of the system at the next time t+1 is observed; the reward $r_t$ is computed by a normalized reward function of the loads of all base stations, where $\rho_b^{t}$ is the load of serving base station b at time t and BS is the set of all base stations in the system;
step 2.2.3, calculate the time difference error $\delta_t$ of the state-action pair at time t:

$$\delta_t = r_t + \gamma\, Q'\!\left(s_{t+1}, \mu'\!\left(s_{t+1}\mid\theta_a'\right)\mid\theta_c'\right) - Q\!\left(s_t, a_t\mid\theta_c\right)$$

where Q(·) is the evaluation value output by the estimated evaluation network and $\theta_c$ are the parameters of the estimated evaluation network; Q'(·) is the evaluation value output by the target evaluation network, $s_{t+1}$ is the state of the system at time t+1, μ'(·) is the deterministic policy of the target action network, $\theta_a'$ and $\theta_c'$ are the parameters of the target action network and the target evaluation network, respectively, and γ is the discount factor;
step 2.2.4, store the sample $(s_t, a_t, r_t, s_{t+1})$ together with its initial priority in the prioritized experience pool, which is constructed as a binary tree; the initial sample priority is uniformly set to 1, and as the number of iterations increases, the initial priority of a newly stored sample is set to the maximum value obtained by comparison with $\delta_t$; then the state $s_t$ of the current system at time t transitions to the state $s_{t+1}$ at the next time t+1, and all experienced states are stored; if the number of samples in the prioritized experience pool has reached the mini-batch size K, go to step 2.2.5, otherwise go to step 2.2.1;
2.2.5, sampling K samples from the priority experience pool according to the sample priority, wherein the samples with higher priorities have higher sampling probability;
step 2.2.6, for the sampled K samples, the probability that the ith sample is sampled is calculated according to the following formula:
$$P(i) = \frac{p_i^{\alpha'}}{\sum_{m} p_m^{\alpha'}}$$

where m denotes the sample index, $p_m$ is the priority of the mth sample, α' is the sampling priority impact factor, $p_i = |\delta_i| + \epsilon$ is the priority of the ith sample, $\delta_i$ is the time difference error of the ith sample, and ε is a small positive constant;
step 2.2.7, calculate the normalized importance sampling weight of the ith sample: $w_i = \left(P(i)/P_{\min}\right)^{-\beta}$, where $P_{\min}$ is the smallest sampling probability among all samples and β is a gradually changing convergence factor;
step 2.2.8, calculate the corrected reward value of the ith sample: $r_i' = r_i + \eta\, R\!\left(s_i, a_i \mid \theta_r\right)$, where η is a bias factor, $r_i$ is the reward value contained in the ith sample, R(·) is the mapping of the estimated reward function prediction network, $s_i$ is the state contained in the ith sample, $a_i$ is the action contained in the ith sample, and $\theta_r$ are the parameters of the estimated reward function prediction network;
step 2.2.9, use the corrected reward value $r_i'$ to update $\delta_i$ and $p_i$, and update the priorities of the K sampled samples according to $p_i$;
step 2.2.10, update the parameters $\theta_c$ of the estimated evaluation network; the process is as follows: using the mini-batch gradient descent method, the loss function of the estimated evaluation network is

$$L(\theta_c) = \frac{1}{K}\sum_{i=1}^{K} w_i\left(r_i' + \gamma\, Q'\!\left(s_i', \mu'\!\left(s_i'\mid\theta_a'\right)\mid\theta_c'\right) - Q\!\left(s_i, a_i\mid\theta_c\right)\right)^2$$

where $s_i'$ is the next-time state contained in the ith sample, μ'(·) is the deterministic policy of the target action network, Q'(·) is the evaluation value output by the target evaluation network, and Q(·) is the evaluation value output by the estimated evaluation network; the gradient $\nabla_{\theta_c} L(\theta_c)$ is back-propagated to update the parameters $\theta_c$ of the estimated evaluation network; after the update is completed, go to step 2.2.11;
step 2.2.11, update the parameters $\theta_a$ of the estimated action network; the process is as follows: the gradient of the loss function of the estimated action network is

$$\nabla_{\theta_a} J \approx \frac{1}{K}\sum_{i=1}^{K} \nabla_{a} Q\!\left(s, a\mid\theta_c\right)\Big|_{s=s_i,\,a=\mu(s_i\mid\theta_a)}\; \nabla_{\theta_a}\mu\!\left(s\mid\theta_a\right)\Big|_{s=s_i}$$

where μ(·) is the deterministic policy of the estimated action network; the gradient $\nabla_{\theta_a} J$ is back-propagated to update the parameters $\theta_a$ of the estimated action network; after the update is completed, go to step 2.2.12;
step 2.2.12, update the parameters $\theta_r$ of the estimated reward function prediction network; the process is as follows: using the mini-batch gradient descent method, the loss function of the estimated reward function prediction network is

$$L(\theta_r) = \frac{1}{K}\sum_{i=1}^{K} \left(r_i + \gamma\, R'\!\left(s_i', \mu'\!\left(s_i'\mid\theta_a'\right)\mid\theta_r'\right) - R\!\left(s_i, a_i\mid\theta_r\right)\right)^2$$

where R(·) is the mapping of the estimated reward function prediction network, R'(·) is the mapping of the target reward function prediction network, μ'(·) is the deterministic policy of the target action network, and $\theta_r'$ are the parameters of the target reward function prediction network; the gradient $\nabla_{\theta_r} L(\theta_r)$ is back-propagated to update the parameters $\theta_r$ of the estimated reward function prediction network; after the update is completed, go to step 2.2.13;
step 2.2.13, soft update the parameters of the target action network, the target evaluation network and the target reward function prediction network:

$$\theta_c' \leftarrow \tau\,\theta_c + (1-\tau)\,\theta_c'$$

$$\theta_a' \leftarrow \tau\,\theta_a + (1-\tau)\,\theta_a'$$

$$\theta_r' \leftarrow \tau\,\theta_r + (1-\tau)\,\theta_r'$$

where $\theta_c'$, $\theta_a'$ and $\theta_r'$ are the parameters of the target evaluation network, the target action network and the target reward function prediction network, respectively, $\theta_c$, $\theta_a$ and $\theta_r$ are the parameters of the estimated evaluation network, the estimated action network and the estimated reward function prediction network, respectively, and τ is the soft update parameter.
4. The method for mobile load balancing with reinforcement learning and supervised learning combined according to claim 3, wherein in step 2.2.2, the reward value is in the range of [ -1,1 ].
CN202110689823.XA 2021-06-22 2021-06-22 Mobile load balancing method combining reinforcement learning and supervised learning Active CN113365312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110689823.XA CN113365312B (en) 2021-06-22 2021-06-22 Mobile load balancing method combining reinforcement learning and supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110689823.XA CN113365312B (en) 2021-06-22 2021-06-22 Mobile load balancing method combining reinforcement learning and supervised learning

Publications (2)

Publication Number Publication Date
CN113365312A true CN113365312A (en) 2021-09-07
CN113365312B CN113365312B (en) 2022-10-14

Family

ID=77535530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110689823.XA Active CN113365312B (en) 2021-06-22 2021-06-22 Mobile load balancing method combining reinforcement learning and supervised learning

Country Status (1)

Country Link
CN (1) CN113365312B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114598655A (en) * 2022-03-10 2022-06-07 东南大学 Mobility load balancing method based on reinforcement learning
CN114666840A (en) * 2022-03-28 2022-06-24 东南大学 Load balancing method based on multi-agent reinforcement learning
CN114675977A (en) * 2022-05-30 2022-06-28 山东科华电力技术有限公司 Distributed monitoring and transportation and management system based on power internet of things
CN115314379A (en) * 2022-08-10 2022-11-08 汉桑(南京)科技股份有限公司 Method, system, device and medium for configuring equipment parameters
CN115395993A (en) * 2022-04-21 2022-11-25 东南大学 Reconfigurable intelligent surface enhanced MISO-OFDM transmission method
CN115514614A (en) * 2022-11-15 2022-12-23 阿里云计算有限公司 Cloud network anomaly detection model training method based on reinforcement learning and storage medium
WO2023059106A1 (en) * 2021-10-06 2023-04-13 Samsung Electronics Co., Ltd. Method and system for multi-batch reinforcement learning via multi-imitation learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110493826A (en) * 2019-08-28 2019-11-22 重庆邮电大学 A kind of isomery cloud radio access network resources distribution method based on deeply study
CN111181618A (en) * 2020-01-03 2020-05-19 东南大学 Intelligent reflection surface phase optimization method based on deep reinforcement learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110493826A (en) * 2019-08-28 2019-11-22 重庆邮电大学 A kind of isomery cloud radio access network resources distribution method based on deeply study
CN111181618A (en) * 2020-01-03 2020-05-19 东南大学 Intelligent reflection surface phase optimization method based on deep reinforcement learning

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023059106A1 (en) * 2021-10-06 2023-04-13 Samsung Electronics Co., Ltd. Method and system for multi-batch reinforcement learning via multi-imitation learning
CN114598655A (en) * 2022-03-10 2022-06-07 东南大学 Mobility load balancing method based on reinforcement learning
CN114598655B (en) * 2022-03-10 2024-02-02 东南大学 Reinforcement learning-based mobility load balancing method
CN114666840A (en) * 2022-03-28 2022-06-24 东南大学 Load balancing method based on multi-agent reinforcement learning
CN115395993A (en) * 2022-04-21 2022-11-25 东南大学 Reconfigurable intelligent surface enhanced MISO-OFDM transmission method
CN114675977A (en) * 2022-05-30 2022-06-28 山东科华电力技术有限公司 Distributed monitoring and transportation and management system based on power internet of things
CN114675977B (en) * 2022-05-30 2022-08-23 山东科华电力技术有限公司 Distributed monitoring and transportation and management system based on power internet of things
CN115314379A (en) * 2022-08-10 2022-11-08 汉桑(南京)科技股份有限公司 Method, system, device and medium for configuring equipment parameters
CN115314379B (en) * 2022-08-10 2023-11-07 汉桑(南京)科技股份有限公司 Configuration method, system, device and medium for equipment parameters
CN115514614A (en) * 2022-11-15 2022-12-23 阿里云计算有限公司 Cloud network anomaly detection model training method based on reinforcement learning and storage medium
CN115514614B (en) * 2022-11-15 2023-02-24 阿里云计算有限公司 Cloud network anomaly detection model training method based on reinforcement learning and storage medium

Also Published As

Publication number Publication date
CN113365312B (en) 2022-10-14

Similar Documents

Publication Publication Date Title
CN113365312B (en) Mobile load balancing method combining reinforcement learning and supervised learning
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN110809306B (en) Terminal access selection method based on deep reinforcement learning
CN109947545B (en) Task unloading and migration decision method based on user mobility
CN111050330B (en) Mobile network self-optimization method, system, terminal and computer readable storage medium
CN111800828B (en) Mobile edge computing resource allocation method for ultra-dense network
CN113162679A (en) DDPG algorithm-based IRS (inter-Range instrumentation System) auxiliary unmanned aerial vehicle communication joint optimization method
CN109862610A (en) A kind of D2D subscriber resource distribution method based on deeply study DDPG algorithm
CN112887999B (en) Intelligent access control and resource allocation method based on distributed A-C
CN116456493A (en) D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm
WO2021083230A1 (en) Power adjusting method and access network device
CN113795050B (en) Sum Tree sampling-based deep double-Q network dynamic power control method
Attiah et al. Load balancing in cellular networks: A reinforcement learning approach
Elsayed et al. Deep reinforcement learning for reducing latency in mission critical services
CN109714786B (en) Q-learning-based femtocell power control method
CN110492955A (en) Spectrum prediction switching method based on transfer learning strategy
CN113395723B (en) 5G NR downlink scheduling delay optimization system based on reinforcement learning
CN114867030A (en) Double-time-scale intelligent wireless access network slicing method
Xu et al. Deep reinforcement learning based mobility load balancing under multiple behavior policies
CN110267274A (en) A kind of frequency spectrum sharing method according to credit worthiness selection sensing user social between user
CN114501667A (en) Multi-channel access modeling and distributed implementation method considering service priority
CN114598655A (en) Mobility load balancing method based on reinforcement learning
CN115412134A (en) Off-line reinforcement learning-based user-centered non-cellular large-scale MIMO power distribution method
CN114423070A (en) D2D-based heterogeneous wireless network power distribution method and system
CN111935777A (en) 5G mobile load balancing method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant