CN114051252B - Multi-user intelligent transmitting power control method in radio access network - Google Patents


Info

Publication number
CN114051252B
Authority
CN
China
Prior art keywords: wireless access device, power control, network, control strategy
Legal status: Active
Application number
CN202111145720.3A
Other languages
Chinese (zh)
Other versions
CN114051252A (en)
Inventor
张先超
赵耀
张庆华
Current Assignee
Jiaxing University
Original Assignee
Jiaxing University
Priority date
Filing date
Publication date
Application filed by Jiaxing University filed Critical Jiaxing University
Priority to CN202111145720.3A priority Critical patent/CN114051252B/en
Publication of CN114051252A publication Critical patent/CN114051252A/en
Application granted granted Critical
Publication of CN114051252B publication Critical patent/CN114051252B/en


Classifications

    • H04W24/02 Arrangements for optimising operational condition
    • H04W24/06 Testing, supervising or monitoring using simulated traffic
    • H04W52/146 Uplink power control
    • Y02D30/70 Reducing energy consumption in wireless communication networks


Abstract

The invention relates to a multi-user intelligent transmit power control method in a wireless access network, which comprises the following steps: modeling and analyzing the communication system of each wireless access device accessing the network to obtain the global channel state and global queue state of the wireless access devices; determining a power control strategy for each wireless access device based on a multi-agent Markov decision process; determining an optimization target model of the power control strategy according to the average uplink transmit power consumption and average uplink communication delay of the wireless access devices under the strategy; training the power control strategy with a multi-agent deep reinforcement learning method to obtain a trained strategy network; and having each wireless access device perform intelligent transmit power control according to the trained strategy network. The invention reduces the delay and power consumption of the whole uplink communication system, provides high-quality communication service with limited resources, and, thanks to its low complexity and distributed decision-making, has good realizability and scalability.

Description

Multi-user intelligent transmitting power control method in radio access network
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a method for controlling multi-user intelligent transmission power in a radio access network.
Background
With the rapid development of the mobile internet and artificial intelligence technology in recent years, intelligent wireless access devices such as smartphones, Augmented Reality (AR), and Virtual Reality (VR), and intelligent applications such as telemedicine, Industry 4.0, and autonomous driving have entered a stage of explosive growth. This means that a large number of wireless access devices will access the communication network, and the communication performance requirements of these intelligent wireless access devices are more severe and diverse than those of earlier mobile phones. To guarantee the communication service quality and experience of access users, the limited wireless communication resources must be configured reasonably. Among these resources, transmit power plays a direct and critical role: too little power naturally yields poor communication quality, while too much power introduces multi-user interference that also reduces communication quality; at the same time, the high power consumption of wireless access devices is itself a serious concern. The control of multi-user transmit power in future wireless access networks is therefore a key problem in the current wireless communication field.
However, current power control methods based on models and numerical optimization algorithms face problems such as difficult modeling, excessive algorithm complexity, and overly long solution times in future complex access networks, and must be re-optimized to adapt to new parameters whenever the environment changes, making them difficult to use for power control in practice. Therefore, an intelligent power control method is proposed herein that considers the complex channel environment and user demand queues and performs distributed intelligent control of the multi-user transmit power in the wireless access network based on multi-agent deep reinforcement learning, realizing high-quality communication service with low power consumption and low delay.
Disclosure of Invention
In view of the above analysis, the present invention aims to provide a method for controlling multi-user intelligent transmitting power in a wireless access network, which solves the problem that the prior art is difficult to be applied to a future wireless access network.
The technical scheme provided by the invention is as follows:
the invention discloses a multi-user intelligent transmitting power control method in a wireless access network, which comprises the following steps:
modeling and analyzing the communication system of each wireless access device accessing the network to obtain the global channel state and the global queue state of the wireless access devices;
determining a power control strategy for each wireless access device based on a multi-agent Markov decision process; determining an optimization target model of the power control strategy according to the average uplink transmit power consumption and the average uplink communication delay of the wireless access devices under the power control strategy;
training the power control strategy by using a multi-agent deep reinforcement learning method to obtain a trained strategy network;
and each wireless access device performs intelligent transmitting power control according to the trained strategy network.
Further, each wireless access device accessing the network performs uplink communication with a single base station in an OFDMA access mode, and the number of assignable OFDMA subcarriers is smaller than the number of wireless access devices; the OFDMA uses non-orthogonal multiplexing of subcarriers, so information of more than one wireless access device is carried on the same subcarrier.
Further, with the non-orthogonal multiplexing, the achievable data rate at which the base station receives wireless access device k on subcarrier m is:

$C_{k,m}(t) = \log_2\!\left(1 + \frac{H_{k,m}(t)\,P_{k,m}(t)}{\Gamma\left(\sum_{j \neq k} H_{j,m}(t)\,P_{j,m}(t) + N_0\right)}\right)$

where $H_{k,m}(t)$ is the channel state information of wireless access device k on subcarrier m at time t; $P_{k,m}(t)$ is the transmit power of wireless access device k on subcarrier m at time t; $H_{j,m}(t)$ is the channel state information of wireless access device j on subcarrier m at time t; $P_{j,m}(t)$ is the transmit power of wireless access device j on subcarrier m at time t; $\Gamma$ is the SINR gap due to the signal modulation and multiplexing mode; $N_0$ is the noise power.
Further, the queue dynamics of wireless access device k are:

$L_k(t+1) = \max\!\left(L_k(t) - \sum_{m=1}^{M} C_{k,m}(t),\, 0\right) + I_k(t)$

where $L_k(t)$ is the length of the queue to be transmitted of wireless access device k at time t; $I_k(t)$ is the amount of packet information arriving at wireless access device k at time t; $C_{k,m}(t)$ is the achievable data rate at which the base station receives wireless access device k on subcarrier m; M is the number of subcarriers.
Further, in step S2, based on the Markov decision process, wireless access device k selects an action $a_k$ according to its corresponding power control policy $\pi_k$; the system enters the next state S(t+1) according to the current state S(t) of the wireless access devices and the actions of all wireless access devices; at each state transition, every wireless access device obtains a corresponding reward function $r_k(t) = r(S(t), a_k(t), S(t+1))$ and an observation $o_k(t+1)$ of its new state. Under the power control policy, each wireless access device pursues maximization of its own long-term return

$R_k = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t\, r_k(t)\right]$

where $\gamma$ is the discount factor and T is the time horizon.
Further, according to the low-power-consumption and low-delay objectives, the optimization target model of the power control strategy formulates the multi-device transmit power control problem in the wireless access network as:

$\min_{\pi_k}\; \alpha_k \bar{P}_k^{\pi_k} + \beta_k \bar{D}_k^{\pi_k}, \quad \text{s.t. } 0 \le \sum_{m=1}^{M} P_{k,m}(t) \le P_{\max}$

where $\alpha_k$ and $\beta_k$ are the positive weights of the power consumption and delay of wireless access device k, respectively;

$\bar{P}_k^{\pi_k} = \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\sum_{m=1}^{M} P_{k,m}(t)\right]$ and $\bar{D}_k^{\pi_k} = \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} \frac{L_k(t)}{\lambda_k}\right]$

are, under control policy $\pi_k$, the average uplink transmit power consumption and the average uplink communication delay of wireless access device k; $P_{\max}$ is the maximum transmit power of the wireless access device; $P_{k,m}(t)$ is the transmit power of wireless access device k on subcarrier m at time t; M is the number of subcarriers.

The reward of each wireless access device in the optimization target model is:

$r_k(t) = -\frac{1}{K}\sum_{k=1}^{K}\left(\alpha_k \sum_{m=1}^{M} P_{k,m}(t) + \beta_k \frac{L_k(t)}{\lambda_k}\right)$

where K is the number of wireless access devices; $L_k(t)$ is the queue length of wireless access device k at time t; $\lambda_k$ is the average packet arrival rate of wireless access device k.
Further, the training the power control strategy by using the multi-agent deep reinforcement learning method comprises the following steps:
step S301, in each iteration round, operating the power control strategy of each wireless access device in the time length T; the central node of the wireless access network collects the actions, states and rewards of each wireless access device;
step S302, calculating estimated advantage values of all wireless access devices;
step S303, traversing all wireless access devices, wherein each wireless access device acquires channel state information in rewards and observation values of the wireless access device from the central node, acquires queue state information from the wireless access device, and combines the queue state information to obtain a final observation value of the wireless access device;
step S304, according to the final observed value, each wireless access device locally uses a gradient descent method to update the corresponding strategy parameters;
step S305, the central node updates the corresponding advantage function network parameters of each wireless access device by using a gradient descent method;
step S306, adding 1 to the round number, and starting the iterative training process from step S301 again;
after iteration is carried out to the maximum round times, the algorithm converges, and the trained strategy network is output.
Further, in step S302, the advantage function used to calculate a wireless access device's estimated advantage value is:

$\hat{A}_k(t) = \sum_{n=0}^{N-1} (\gamma\lambda)^n\, \delta_k(t+n), \qquad \delta_k(t+n) = r_k(t+n) + \gamma V_k(S(t+n+1);\phi_k) - V_k(S(t+n);\phi_k)$

where the time parameter n = 0, 1, 2, …, N−1, and N is the number of time points corresponding to the time horizon T; $\gamma, \lambda \in [0,1]$ are discount factors that balance estimation bias and variance; $V_k(S(t);\phi_k)$ is the centralized value function at the state S(t) at time t under the neural network parameters $\phi_k$ of wireless access device k; $r_k(t)$ is the reward of wireless access device k.
Further, in step S305, the minimization loss function with which the central node updates each wireless access device's advantage function network parameters by gradient descent is:

$\mathcal{L}(\phi_k) = \mathbb{E}\left[\left(\sum_{n=0}^{N-1} (\gamma\lambda)^n\, \delta_k(t+n)\right)^{2}\right]$
Further, in step S304, the objective function with which each wireless access device locally updates its corresponding policy parameters by the gradient method is:

$J(\theta_k) = \mathbb{E}\left[\min\left(l_k(t;\theta_k)\,\hat{A}_k(t),\; \mathrm{clip}\!\left(l_k(t;\theta_k),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_k(t)\right)\right]$

where $l_k(t;\theta_k)$ is the likelihood ratio between the new and old policies when adjusting the parameters $\theta_k$ of control policy $\pi_k$; $\mathrm{clip}(l_k(t;\theta_k), 1-\epsilon, 1+\epsilon)$ limits $l_k(t;\theta_k)$ to the interval $[1-\epsilon, 1+\epsilon]$; $\epsilon$ is the clipping tolerance; $\hat{A}_k(t)$ is the estimate of the advantage function.
The invention has the beneficial effects that:
the invention takes the requirement of a future wireless access network as a starting point, considers the environmental variability and complexity of the future wireless access network, provides a multi-user intelligent power control method, reduces the time delay and the power consumption of the whole uplink communication system, provides high-quality communication service by using limited resources, and has good realizability and expandability due to low complexity and distributed decision.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
Fig. 1 is a flowchart of a method for controlling intelligent transmission power of multiple users according to an embodiment of the present invention;
FIG. 2 is a framework diagram of multi-agent deep reinforcement learning in an embodiment of the invention;
FIG. 3 is a flowchart of a multi-agent proximal policy optimization method in an embodiment of the present invention;
FIG. 4 is a pseudo-code example graph of a multi-agent proximal policy optimization algorithm in an embodiment of the invention.
Detailed Description
Preferred embodiments of the present invention are described in detail below with reference to the attached drawing figures, which form a part of the present application and, together with the embodiments of the present invention, serve to explain the principles of the invention.
In this embodiment, the communication system takes uplink communication between a base station and terrestrial wireless access devices as an example: 50 wireless access devices are randomly placed in an area 1 km in diameter and communicate uplink with a single base station; the total available communication bandwidth is 10 MHz; the number of OFDMA available subcarriers is 20; the path loss of the communication channel is 120.9 + 37.6 log10(d) (in dB), where d is the distance between the transmitting and receiving ends; the Doppler frequency is set to 10 Hz; and the SINR gap is Γ = 7.5 dB. The average packet arrival rate is 4 Mbps, the maximum transmit power of the wireless access devices is 38 dBm, the total time horizon of 1 s is divided into 1000 time blocks, and the discount coefficients are γ = 0.98 and λ = 0.96, respectively. Training is run for 10000 iterations in total.
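For concreteness, the embodiment's simulation parameters can be gathered into a small configuration sketch; the Python names below are ours, not the patent's, and only the values come from the text:

```python
import math

# Illustrative configuration for the embodiment's simulation setup
# (key names are our own; values are those stated in the description).
CONFIG = {
    "num_devices": 50,         # K wireless access devices
    "area_diameter_km": 1.0,
    "bandwidth_mhz": 10.0,
    "num_subcarriers": 20,     # M (< K, so subcarriers are multiplexed)
    "doppler_hz": 10.0,
    "sinr_gap_db": 7.5,        # Gamma
    "arrival_rate_mbps": 4.0,  # lambda_k
    "max_tx_power_dbm": 38.0,  # P_max
    "time_blocks": 1000,       # 1 s horizon split into 1000 blocks
    "gamma": 0.98,
    "lambda": 0.96,
    "train_iterations": 10000,
}

def path_loss_db(d_km: float) -> float:
    """Path loss 120.9 + 37.6*log10(d) in dB, with d in km."""
    return 120.9 + 37.6 * math.log10(d_km)
```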
The implementation of the method requires that an environment simulation platform is firstly built (or in an actual environment) to train and learn the power control strategies of a plurality of wireless access devices. After the algorithm converges, the trained strategy is applied to the actual wireless access network, and the wireless access equipment is used as an intelligent agent for intelligent power control. Each agent makes intelligent power control decisions through the collected own user information (queue state information) and part of environment information (own channel state information). Thus, the multi-user high-quality communication service of the wireless access network with long-term low power consumption and low time delay is realized.
The disclosed method for controlling multi-user intelligent transmitting power in a wireless access network in this embodiment, as shown in fig. 1, includes the following steps:
step S101, modeling and analyzing the communication system of each wireless access device accessing the network to obtain the global channel state and the global queue state of the wireless access devices;
step S102, determining a power control strategy for each wireless access device based on a multi-agent Markov decision process; determining an optimization target model of the power control strategy according to the average uplink transmit power consumption and the average uplink communication delay of the wireless access devices under the power control strategy;
step S103, training the power control strategy by using a multi-agent deep reinforcement learning method to obtain a trained strategy network;
step S104, each wireless access device performs intelligent transmitting power control according to the trained strategy network.
The present embodiment optimizes the communication service quality of multiple users in the radio access network, so in step S101, modeling analysis is performed on the communication system of the radio access device, including:
1) Calculating the transmission rate of the wireless access equipment;
each wireless access device accessing to the network performs uplink communication with a single base station in an OFDMA access mode, and the number of allocable subcarriers of the OFDMA is smaller than the number of the wireless access devices; the OFDMA allows non-orthogonal multiplexing of carriers, and carries information of more than one wireless access device on the same subcarrier.
Specifically, in the communication system of this embodiment, K intelligent wireless access devices perform uplink communication with a single base station in an OFDMA access manner, and the number of allocable OFDMA subcarriers is M with M < K, to better simulate the future situation in which a large number of wireless access devices access the network. In addition, to further reduce queueing delay and improve spectrum utilization, non-orthogonal multiplexing of subcarriers is allowed here, meaning that more than one wireless access device may be carried on the same subcarrier. Let the transmit power of the k-th wireless access device on subcarrier m at time t be $P_{k,m}(t)$ and the transmitted signal be $x_{k,m}(t)$. The signal received by the base station on subcarrier m from the k-th wireless access device at time t can be expressed as:

$y_{k,m}(t) = h_{k,m}(t)\sqrt{P_{k,m}(t)}\, x_{k,m}(t) + z_{k,m}(t)$

where $h_{k,m}(t)$ is the complex channel coefficient between wireless access device k and the base station on subcarrier m at time t, and $z_{k,m}(t)$ is independent identically distributed complex Gaussian white noise with noise power $N_0$. Let

$H(t) = \{H_{k,m}(t)\}_{K \times M}$

denote the global channel state information (CSI), where $H_{k,m}(t) = |h_{k,m}(t)|^2$ is the instantaneous channel gain on subcarrier m between wireless access device k and the base station at time t. A Rayleigh fading channel model, common in wireless access networks, is adopted here; to characterize the dynamic behavior of the channel, the channel coefficient is expressed, following the Jakes fading model, as a first-order complex Gaussian Markov process:

$h_{k,m}(t+1) = \rho\, h_{k,m}(t) + \sqrt{1-\rho^2}\, e_{k,m}(t)$

where $h_{k,m}(t)$ and the channel update process $e_{k,m}(t)$ are independently identically distributed circularly symmetric complex Gaussian random variables with unit variance. The correlation coefficient is $\rho = J_0(2\pi f_d T)$, where $J_0(\cdot)$ is the zeroth-order Bessel function and $f_d$ is the maximum Doppler frequency.
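A minimal sketch of this first-order Gauss–Markov channel update follows; the function names are ours, the Bessel function is computed from its power series since only the Python standard library is assumed, and the 1 ms slot length is inferred from the 1 s horizon split into 1000 blocks:

```python
import math
import random

def bessel_j0(x: float, terms: int = 20) -> float:
    """Zeroth-order Bessel function of the first kind, via its power series."""
    return sum((-1) ** k * (x / 2) ** (2 * k) / math.factorial(k) ** 2
               for k in range(terms))

def channel_step(h: complex, rho: float, rng: random.Random) -> complex:
    """One step of the Jakes first-order complex Gaussian Markov process:
    h(t+1) = rho*h(t) + sqrt(1-rho^2)*e(t), with e ~ CN(0, 1)."""
    e = complex(rng.gauss(0.0, math.sqrt(0.5)), rng.gauss(0.0, math.sqrt(0.5)))
    return rho * h + math.sqrt(1.0 - rho ** 2) * e

# Correlation rho = J0(2*pi*f_d*T) for f_d = 10 Hz, assuming T = 1 ms slots
rho = bessel_j0(2 * math.pi * 10.0 * 0.001)
```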
Since subcarrier multiplexing is allowed here, the base station will receive signals from multiple terrestrial wireless access devices on one OFDMA resource block; for any one of them, the signals of the other wireless access devices are treated as noise, so the received signal rate of a wireless access device depends on the signal-to-interference-plus-noise ratio (SINR). Given the channel state information H(t) and transmit powers $P(t) = \{P_{k,m}(t)\}_{K \times M}$, the achievable data rate at which the base station receives wireless access device k on subcarrier m can be expressed as:

$C_{k,m}(t) = \log_2\!\left(1 + \frac{H_{k,m}(t)\,P_{k,m}(t)}{\Gamma\left(\sum_{j \neq k} H_{j,m}(t)\,P_{j,m}(t) + N_0\right)}\right)$

where $\Gamma$ is the SINR gap due to the signal modulation and multiplexing mode.
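Under the same notation, the per-subcarrier achievable rate can be computed as in this hedged sketch (names are ours; linear-scale inputs are assumed, with `gamma_lin` the SINR gap Γ in linear units and unit-bandwidth normalization):

```python
import math

def achievable_rate(H_m, P_m, k, gamma_lin, noise_power):
    """Rate at which the base station decodes device k on one subcarrier:
    C = log2(1 + H_k*P_k / (Gamma * (sum_{j != k} H_j*P_j + N0)))."""
    interference = sum(H_m[j] * P_m[j] for j in range(len(H_m)) if j != k)
    sinr = H_m[k] * P_m[k] / (gamma_lin * (interference + noise_power))
    return math.log2(1.0 + sinr)
```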
2) Modeling and analyzing the queue dynamics of the communication wireless access equipment;
In a radio access network, the aspect of communication service most directly perceived by a wireless access device user is communication delay, and user demand is represented at the bottom layer of communication by the size of the data packets. High-quality communication service therefore means realizing low-delay transmission while using communication resources efficiently, whatever the user demand. The ultimate purpose of continually increasing the communication rate is to satisfy users' large data-transmission demands more quickly; if a user's data demand is small, the power and communication rate can be reduced to save power consumption while also reducing interference to other users. Therefore, taking the delay performance index into account, the dynamic information of the data packet queue is modeled and analyzed.
Assume the data packets transmitted by the wireless access devices enter the queue to be transmitted randomly according to a Poisson process, and let the average packet arrival rate of wireless access device k be $\lambda_k$. Let $I(t) = (I_1(t), \ldots, I_K(t))$ be the amount of packet information arriving at the wireless access devices at time t, with mathematical expectation $\mathbb{E}[I_k(t)] = \lambda_k$. Let $L_k(t) \in [0, \infty)$ be the queue length of wireless access device k at time t, and let $L(t) = (L_1(t), \ldots, L_K(t)) \in [0, \infty)^K$ be the global queue state information (QSI). For wireless access device k, the queue dynamics can be expressed as:

$L_k(t+1) = \max\!\left(L_k(t) - \sum_{m=1}^{M} C_{k,m}(t),\, 0\right) + I_k(t)$
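The queue recursion can be sketched directly (argument names are ours): the served traffic is the sum of the device's rates over its subcarriers, the queue cannot go negative, and new arrivals are added afterwards.

```python
def queue_step(L_k, rates_k, I_k):
    """Queue update L_k(t+1) = max(L_k(t) - sum_m C_{k,m}(t), 0) + I_k(t).

    L_k      -- current queue length of device k
    rates_k  -- iterable of achievable rates C_{k,m}(t) over subcarriers m
    I_k      -- amount of newly arriving packet information
    """
    return max(L_k - sum(rates_k), 0.0) + I_k
```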
after the system environment, state models (i.e., CSI and QSI) are built in step S101, the power control strategy and optimization target models are designed in step S102, including:
1) Establishing a power control strategy model
Because both the wireless channel environment and the wireless access device queue dynamics have the Markov property, and a distributed control strategy is adopted here in which each wireless access device makes autonomous decisions from the partial state information it observes, the dynamic decision process is modeled as a multi-agent Markov decision process, i.e., a partially observed Markov game.
Specifically, let S = (H, L) be the global state, let $\mathcal{A}_k$ be the action set of wireless access device k, and let $o_k$ be the observation set of wireless access device k; it is assumed here that a wireless access device can observe its own channel state information $H_{k,m}(t)$ and queue state information $L_k(t)$. Wireless access device k selects an action according to a stochastic policy, $a_k(t) \sim \pi_k(a_k(t) \mid o_k(t))$, and the system then enters the next state according to the state transition function $S(t+1) \sim P(S(t+1) \mid S(t), a_1(t), \ldots, a_K(t))$. Each wireless access device obtains a corresponding reward function $r_k(t) = r(S(t), a_k(t), S(t+1))$ and an observation $o_k(t+1)$ of its new state. Each wireless access device pursues maximization of its own long-term return

$R_k = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t\, r_k(t)\right]$

where $\gamma$ is the discount coefficient and T is the time horizon.
2) Determining an optimization target model of the power control strategy according to the average uplink transmission power consumption and the average uplink communication time delay of the wireless access equipment under the power control strategy;
from the above model building we can further build specific targets and facing problems. First of all, the object of the invention is to reduce the communication power consumption of a radio access device, in a control strategy pi k The average uplink transmit power consumption of wireless access device k can be expressed as
Figure BDA0003285353840000102
In addition, communication delay of wireless access equipment is reduced, and control strategy pi is adopted k In the following, according to the littermate rule, the average uplink communication delay of radio access device k can be expressed as
Figure BDA0003285353840000103
Where T is the time range. According to the mathematical expression and the established low-power consumption and low-delay target, the problem of establishing multi-user intelligent transmitting power control in the wireless access network is as follows:
Figure BDA0003285353840000104
the objective of the problem is to minimize the weighted power consumption and the delay, alpha k And beta k Respectively, the power consumption and the time delay of the wireless access equipment are corresponding positive weight. According to the objective, defining rewards for each wireless access device as
Figure BDA0003285353840000105
Cooperation must be established between wireless access devices to achieve such team-type goals.
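A minimal sketch of this shared team reward, assuming the weighted power-plus-delay form described above (argument names are ours):

```python
def team_reward(P, L, lam, alpha, beta):
    """Shared reward r(t) = -(1/K) * sum_k (alpha_k * sum_m P_{k,m}(t)
                                             + beta_k * L_k(t) / lambda_k).

    P     -- list of per-device lists of transmit powers over subcarriers
    L     -- list of per-device queue lengths
    lam   -- list of per-device average arrival rates lambda_k
    alpha -- per-device power weights; beta -- per-device delay weights
    """
    K = len(L)
    return -sum(alpha[k] * sum(P[k]) + beta[k] * L[k] / lam[k]
                for k in range(K)) / K
```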
Specifically, in step S103, a multi-agent deep reinforcement learning method is applied to obtain an optimal power control policy of each wireless access device;
the multi-agent deep reinforcement learning technology applied in this embodiment is specifically a multi-agent proximity strategy optimization method, and the overall framework is centralized training and distributed execution, as shown in fig. 2, and the optimal power control strategy is obtained by performing multi-agent deep reinforcement learning based on an actor-arbiter algorithm.
To obtain the optimal power control policy, policy evaluation and policy improvement must be iterated continuously. In a Markov game with multiple agents, the value of a policy is determined by the global state value and the actions of every agent, so policy $\pi_k$ is evaluated in a centralized manner. To reduce the evaluation variance, a generalized advantage function is adopted here to evaluate the policy. Specifically, define the centralized value function of agent k following policy $\pi_k$ as $V^{\pi_k}(S(t)) = \mathbb{E}[R_k \mid S(t)]$ and its action-value function as $Q^{\pi_k}(S(t), a_k(t)) = \mathbb{E}[R_k \mid S(t), a_k(t)]$; the advantage function can then be expressed as

$A^{\pi_k}(S(t), a_k(t)) = Q^{\pi_k}(S(t), a_k(t)) - V^{\pi_k}(S(t)).$

In reality, the exact value of the advantage function cannot be obtained, and it must be estimated with a deep neural network. Setting the advantage function network parameters to $\phi = \{\phi_1, \ldots, \phi_K\}$, the estimate of the advantage function can be written as:

$\hat{A}_k(t) = \sum_{n=0}^{N-1} (\gamma\lambda)^n\, \delta_k(t+n) \qquad (8)$

where $\gamma, \lambda \in [0,1]$ are discount factors that balance estimation bias and variance, $\delta_k(t+n) = r_k(t+n) + \gamma V_k(S(t+n+1);\phi_k) - V_k(S(t+n);\phi_k)$ is the temporal-difference term, and n is a time parameter indicating the time point the policy has run to. Expanding (8) gives:

$\hat{A}_k(t) = \delta_k(t) + (\gamma\lambda)\,\delta_k(t+1) + \cdots + (\gamma\lambda)^{N-1}\,\delta_k(t+N-1) \qquad (9)$

The network parameters $\phi = \{\phi_1, \ldots, \phi_K\}$ are updated by minimizing the loss function:

$\mathcal{L}(\phi_k) = \mathbb{E}\left[\left(\hat{A}_k(t)\right)^{2}\right] \qquad (10)$

The above advantage function evaluation process is implemented at a central node (e.g., a wireless access point such as a base station).
With the advantage function required for policy evaluation in hand, the advantage values are sent back to each wireless access device for distributed policy improvement. The basic idea of the improvement is to adjust the policy parameters $\theta = \{\theta_1, \ldots, \theta_K\}$ to maximize the objective function $J(\theta_k) = \mathbb{E}[R_k]$. To improve training stability and prevent excessively large changes during policy training, the proximal policy optimization algorithm changes the objective function to:

$J(\theta_k) = \mathbb{E}\left[\min\left(l_k(t;\theta_k)\,\hat{A}_k(t),\; \mathrm{clip}\!\left(l_k(t;\theta_k),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_k(t)\right)\right] \qquad (11)$

where the likelihood ratio between the new and old policies is

$l_k(t;\theta_k) = \frac{\pi_{\theta_k}(a_k(t) \mid o_k(t))}{\pi_{\theta_k^{\mathrm{old}}}(a_k(t) \mid o_k(t))}$

and $\mathrm{clip}(l_k(t;\theta_k), 1-\epsilon, 1+\epsilon)$ limits $l_k(t;\theta_k)$ to the interval $[1-\epsilon, 1+\epsilon]$, with $\epsilon$ the clipping tolerance. The policy improvement requires only each wireless access device's own partial observation and can therefore be executed at the wireless access device.
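The clipped surrogate of formula (11) can be illustrated per sample as follows — a sketch of the standard PPO clipping, not the patent's exact network code:

```python
def clipped_objective(ratio, advantage, eps):
    """Per-sample PPO surrogate: min(l*A, clip(l, 1-eps, 1+eps)*A).

    ratio     -- likelihood ratio l between new and old policies
    advantage -- advantage estimate A_hat
    eps       -- clipping tolerance epsilon
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```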
More specifically, in the communication system, the multi-agent proximal policy optimization method implemented based on the actor-critic framework is shown in fig. 3; the method specifically comprises the following steps:
step S301, in each iteration round, the power control strategy of each wireless access device is operated within the time length T
Figure BDA0003285353840000124
The central node collects the actions, states and rewards of each wireless access device to obtain { S (t), a } 1 (t),…,a K (t), r (t); wherein the initial power control strategy is a random strategy;
the center node is a base station or other wireless access equipment serving as the center node;
step S302, calculating the estimated advantage values of all wireless access devices;
the advantage function used to calculate a wireless access device's estimated advantage value is formula (9);
step S303, traversing all wireless access devices, wherein each wireless access device acquires channel state information in rewards and observation values of the wireless access device from a central node, acquires queue state information from the wireless access device, and combines the queue state information to obtain a final observation value of the wireless access device;
step S304, according to the final observed value, each wireless access device locally updates its corresponding policy parameter θ by the gradient method;
the local update of each wireless access device follows the objective function of formula (11);
step S305, updating the corresponding dominance function network parameter phi of each wireless access device by using a gradient descent method at the central node;
wherein the gradient descent method used by the central node is performed according to the minimization loss function of formula (10);
step S306, adding 1 to the round number, and starting the iterative training process from step S301 again;
after the iteration reaches the maximum number of rounds, the algorithm has converged, the training process ends, and the trained strategy network is output.
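The estimated dominance values of step S302 can be sketched as follows, assuming formula (9) follows the standard generalized-advantage recursion suggested by the surrounding text; the function and variable names are illustrative, not from the patent:

```python
def estimated_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Dominance (advantage) estimate in the style of formula (9):
    A_hat(t) = sum_{n=0}^{N-1} (gamma*lam)^n * delta(t+n), where
    delta(t) = r(t) + gamma * V(S(t+1)) - V(S(t)).
    `values` holds V(S(0)) .. V(S(T)), one entry longer than `rewards`."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):  # backward pass: A(t) = delta(t) + gamma*lam*A(t+1)
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# With gamma = lam = 1 and a flat value function, advantages reduce to reward-to-go:
print(estimated_advantages([1.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0))  # [2.0, 1.0]
```

Smaller λ shortens the horizon of the estimate, trading variance for bias, which is the role the text assigns to the two discount factors.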
Specifically, in step S104, when each wireless access device performs intelligent transmitting power control according to the trained strategy network, each wireless access device selects the optimal transmitting power to access the wireless communication network according to its own trained strategy network π_k(a_k(t)|o_k(t)), even in a complex, changing environment. At this point, centralized training is no longer performed, and intelligent decisions are made in a fully distributed manner.
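As a sketch of this fully distributed decision step, each device would simply sample its transmit power from its own trained strategy π_k(a|o); the names below are hypothetical, not from the patent:

```python
import random

def select_power(policy_probs, power_levels):
    """Distributed inference: a device samples its transmit power a ~ pi_k(a | o_k)."""
    return random.choices(power_levels, weights=policy_probs, k=1)[0]

# A degenerate (one-hot) strategy always picks the same power level:
print(select_power([0.0, 1.0, 0.0], [5.0, 7.0, 9.0]))  # 7.0
# A trained stochastic strategy samples among the allowed levels:
print(select_power([0.2, 0.5, 0.3], [0.0, 0.1, 0.2]))
```

No communication with the central node is needed at this stage, matching the "fully distributed" claim.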
As shown in fig. 4, this embodiment also provides a pseudo-code example of the whole multi-agent proximal policy optimization algorithm, which uses double-layer nested for statements to optimize the power control strategy of each network-accessing wireless access device.
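That double-layer nested loop can be sketched as the following runnable skeleton, with the environment and learners stubbed out; every function name here is an illustrative stand-in, not the patent's pseudo code:

```python
import random

def train(num_rounds, num_devices, T, step_env,
          estimate_advantages, update_policy, update_value):
    """Skeleton of the double-nested loop of steps S301-S306 (names illustrative)."""
    for episode in range(num_rounds):                  # S306: advance the round counter
        trajectory = []
        for t in range(T):                             # S301: run the strategies for T steps
            state, actions, rewards = step_env(t)      # central node collects (S, a_1..a_K, r)
            trajectory.append((state, actions, rewards))
        advantages = estimate_advantages(trajectory)   # S302: per-device dominance values
        for k in range(num_devices):                   # S303-S304: local strategy updates
            update_policy(k, trajectory, advantages[k])
        update_value(trajectory)                       # S305: central dominance-network update

# Stub environment and learners, just to show the control flow:
calls = {"policy": 0, "value": 0}
train(
    num_rounds=3, num_devices=2, T=4,
    step_env=lambda t: ("S", [0, 1], [random.random(), random.random()]),
    estimate_advantages=lambda traj: [[0.0] * len(traj)] * 2,
    update_policy=lambda k, traj, adv: calls.__setitem__("policy", calls["policy"] + 1),
    update_value=lambda traj: calls.__setitem__("value", calls["value"] + 1),
)
```

The outer loop iterates rounds and the inner loops iterate time steps and devices, which is the nesting the embodiment describes.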
In summary, the multi-user intelligent transmitting power control method in the wireless access network of this embodiment reduces the time delay and power consumption of the whole uplink communication system and provides high-quality communication service with limited resources; owing to its low complexity and distributed decision making, it has good realizability and scalability.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

Claims (1)

1. A multi-user intelligent transmitting power control method in a wireless access network is characterized by comprising the following steps:
modeling and analyzing a communication system of each wireless access device accessing the network to obtain a global channel state and a global queue state of the wireless access devices;
determining a power control strategy of each wireless access device based on a multi-entity Markov decision process; determining an optimization target model of the power control strategy according to the average uplink transmitting power consumption and the average uplink communication time delay of the wireless access devices under the power control strategy;
training the power control strategy by using a multi-agent deep reinforcement learning method to obtain a trained strategy network;
each wireless access device performs intelligent transmitting power control according to the trained strategy network;
each wireless access device accessing the network performs uplink communication with a single base station in an OFDMA access mode, wherein the number of assignable OFDMA subcarriers is smaller than the number of wireless access devices; the subcarriers are non-orthogonally multiplexed, and the information of more than one wireless access device is carried on the same subcarrier;
modeling and analyzing the communication system of each wireless access device accessing the network to obtain the global channel state and the global queue state of the wireless access devices, comprising the following steps:
1) Calculating the transmission rate of the wireless access equipment;
in the non-orthogonal multiplexing, the achievable data rate at which the base station receives wireless access device k on subcarrier m is:
$$C_{k,m}(t) = \log_2\!\left(1 + \frac{H_{k,m}(t)\,P_{k,m}(t)}{\Gamma\left(\sum_{j\neq k} H_{j,m}(t)\,P_{j,m}(t) + N_0\right)}\right)$$
wherein H_{k,m}(t) is the channel state information of wireless access device k on subcarrier m at time t; P_{k,m}(t) is the transmitting power of wireless access device k on subcarrier m at time t; H_{j,m}(t) is the channel state information of wireless access device j on subcarrier m at time t; P_{j,m}(t) is the transmitting power of wireless access device j on subcarrier m at time t; Γ is the SINR gap caused by the signal modulation and multiplexing mode; N_0 is the noise power;
2) Modeling and analyzing the queue dynamics of the communication wireless access equipment;
the queue dynamics of wireless access device k on subcarrier m are determined as:
Figure QLYQS_2
wherein I_k(t) is the length of the queue to be transmitted by wireless access device k at time t; C_{k,m}(t) is the achievable data rate at which the base station receives wireless access device k on subcarrier m; M is the number of subcarriers;
based on a Markov decision process, wireless access device k selects an action a_k according to its corresponding power control strategy π_k; the next state S(t+1) is entered according to the current state S(t) of the wireless access devices and the actions of all wireless access devices; at the time of the state transition, each wireless access device obtains a corresponding reward function r_k(t) = r(S(t), a_k(t), S(t+1)) and the observation o_k(t+1) of its new state; under the power control strategy, each wireless access device pursues maximization of its own long-term return
$$R_k = \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t\, r_k(t)\right]$$
Wherein gamma is a discount factor and T is a time length;
the optimization target model of the power control strategy establishes the multi-wireless-access-device transmitting power control problem in the wireless access network according to the low-power-consumption and low-delay targets, the transmitting power control problem being:
Figure QLYQS_4
wherein α_k and β_k are the positive weights corresponding to the power consumption and the time delay of the wireless access device, respectively;
Figure QLYQS_5
Figure QLYQS_6
are, under control strategy π_k, the average uplink transmitting power consumption and the average uplink communication time delay of wireless access device k; P_max is the maximum transmitting power of the wireless access device; P_{k,m}(t) is the transmitting power of wireless access device k on subcarrier m at time t; M is the number of subcarriers;
the rewards for each wireless access device in the optimization objective model are:
Figure QLYQS_7
wherein K is the number of wireless access devices; L_k(t) is the queue dynamics of wireless access device k on subcarrier m; λ_k is the average arrival rate of data packets of wireless access device k;
the process of training the power control strategy by using the multi-agent deep reinforcement learning method comprises the following steps:
step S301, in each iteration round, operating the power control strategy of each wireless access device in the time length T; the central node of the wireless access network collects the actions, states and rewards of each wireless access device;
step S302, calculating estimated dominance values of all wireless access devices;
step S303, traversing all wireless access devices, wherein each wireless access device acquires its reward and the channel state information of its observation from the central node, acquires queue state information locally, and combines them to obtain its final observation value;
step S304, according to the final observed value, each wireless access device locally uses a gradient descent method to update the corresponding strategy parameters;
step S305, the central node updates the corresponding dominant function network parameters of each wireless access device by using a gradient descent method;
step S306, adding 1 to the round number, and starting the iterative training process from step S301 again;
after the iteration reaches the maximum number of rounds, the algorithm has converged, and the trained strategy network is output;
in step S302, the dominance function for calculating the estimated dominance value of the wireless access device is:
$$\hat{A}_k(t) = \sum_{n=0}^{N-1} (\gamma\lambda)^n\, \delta_k(t+n), \qquad \delta_k(t) = r_k(t) + \gamma\, V_k(S(t+1);\phi_k) - V_k(S(t);\phi_k)$$
wherein the time parameter n = 0, 1, 2, …, N−1; N−1 is the number of time points corresponding to the time length T; γ, λ ∈ [0, 1] are discount factors that balance the bias and variance of the estimate; V_k(S(t); φ_k) is the centralized cost function for the state S(t) of the wireless access devices at time t under the neural network parameters φ_k of wireless access device k; r_k(t) is the reward of wireless access device k;
in step S305, the minimization loss function according to which the central node updates the corresponding dominance function network parameters of each wireless access device with the gradient descent method is:
Figure QLYQS_9
in step S304, the objective function according to which each wireless access device locally updates the corresponding strategy parameter with the gradient descent method is:
$$J(\theta_k) = \mathbb{E}\left[\min\left(l_k(t;\theta_k)\,\hat{A}_k(t),\ \operatorname{clip}\left(l_k(t;\theta_k),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_k(t)\right)\right]$$
wherein l_k(t;θ_k) denotes the likelihood ratio between the new and old strategies when adjusting the parameter θ_k of control strategy π_k; clip(l_k(t;θ_k), 1−ε, 1+ε) denotes limiting l_k(t;θ_k) to the interval [1−ε, 1+ε]; ε is the error;
$\hat{A}_k(t)$ is the estimate of the dominance function.
CN202111145720.3A 2021-09-28 2021-09-28 Multi-user intelligent transmitting power control method in radio access network Active CN114051252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111145720.3A CN114051252B (en) 2021-09-28 2021-09-28 Multi-user intelligent transmitting power control method in radio access network


Publications (2)

Publication Number Publication Date
CN114051252A CN114051252A (en) 2022-02-15
CN114051252B true CN114051252B (en) 2023-05-26

Family

ID=80204660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111145720.3A Active CN114051252B (en) 2021-09-28 2021-09-28 Multi-user intelligent transmitting power control method in radio access network

Country Status (1)

Country Link
CN (1) CN114051252B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117135655A (en) * 2023-08-15 2023-11-28 华中科技大学 Intelligent OFDMA resource scheduling method, system and terminal of delay-sensitive WiFi
CN117412323A (en) * 2023-09-27 2024-01-16 华中科技大学 WiFi network resource scheduling method and system based on MAPPO algorithm

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112492691A (en) * 2020-11-26 2021-03-12 辽宁工程技术大学 Downlink NOMA power distribution method of deep certainty strategy gradient
CN112700663A (en) * 2020-12-23 2021-04-23 大连理工大学 Multi-agent intelligent signal lamp road network control method based on deep reinforcement learning strategy

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111786713B (en) * 2020-06-04 2021-06-08 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112492691A (en) * 2020-11-26 2021-03-12 辽宁工程技术大学 Downlink NOMA power distribution method of deep certainty strategy gradient
CN112700663A (en) * 2020-12-23 2021-04-23 大连理工大学 Multi-agent intelligent signal lamp road network control method based on deep reinforcement learning strategy

Also Published As

Publication number Publication date
CN114051252A (en) 2022-02-15

Similar Documents

Publication Publication Date Title
CN111666149B (en) Ultra-dense edge computing network mobility management method based on deep reinforcement learning
Mei et al. Intelligent radio access network slicing for service provisioning in 6G: A hierarchical deep reinforcement learning approach
CN110531617B (en) Multi-unmanned aerial vehicle 3D hovering position joint optimization method and device and unmanned aerial vehicle base station
CN113162682B (en) PD-NOMA-based multi-beam LEO satellite system resource allocation method
CN111800828B (en) Mobile edge computing resource allocation method for ultra-dense network
Shi et al. Drone-cell trajectory planning and resource allocation for highly mobile networks: A hierarchical DRL approach
CN114051252B (en) Multi-user intelligent transmitting power control method in radio access network
CN110113190A (en) Time delay optimization method is unloaded in a kind of mobile edge calculations scene
CN113162679A (en) DDPG algorithm-based IRS (inter-Range instrumentation System) auxiliary unmanned aerial vehicle communication joint optimization method
CN112118601A (en) Method for reducing task unloading delay of 6G digital twin edge computing network
Elnahas et al. Game theoretic approaches for cooperative spectrum sensing in energy-harvesting cognitive radio networks
CN110809306A (en) Terminal access selection method based on deep reinforcement learning
Xu et al. Multi-agent reinforcement learning based distributed transmission in collaborative cloud-edge systems
CN111526592B (en) Non-cooperative multi-agent power control method used in wireless interference channel
CN116456493A (en) D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm
Wang et al. Distributed reinforcement learning for age of information minimization in real-time IoT systems
CN109982434A (en) Wireless resource scheduling integrated intelligent control system and method, wireless communication system
Wang et al. Decentralized learning based indoor interference mitigation for 5G-and-beyond systems
CN115640131A (en) Unmanned aerial vehicle auxiliary computing migration method based on depth certainty strategy gradient
Elsayed et al. Deep reinforcement learning for reducing latency in mission critical services
Guan et al. An intelligent wireless channel allocation in HAPS 5G communication system based on reinforcement learning
Wu et al. 3D aerial base station position planning based on deep Q-network for capacity enhancement
Qi et al. Energy-efficient resource allocation for UAV-assisted vehicular networks with spectrum sharing
CN114885340B (en) Ultra-dense wireless network power distribution method based on deep migration learning
CN116260871A (en) Independent task unloading method based on local and edge collaborative caching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant