WO2021128805A1 - A wireless network resource allocation method based on generative adversarial reinforcement learning - Google Patents

A wireless network resource allocation method based on generative adversarial reinforcement learning

Info

Publication number
WO2021128805A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
value
resource allocation
generator
discriminator
Prior art date
Application number
PCT/CN2020/100753
Other languages
English (en)
French (fr)
Inventor
李荣鹏
华郁秀
马琳
张宏纲
Original Assignee
浙江大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江大学 filed Critical 浙江大学
Publication of WO2021128805A1 publication Critical patent/WO2021128805A1/zh
Priority to US17/708,059 priority Critical patent/US11452077B2/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/50Allocation or scheduling criteria for wireless resources
    • H04W72/53Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W28/00Network traffic management; Network resource management
    • H04W28/16Central resource management; Negotiation of resources or communication parameters, e.g. negotiating bandwidth or QoS [Quality of Service]

Definitions

  • The present invention relates to the technical field of wireless network resource allocation and reinforcement learning, and in particular to a wireless network resource allocation method based on generative adversarial reinforcement learning.
  • 5G networks will support a large number of diversified business scenarios from vertical industries, such as smart security, high-definition video, telemedicine, smart home, autonomous driving, and augmented reality. These business scenarios usually have different communication requirements: for example, augmented reality requires lower latency, while autonomous driving requires the network to provide higher reliability.
  • Traditional mobile networks are mainly designed to serve a single mobile broadband service and cannot adapt to the diversified business scenarios of future 5G. Building a dedicated physical network for each business scenario would inevitably lead to problems such as complex network operation and maintenance, high cost, and poor scalability.
  • Network slicing technology emerged to address these problems. Specifically, on a common physical network, network and computing resources can be divided into multiple slices to meet different requirements. This allows network tenants to orchestrate and configure different network slice instances according to specific requirements, thereby effectively reducing costs and improving network flexibility.
  • Radio Access Network (RAN) slicing faces several challenging technical problems in realizing real-time management of resources on existing slices: (a) for the RAN, spectrum is a scarce resource, so ensuring spectrum efficiency (SE) is essential; (b) the service level agreement (SLA) of slice tenants usually places strict requirements on the users' quality of experience (QoE); (c) the actual resource demand of each slice largely depends on the users' request patterns.
  • Reinforcement learning is a machine learning method dedicated to seeking optimal decisions.
  • The agent perceives the state of the environment and searches for the action that yields the largest cumulative return (this cumulative return is also called the action value).
  • Executing this action causes a change of state and yields an instant reward; the agent updates its estimate of the cumulative return (the action-value function), completing one learning step, and then enters the next round of training, iterating until the learning termination condition is met.
  • Generative adversarial networks were first used to generate images realistic enough to pass for real ones, and have since been adopted in many fields as data-generation tools.
  • A generative adversarial network consists of two neural networks, namely a generator network and a discriminator network.
  • Taking image generation as an example, the generator maps data sampled from Gaussian white noise to the space of real images to obtain generated "fake" images; the "fake" images and real images are then shuffled and fed to the discriminator network, which outputs the probability that a given image is real.
  • The goal of the generator is to produce images as realistic as possible so as to confuse the discriminator; the goal of the discriminator is to distinguish "fake" images from real images as accurately as possible.
  • The two neural networks are trained alternately and eventually reach a Nash equilibrium, at which point the images produced by the generator can no longer be distinguished from real images by the discriminator.
  • The purpose of the present invention is to provide a wireless network resource allocation method based on generative adversarial reinforcement learning.
  • Compared with traditional dedicated resource allocation methods, the proposed method is more efficient and flexible; compared with other reinforcement-learning-based methods, it can reduce the negative impact of interference factors in the communication environment and of the uncertainty of the instant reward. Therefore, using a generative adversarial reinforcement learning algorithm for wireless network resource allocation can greatly improve wireless network performance.
  • the present invention provides a wireless network resource allocation method based on generative adversarial reinforcement learning.
  • the method includes the following steps:
  • the initialization of the generator network G and the discriminator network D specifically includes the following sub-steps:
  • The generative adversarial reinforcement learning algorithm includes two neural networks, namely the generator network G and the discriminator network D.
  • The weights of the generator network G and the discriminator network D are randomly initialized from a Gaussian distribution.
  • A target network Ĝ is set up, whose structure is the same as that of the generator network G and whose weights are initialized by copying the weights of the generator network G.
  • The generator network G takes the network state s as input and outputs an N_a×N-dimensional vector, which is split sequentially into N_a vectors of dimension N.
  • The discriminator network D takes an N-dimensional vector as input; this vector is taken either from the output of the generator network G or computed from the output of the target network Ĝ together with the instant reward. The discriminator network D outputs a scalar indicating the authenticity of its input.
  • If the absolute value of the difference between the scalar and 0 is smaller than the absolute value of the difference between the scalar and 1, the discriminator network D judges that the input vector was taken from the output of the generator network G; if the absolute value of the difference between the scalar and 1 is smaller than the absolute value of the difference between the scalar and 0, the discriminator network D judges that the input vector was computed from the output of the target network Ĝ and the instant reward.
  • Here N is the number of samples drawn from Z(s, a); the p-th of the N_a N-dimensional vectors represents sampled values of the return distribution of the p-th action; Z(s, a) denotes the distribution of the cumulative return obtained by taking action a in network state s; the network state s is the number of requests of each service type within a time interval; the action a represents the bandwidth allocated to each service type; and N_a is the number of valid actions.
  • The radio resource manager obtains the observed value s_t of the network state s at the current time t and selects an action a_t using the ∈-greedy strategy; after executing action a_t, the radio resource manager receives a system return value J and observes the observed value s_{t+1} of the network state s at time t+1.
  • Selecting action a_t with the ∈-greedy strategy specifically includes: the radio resource manager draws a random number from the uniform distribution on (0, 1); if the random number is smaller than ∈, the radio resource manager randomly selects a valid action; if the random number is greater than or equal to ∈, the radio resource manager feeds s_t into the generator network G to obtain sampled values of the cumulative return distribution of each of the N_a actions, computes the mean of each action's sampled cumulative return distribution, and selects the action corresponding to the largest mean.
  • The radio resource manager stores the quadruple (s_t, a_t, r_t, s_{t+1}) in a buffer B of size N_B; if the space of B is full, the quadruple stored first in B is deleted before the newest quadruple is stored.
  • Each time the resource allocation of step 2 has been executed K times, the quadruples stored in B are used to train the weights of the generator network G and the discriminator network D.
  • N_s is the number of service types; G(s_m) denotes the N return samples obtained by taking action a_m under the observed network state s_t at the m-th time t, and is recorded as the sampled values of the estimated action-value distribution; y_i is the sampled value of the target action-value distribution; r_i is the instant reward; and γ is the discount factor.
  • The loss function L_D of the discriminator network D is L_D = (1/m) Σ_{i=1}^{m} [ D(G(s_i)) − D(y_i) + λ( ‖∇_{ŷ_i} D(ŷ_i)‖_2 − 1 )² ], where D(G(s_i)) is the output of the discriminator network D when the input is G(s_i); D(y_i) is the output of the discriminator network D when the input is y_i; ∇_{ŷ_i} D(ŷ_i) is the gradient of D(ŷ_i) with respect to the interpolated sample ŷ_i; and λ is the penalty factor. The weights of the discriminator network D are then trained with a gradient descent algorithm, completing one round of training of the discriminator network D.
  • The loss function L_G of the generator network G is L_G = −(1/m) Σ_{i=1}^{m} D(G(s_i)).
  • After step (3) has been executed N_train times, the training of the discriminator network D and the generator network G is complete.
  • The radio resource manager inputs the current network state into the generator network G, which outputs samples of the cumulative return distribution corresponding to each resource allocation strategy; the mean of the return samples is computed for each strategy, and the action with the largest mean is taken as the resource allocation strategy adopted by the radio resource manager.
  • the discount factor ⁇ is 0.75 to 0.9.
  • the value of N is 30-55.
  • The initial value of ∈ is 0.9; every 100 executions of the resource allocation step (2), ∈ is reduced by 0.05, and it remains unchanged once it reaches 0.05; ξ is 0.8 to 1.5.
  • The size N_B of the buffer B is 3000 to 10000.
  • n_d is 1 to 10; the number m of quadruples is 32 or 64.
  • the penalty factor ⁇ is 10, 20 or 30.
  • the gradient descent algorithms used for training the generator network G and the discriminator network D are both Adam, and the learning rate is both 0.001.
  • the number of times K of executing resource allocation is 10-50.
  • the value of N train is 2000-3000.
  • the present invention discloses the following technical effects:
  • The present invention uses a reinforcement learning method to estimate the distribution of action values. Compared with traditional methods that estimate the expected action value, the proposed learning method has better stability and adaptability, and enables the radio resource manager to learn the optimal resource allocation strategy from a system environment subject to noise interference and randomness.
  • the present invention adopts a method of alternating training of two neural networks of generator and discriminator to learn the distribution of action values. Compared with the traditional method of learning the distribution of random variables, the present invention does not require any prior assumptions about the distribution of action values.
  • the resource allocation strategy obtained by the present invention can obtain a higher system return value, that is, higher spectrum efficiency and better user experience.
  • FIG. 1 is a flowchart of the wireless network resource allocation method based on generative adversarial reinforcement learning according to an embodiment of the present invention.
  • FIG. 2 shows, for an embodiment of the present invention, how the system return values of the method of the present invention, the DQN-based resource allocation algorithm, and the even-allocation method change during wireless resource allocation when the data packet size of the ultra-reliable low-latency service is drawn uniformly from {6.4, 12.8, 19.2, 25.6, 32} Kbyte.
  • FIG. 3 shows the same comparison when the data packet size of the ultra-reliable low-latency service is drawn uniformly from {0.3, 0.4, 0.5, 0.6, 0.7} Mbyte.
  • The purpose of the present invention is to provide a wireless network resource allocation method based on generative adversarial reinforcement learning.
  • Compared with traditional dedicated resource allocation methods, the proposed method is more efficient and flexible; compared with other reinforcement-learning-based methods, it can reduce the negative effects of interference factors in the communication environment and of the uncertainty of instant rewards. Therefore, using a generative adversarial reinforcement learning algorithm for wireless network resource allocation can greatly improve wireless network performance.
  • Fig. 1 is a flowchart of the wireless network resource allocation method based on generative adversarial reinforcement learning according to the present invention, which specifically includes the following steps:
  • the initialization of the generator network G and the discriminator network D specifically includes the following sub-steps:
  • The generative adversarial reinforcement learning algorithm includes two neural networks, denoted the generator network G and the discriminator network D.
  • The weights of the generator network G and the discriminator network D are each randomly initialized from a Gaussian distribution.
  • A target network Ĝ is set up, whose structure is exactly the same as that of the generator network G and whose weights are initialized by copying the weights of the generator network G.
  • The discriminator network D takes an N-dimensional vector as input, which is taken either from the output of the generator network G or computed from the output of the target network Ĝ and the instant reward r, and outputs a scalar through a fully connected neural network.
  • This scalar indicates the authenticity of the input: if the absolute value of the difference between the scalar and 0 is smaller than the absolute value of the difference between the scalar and 1, the discriminator network D judges that the input vector was taken from the output of the generator network G; if the absolute value of the difference between the scalar and 1 is smaller than the absolute value of the difference between the scalar and 0, the discriminator network D judges that the input vector was computed from the output of the target network Ĝ and the instant reward.
  • N is the number of samples drawn from Z(s, a); the p-th of the N_a N-dimensional vectors represents sampled values of the overall return distribution of the p-th action; Z(s, a) is the distribution of the overall return obtained by taking action a in network state s; the network state s is the number of requests of each service type within a time interval; the action a represents the bandwidth allocated to each service type; and N_a is the number of valid actions.
  • the wireless resource manager obtains the observed value s t of the network state s at the current time t.
  • The radio resource manager selects an action a_t using the ∈-greedy strategy; after executing a_t, it receives the system return value J and observes the value s_{t+1} of the network state at time t+1.
  • Selecting a_t with the ∈-greedy strategy specifically includes: the radio resource manager draws a random number from the uniform distribution on (0, 1); if the random number is smaller than ∈, it randomly selects a valid action; if the random number is greater than or equal to ∈, it feeds s_t into the generator network G, obtains the sampled values of the overall return distribution of each of the N_a actions, computes the mean of each action's sampled cumulative return distribution, and selects the action with the largest mean; the action taken in this step by the radio resource manager is denoted a_t.
  • After action a_t is executed, the radio resource manager receives the system return value J and observes the value s_{t+1} of the network state at time t+1. The initial value of ∈ is 0.9; every 100 runs of the resource allocation step (2), ∈ decreases by 0.05, and it remains unchanged once it reaches 0.05.
  • The radio resource manager stores the quadruple (s_t, a_t, r_t, s_{t+1}) in a buffer B of size N_B, where N_B is 3000 to 10000; too small an N_B undermines the stability of the training process, while too large an N_B increases the amount of computation. If the space of B is full, the quadruple stored first in B is deleted before the newest quadruple is stored.
  • Each time the resource allocation of step 2 has been executed K times, where K is 10 to 50 (too small a K increases the amount of computation, while too large a K slows convergence), the quadruples stored in B are used to train the weights of the generator network G and the discriminator network D.
  • N_s is the number of service types; G(s_m) denotes the N return samples obtained by taking action a_m under the observed network state s_t at the m-th time t, and is recorded as the sampled values of the estimated action-value distribution; y_i is the sampled value of the target action-value distribution; r_i is the instant reward; γ is the discount factor, with a value of 0.75 to 0.9 (a γ that is too small or too large prevents the radio resource manager from taking the optimal action in any network state); ŷ_i is the weighted sum of the target action-value samples and the estimated action-value distribution samples; and i indexes the i-th of the m samples.
  • The loss function L_D of the discriminator network D is L_D = (1/m) Σ_{i=1}^{m} [ D(G(s_i)) − D(y_i) + λ( ‖∇_{ŷ_i} D(ŷ_i)‖_2 − 1 )² ], where D(G(s_i)) is the output of the discriminator network D when the input is G(s_i); D(y_i) is its output when the input is y_i; ∇_{ŷ_i} D(ŷ_i) is the gradient of D(ŷ_i) with respect to ŷ_i; and λ is the penalty factor, with a value of 10, 20 or 30. If λ is too small, the effect of the penalty term is weakened; if λ is too large, the discriminator network D converges prematurely, which is detrimental to the training of the generator network G. The weights of the discriminator network D are then trained with a gradient descent algorithm, completing one round of training of the discriminator network D.
  • After the discriminator network D has been trained n_d times, its latest weights are obtained and used in training the generator network G.
  • The value of n_d is 1 to 10; if n_d is too large, the discriminator network D converges prematurely, which is detrimental to the training of the generator network G.
  • The loss function L_G of the generator network G is L_G = −(1/m) Σ_{i=1}^{m} D(G(s_i)).
  • The gradient descent algorithm used to train both the generator network G and the discriminator network D is Adam, with a learning rate of 0.001; too small a learning rate slows convergence, while too large a learning rate makes the training process unstable.
  • After step (3) has been executed N_train times, where N_train is 2000 to 3000, the training of the discriminator network D and the generator network G is complete; if N_train is too small, the radio resource manager cannot take the optimal action in any network state, while too large an N_train increases the amount of computation.
  • The radio resource manager inputs the current network state vector s_t into the generator network G, which outputs samples of the cumulative return distribution corresponding to each resource allocation strategy; the mean of the return samples of each strategy is then computed, and the action corresponding to the largest mean is taken as the resource allocation strategy adopted by the radio resource manager.
  • the resources that need to be allocated are wireless bandwidth, the total bandwidth is 10M, and the granularity of the allocation is 1M, so there are 36 allocation strategies in total, that is, the number of effective actions is 36.
  • ∈ decreases by 0.05 every 100 runs of the algorithm, and remains unchanged once it reaches 0.05.
  • the size of the buffer area B NB is 10,000.
  • the input layer of the G network has 3 neurons, the first hidden layer has 512 neurons, the second hidden layer has 512 neurons, and the output layer has 1800 neurons.
  • the input layer of the D network has 50 neurons, the first hidden layer has 256 neurons, the second hidden layer has 256 neurons, and the output layer has 1 neuron.
  • the penalty factor ⁇ in the D network loss function is 30.
  • Figure 2 shows the changes in the system return value obtained by the three methods during the wireless resource allocation process. It can be seen from the figure that, as the number of iterations increases, the method proposed by the present invention has better stability. Note that in this simulation the data packet size of the ultra-reliable low-latency service is drawn uniformly from {6.4, 12.8, 19.2, 25.6, 32} KByte; because the packets are small, the performance requirements of the ultra-reliable low-latency service are easily met, so both the method proposed by the present invention and the DQN-based method achieve a high system return value.
  • Figure 3 shows the case in which the data packet size of the ultra-reliable low-latency service is drawn uniformly from {0.3, 0.4, 0.5, 0.6, 0.7} MByte.
  • Because the packets are large, the system return values obtained by all three methods decrease, but the system return value obtained by the method proposed by the present invention remains higher than that of the DQN-based method.
  • the discount factor ⁇ is set to 0.75 again, the number of samples N for the overall return distribution is 30, and the initial value of ⁇ is 0.9.
  • the algorithm is reduced by 0.05 every 100 runs, and remains unchanged when it is reduced to 0.05.
  • the size of the buffer area B NB is 3000.
  • the input layer of the G network has 3 neurons, the first hidden layer has 512 neurons, the second hidden layer has 512 neurons, and the output layer has 1080 neurons.
  • the input layer of the D network has 50 neurons, the first hidden layer has 256 neurons, the second hidden layer has 256 neurons, and the output layer has 1 neuron.
  • the penalty factor ⁇ in the loss function of the D network is 10.
  • the gradient descent algorithms used to train the G network and the D network are both Adam, and the learning rate is both 0.001.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The present invention discloses a wireless network resource allocation method based on generative adversarial reinforcement learning, belonging to the fields of wireless resource allocation and reinforcement learning. The method includes: initializing a generator network G and a discriminator network D, performing resource allocation, training the weights of the generator network G and the discriminator network D, and finally realizing wireless network resource allocation. Compared with a DQN-based resource allocation method and an even-allocation method, the resource allocation strategy obtained by the present invention achieves a higher system return value, i.e., higher spectrum efficiency and better user experience.

Description

A wireless network resource allocation method based on generative adversarial reinforcement learning
This application claims priority to the Chinese patent application No. 201911347500.1, entitled "A wireless network resource allocation method based on generative adversarial reinforcement learning", filed with the China National Intellectual Property Administration on December 24, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical fields of wireless network resource allocation and reinforcement learning, and in particular to a wireless network resource allocation method based on generative adversarial reinforcement learning.
Background
5G networks will support a large number of diversified business scenarios from vertical industries, such as smart security, high-definition video, telemedicine, smart home, autonomous driving, and augmented reality. These business scenarios usually have different communication requirements: for example, augmented reality requires lower latency, while autonomous driving requires the network to provide higher reliability. However, traditional mobile networks are mainly designed to serve a single mobile broadband service and cannot adapt to the diversified business scenarios of future 5G. Building a dedicated physical network for each business scenario would inevitably lead to problems such as complex network operation and maintenance, high cost, and poor scalability.
To solve the above problems, network slicing technology emerged. Specifically, on a common physical network, network and computing resources can be divided into multiple slices to meet different requirements. This allows network tenants to orchestrate and configure different network slice instances according to specific requirements, thereby effectively reducing costs and improving network flexibility.
To provide services with better performance and lower cost, Radio Access Network (RAN) slicing faces several challenging technical problems in realizing real-time management of resources on existing slices: (a) for the RAN, spectrum is a scarce resource, so ensuring spectrum efficiency (SE) is essential; (b) the service level agreement (SLA) of slice tenants usually places strict requirements on the users' quality of experience (QoE); (c) the actual resource demand of each slice largely depends on the users' request patterns.
Traditional dedicated resource allocation cannot solve these problems at the same time. It is therefore necessary to allocate spectrum resources to different slices dynamically and intelligently according to the users' service requests, so as to maintain a high SE while obtaining a satisfactory QoE.
On the other hand, reinforcement learning is a machine learning method dedicated to seeking optimal decisions. The agent perceives the state of the environment and searches for the action that yields the largest cumulative return (this cumulative return is also called the action value); executing this action causes a change of state and yields an instant reward; the agent updates its estimate of the cumulative return (the action-value function), completing one learning step, and then enters the next round of training, iterating until the learning termination condition is met.
However, traditional action-value-learning methods (such as the deep Q-network) struggle to cope with interference in the environment and the uncertainty of the instant reward. Distributional reinforcement learning was therefore introduced; its main change is to estimate the distribution of action values directly, instead of estimating the expectation of the action value as traditional methods do.
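For illustration only, the following minimal sketch shows the classical expectation-based action-value loop that the preceding paragraph describes; it is not part of the patent, and the environment interface (reset/step) and all names are assumptions. The method of the present invention replaces the scalar estimate Q(s, a) below with samples of the full return distribution Z(s, a).

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=100, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning: perceive state, act, receive an instant reward, update Q(s, a)."""
    Q = defaultdict(float)                                   # expected action value Q(s, a)
    for _ in range(episodes):
        s, done = env.reset(), False                         # env: hypothetical environment
        while not done:
            if random.random() < eps:                        # occasionally explore
                a = random.choice(actions)
            else:                                            # otherwise act greedily on Q
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)                    # instant reward r, next state
            target = r + gamma * max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])        # update the action-value estimate
            s = s_next
    return Q
```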
Generative adversarial networks were first used to generate images realistic enough to pass for real ones, and have since been adopted in many fields as data-generation tools. A generative adversarial network consists of two neural networks, a generator network and a discriminator network. Taking image generation as an example, the generator maps data sampled from Gaussian white noise to the space of real images to obtain generated "fake" images; the "fake" images and real images are then shuffled and fed to the discriminator network, which outputs the probability that a given image is real. The goal of the generator is to produce images as realistic as possible so as to confuse the discriminator; the goal of the discriminator is to distinguish "fake" images from real images as accurately as possible. The two networks are trained alternately and eventually reach a Nash equilibrium, at which point the images produced by the generator can no longer be distinguished from real images by the discriminator.
Summary of the Invention
In view of this, the purpose of the present invention is to provide a wireless network resource allocation method based on generative adversarial reinforcement learning. Compared with traditional dedicated resource allocation methods, the proposed method is more efficient and flexible; compared with other reinforcement-learning-based methods, it can reduce the negative impact of interference factors in the communication environment and of the uncertainty of the instant reward. Therefore, using a generative adversarial reinforcement learning algorithm for wireless network resource allocation can greatly improve wireless network performance.
To achieve the above purpose, the present invention provides a wireless network resource allocation method based on generative adversarial reinforcement learning, which includes the following steps:
(1) Initialization of the generator network G and the discriminator network D, which specifically includes the following sub-steps:
(1.1) The generative adversarial reinforcement learning algorithm includes two neural networks, namely the generator network G and the discriminator network D, whose weights are randomly initialized from a Gaussian distribution; at the same time, a target network Ĝ is set up, whose structure is identical to that of the generator network G and whose weights are initialized by copying the weights of the generator network G.
(1.2) The generator network G takes the network state s as input and outputs an N_a×N-dimensional vector, which is split sequentially into N_a vectors of dimension N. The discriminator network D takes an N-dimensional vector as input; this vector is taken either from the output of the generator network G or computed from the output of the target network Ĝ and the instant reward. The discriminator network D outputs a scalar indicating the authenticity of the input: if the absolute value of the difference between the scalar and 0 is smaller than the absolute value of the difference between the scalar and 1, the discriminator network D judges that the input vector was taken from the output of the generator network G; if the absolute value of the difference between the scalar and 1 is smaller than the absolute value of the difference between the scalar and 0, the discriminator network D judges that the input vector was computed from the output of the target network Ĝ and the instant reward.
Here N is the number of samples drawn from Z(s, a); the p-th of the N_a N-dimensional vectors represents sampled values of the distribution of the cumulative return of the p-th action; Z(s, a) is the distribution of the cumulative return obtained by taking action a in network state s; the network state s is the number of requests of each service type within a time interval; the action a represents the bandwidth allocated to each service type; and N_a is the number of valid actions.
(2) Performing resource allocation, which specifically includes the following sub-steps:
(2.1) The radio resource manager obtains the observed value s_t of the network state s at the current time t and selects an action a_t using the ∈-greedy strategy; after executing action a_t, the radio resource manager receives a system return value J and observes the observed value s_{t+1} of the network state s at time t+1.
Selecting action a_t with the ∈-greedy strategy specifically includes: the radio resource manager draws a random number from the uniform distribution on (0, 1); if the random number is smaller than ∈, the radio resource manager randomly selects a valid action; if the random number is greater than or equal to ∈, the radio resource manager feeds s_t into the generator network G, obtains sampled values of the cumulative return distribution of each of the N_a actions, then computes the mean of the sampled values of each action's cumulative return distribution, and selects the action corresponding to the largest mean.
(2.2) The radio resource manager sets two thresholds c_1 and c_2 (c_1 > c_2) and the absolute value ξ of a fixed instant reward, and specifies that when J > c_1, the instant reward at time t is r_t = ξ; when c_2 < J < c_1, the instant reward at time t is r_t = 0; and when J < c_2, the instant reward at time t is r_t = −ξ.
(2.3) The radio resource manager stores the quadruple (s_t, a_t, r_t, s_{t+1}) in a buffer B of size N_B; if the space of B is full, the quadruple stored first in B is deleted before the newest quadruple is stored.
(3) Each time the resource allocation of step 2 has been executed K times, the quadruples stored in B are used to train the weights of the generator network G and the discriminator network D.
(3.1) The discriminator network D is trained first, as follows:
m quadruples (s_t, a_t, r_t, s_{t+1}) are randomly selected from B as training data.
The observed values s_t of the network state at the m times t in the quadruples are combined into an m×N_s matrix [s_1, s_2, …, s_m]^T, where s_m denotes the observed network state s_t at the m-th time t; the combined matrix is input to the generator network G to obtain sampled values of the cumulative return distribution of every action under each of the m observed network states s_t, and the samples corresponding to a_1, a_2, …, a_m are retained and denoted G(s_1), G(s_2), …, G(s_m). N_s is the number of service types, and G(s_m) denotes the N return samples obtained by taking action a_m under the observed network state s_t at the m-th time t; G(s_m) is recorded as the sampled values of the estimated action-value distribution.
The observed values s_{t+1} of the network state at the m times t+1 in the training data are combined into an m×N_s matrix [s′_1, s′_2, …, s′_m]^T and input to the target network Ĝ to obtain sampled values of the cumulative return distribution of every action under each observed network state s_{t+1}; for each observed network state s_{t+1}, the sampled values of the action producing the largest mean cumulative return are retained and denoted Ĝ(s′_1), Ĝ(s′_2), …, Ĝ(s′_m), where s′_m denotes the observed network state s_{t+1} at the m-th time t+1.
The sampled values of the target action-value distribution are computed as y_i = r_i + γ·Ĝ(s′_i), where y_i is the sampled value of the target action-value distribution, r_i is the instant reward, and γ is the discount factor.
m samples are randomly drawn from the uniform distribution on (0, 1) and denoted ε_1, ε_2, …, ε_m, and the interpolated samples ŷ_i = ε_i·y_i + (1 − ε_i)·G(s_i) are formed, where ŷ_i is the weighted sum of the sampled values of the target action-value distribution and the sampled values of the estimated action-value distribution.
The loss function L_D of the discriminator network D is
L_D = (1/m) Σ_{i=1}^{m} [ D(G(s_i)) − D(y_i) + λ( ‖∇_{ŷ_i} D(ŷ_i)‖_2 − 1 )² ],
where D(G(s_i)) denotes the output of the discriminator network D when the input is G(s_i); D(y_i) denotes the output of the discriminator network D when the input is y_i; D(ŷ_i) denotes the output of the discriminator network D when the input is ŷ_i; ∇_{ŷ_i} D(ŷ_i) denotes the gradient of D(ŷ_i) with respect to ŷ_i; and λ is the penalty factor. The weights of the discriminator network D are then trained with a gradient descent algorithm, completing one round of training of the discriminator network D.
(3.2) After the discriminator network D has been trained n_d times, the latest weights of the discriminator network D are obtained and used in training the generator network G.
The loss function L_G of the generator network G is L_G = −(1/m) Σ_{i=1}^{m} D(G(s_i)), and a gradient descent algorithm is then applied to train the weights of the generator network G.
(3.3) Each time the above training procedures (3.1) and (3.2) have been completed C times, the weights of the generator network G are copied to the target network Ĝ, thereby updating the weights of the target network Ĝ.
(4) After step (3) has been executed N_train times, the training of the discriminator network D and the generator network G is complete. The radio resource manager inputs the current network state into the generator network G, which outputs samples of the cumulative return distribution corresponding to each resource allocation strategy; the mean of the return samples is computed for each strategy, and the action corresponding to the largest mean is taken as the resource allocation strategy adopted by the radio resource manager.
Optionally, the discount factor γ is 0.75 to 0.9.
Optionally, the value of N is 30 to 55.
Optionally, the initial value of ∈ is 0.9; every 100 executions of the resource allocation step (2), ∈ is reduced by 0.05, and it remains unchanged once it reaches 0.05; ξ is 0.8 to 1.5.
Optionally, the size N_B of the buffer B is 3000 to 10000.
Optionally, the value of n_d is 1 to 10, and the number m of quadruples is 32 or 64.
Optionally, the penalty factor λ is 10, 20 or 30.
Optionally, the gradient descent algorithm used to train both the generator network G and the discriminator network D is Adam, with a learning rate of 0.001.
Optionally, the number K of resource allocation executions is 10 to 50.
Optionally, the value of N_train is 2000 to 3000.
According to the specific embodiments provided by the present invention, the present invention discloses the following technical effects:
(1) The present invention uses a reinforcement learning method to estimate the distribution of action values. Compared with traditional methods that estimate the expected action value, the proposed learning method has better stability and adaptability, and enables the radio resource manager to learn the optimal resource allocation strategy from a system environment subject to noise interference and randomness.
(2) The present invention learns the distribution of action values by alternately training two neural networks, a generator and a discriminator. Compared with traditional methods for learning the distribution of a random variable, the present invention requires no prior assumption about the distribution of action values.
(3) Compared with methods that allocate resources based on traffic prediction results or allocate resources evenly, the resource allocation strategy obtained by the present invention achieves a higher system return value, i.e., higher spectrum efficiency and better user experience.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the wireless network resource allocation method based on generative adversarial reinforcement learning according to an embodiment of the present invention;
Fig. 2 shows, for an embodiment of the present invention, how the system return values of the method of the present invention, the DQN-based resource allocation algorithm, and the even-allocation method change during wireless resource allocation when the data packet size of the ultra-reliable low-latency service is drawn uniformly from {6.4, 12.8, 19.2, 25.6, 32} Kbyte;
Fig. 3 shows the same comparison when the data packet size of the ultra-reliable low-latency service is drawn uniformly from {0.3, 0.4, 0.5, 0.6, 0.7} Mbyte.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The purpose of the present invention is to provide a wireless network resource allocation method based on generative adversarial reinforcement learning. Compared with traditional dedicated resource allocation methods, the proposed method is more efficient and flexible; compared with other reinforcement-learning-based methods, it can reduce the negative impact of interference factors in the communication environment and of the uncertainty of the instant reward. Therefore, using a generative adversarial reinforcement learning algorithm for wireless network resource allocation can greatly improve wireless network performance.
To make the above objects, features and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of the wireless network resource allocation method based on generative adversarial reinforcement learning according to the present invention, which specifically includes the following steps:
(1) Initialization of the generator network G and the discriminator network D, which specifically includes the following sub-steps:
(1.1) The generative adversarial reinforcement learning algorithm includes two neural networks, denoted the generator network G and the discriminator network D, whose weights are each randomly initialized from a Gaussian distribution. To strengthen the convergence of the algorithm, a target network Ĝ is set up, whose structure is exactly the same as that of the generator network G and whose weights are initialized by copying the weights of the generator network G.
(1.2) The generator network G takes the reinforcement-learning network state s as input and outputs an N_a×N-dimensional vector, which is split sequentially into N_a vectors of dimension N. The value of N is 30 to 50; too small an N cannot adequately characterize the action-value distribution, while too large an N increases the amount of computation. The discriminator network D takes an N-dimensional vector as input, which is taken either from the output of the generator network G or computed from the output of the target network Ĝ and the instant reward r, and outputs a scalar through a fully connected neural network. This scalar indicates the authenticity of the input: if the absolute value of the difference between the scalar and 0 is smaller than the absolute value of the difference between the scalar and 1, the discriminator network D judges that the input vector was taken from the output of the generator network G; if the absolute value of the difference between the scalar and 1 is smaller than the absolute value of the difference between the scalar and 0, the discriminator network D judges that the input vector was computed from the output of the target network Ĝ and the instant reward.
Here N is the number of samples drawn from Z(s, a); the p-th of the N_a N-dimensional vectors represents sampled values of the overall return distribution of the p-th action; Z(s, a) is the distribution of the overall return obtained by taking action a in network state s; the network state s is the number of requests of each service type within a time interval; the action a represents the bandwidth allocated to each service type; and N_a is the number of valid actions.
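As an illustration of how such networks could be set up in PyTorch (the framework named in the embodiment below), the following sketch uses the layer widths of the first embodiment (3-512-512-1800 for G, 50-256-256-1 for D), Gaussian weight initialization, and a copied target network Ĝ. The class names, the initialization standard deviation, and other details are assumptions for illustration, not taken from the patent.

```python
import copy
import torch
import torch.nn as nn

N_S, N_A, N = 3, 36, 50          # service types, valid actions, samples per action

class Generator(nn.Module):
    """Maps a network state (N_S request counts) to N_A x N return-distribution samples."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_S, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, N_A * N),
        )
    def forward(self, s):                    # s: (batch, N_S)
        return self.net(s).view(-1, N_A, N)  # split into N_A vectors of length N

class Discriminator(nn.Module):
    """Scores an N-dimensional return-sample vector with a single scalar."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )
    def forward(self, x):                    # x: (batch, N)
        return self.net(x)

def gaussian_init(module, std=0.1):
    """Random Gaussian initialization of the linear-layer weights, as described above."""
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=std)
        nn.init.normal_(module.bias, mean=0.0, std=std)

G = Generator();  G.apply(gaussian_init)
D = Discriminator();  D.apply(gaussian_init)
G_target = copy.deepcopy(G)                  # target network starts as a copy of G
```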
(2) Performing resource allocation, which specifically includes the following sub-steps:
(2.1) The radio resource manager obtains the observed value s_t of the network state s at the current time t and selects an action a_t using the ∈-greedy strategy; after executing action a_t, the radio resource manager receives the system return value J and observes the value s_{t+1} of the network state s at time t+1.
Selecting action a_t with the ∈-greedy strategy specifically includes: the radio resource manager draws a random number from the uniform distribution on (0, 1); if the random number is smaller than ∈, the radio resource manager randomly selects a valid action; if the random number is greater than or equal to ∈, the radio resource manager feeds s_t into the generator network G, obtains the sampled values of the overall return distribution of each of the N_a actions, computes the mean of the sampled values of each action's cumulative return distribution, and selects the action with the largest mean; the action taken in this step by the radio resource manager is denoted a_t. After action a_t is executed, the radio resource manager receives the system return value J and observes the value s_{t+1} of the network state at time t+1. Here the initial value of ∈ is 0.9; every 100 runs of the resource allocation step (2), ∈ decreases by 0.05, and it remains unchanged once it reaches 0.05. A sketch of this selection rule is given below.
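A minimal sketch of the ∈-greedy selection just described, reusing the illustrative G defined above; the function names and the decay helper are assumptions.

```python
import random
import torch

def select_action(G, s_t, epsilon, n_actions=36):
    """Epsilon-greedy choice over the mean of each action's sampled return distribution."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                 # explore: random valid action
    with torch.no_grad():
        samples = G(torch.as_tensor(s_t, dtype=torch.float32).unsqueeze(0))  # (1, N_A, N)
        means = samples.mean(dim=-1).squeeze(0)            # mean sampled return per action
    return int(torch.argmax(means).item())                 # exploit: largest mean

def decay_epsilon(step, start=0.9, drop=0.05, every=100, floor=0.05):
    """Schedule from the embodiment: -0.05 every 100 allocations, floored at 0.05."""
    return max(floor, start - drop * (step // every))
```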
(2.2) The radio resource manager sets two thresholds c_1 and c_2 (c_1 > c_2) and the absolute value ξ of a fixed instant reward, and specifies that when J > c_1, the instant reward at time t is r_t = ξ; when c_2 < J < c_1, r_t = 0; and when J < c_2, r_t = −ξ. The value of ξ is 0.8 to 1.5; too small a ξ slows convergence, while too large a ξ undermines the stability of the training process. A sketch of this reward shaping follows.
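A direct sketch of the thresholding rule above; c1, c2 and the exact ξ are left as parameters because the text fixes only the range of ξ.

```python
def shape_reward(J, c1, c2, xi=1.0):
    """Map the raw system return J to the fixed instant reward r_t in {xi, 0, -xi}."""
    if J > c1:
        return xi
    if J < c2:
        return -xi
    return 0.0
```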
(2.3) The radio resource manager stores the quadruple (s_t, a_t, r_t, s_{t+1}) in a buffer B of size N_B, where N_B is 3000 to 10000; too small an N_B undermines the stability of the training process, while too large an N_B increases the amount of computation. If the space of B is full, the quadruple stored first in B is deleted before the newest quadruple is stored, as in the sketch below.
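One way to realize such a first-in-first-out buffer is sketched below with Python's deque; the class name and the sampling helper are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer B of (s_t, a_t, r_t, s_{t+1}) quadruples."""
    def __init__(self, capacity=10000):          # N_B, e.g. 3000 to 10000
        self.storage = deque(maxlen=capacity)    # oldest quadruple dropped when full
    def push(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))
    def sample(self, m):
        return random.sample(self.storage, m)    # m random quadruples for training
    def __len__(self):
        return len(self.storage)
```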
(3) Each time the resource allocation of step 2 has been executed K times, where K is 10 to 50 (too small a K increases the amount of computation, while too large a K slows convergence), the quadruples stored in B are used to train the weights of the generator network G and the discriminator network D.
(3.1) The discriminator network D is trained first, as follows:
m quadruples (s_t, a_t, r_t, s_{t+1}) are randomly selected from B as training data, where m is 32 or 64.
The observed values s_t of the network state at the m times t in the training data are combined into an m×N_s matrix [s_1, s_2, …, s_m]^T, where s_m denotes the observed network state s_t at the m-th time t, and the matrix is fed into the generator network G to obtain, for each of the m observed states, samples of the cumulative return distribution of every action; the samples corresponding to a_1, a_2, …, a_m are retained and denoted G(s_1), G(s_2), …, G(s_m). N_s is the number of service types, and G(s_m) denotes the N return samples obtained by taking action a_m under the observed network state s_t at the m-th time t; these are recorded as the sampled values of the estimated action-value distribution.
The observed values s_{t+1} of the network state at the m times t+1 in the training data are combined into an m×N_s matrix [s′_1, s′_2, …, s′_m]^T and fed into the target network Ĝ to obtain, for each observed state s_{t+1}, samples of the cumulative return distribution of every action; for each observed state s_{t+1}, the samples of the action with the largest mean overall return are retained and denoted Ĝ(s′_1), Ĝ(s′_2), …, Ĝ(s′_m), where s′_m denotes the observed network state s_{t+1} at the m-th time t+1.
The sampled values of the target action-value distribution are then computed as y_i = r_i + γ·Ĝ(s′_i), i = 1, …, m, where y_i is the sampled value of the target action-value distribution, r_i is the instant reward, and γ is the discount factor, with a value of 0.75 to 0.9; a γ that is too small or too large prevents the radio resource manager from taking the optimal action in any network state. The sketch after this paragraph illustrates the construction of these target samples.
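An illustrative sketch of building the estimated samples G(s_i) and the target samples y_i with the networks defined earlier. The Bellman-style form y_i = r_i + γ·Ĝ(s′_i) is the reconstruction used in the text above (the original formula is only an image in the published document), so the exact expression should be treated as an assumption.

```python
import torch

def build_samples(G, G_target, batch, gamma=0.9):
    """batch: list of (s, a, r, s_next) quadruples -> (G(s_i), y_i), each of shape (m, N)."""
    s      = torch.tensor([b[0] for b in batch], dtype=torch.float32)   # (m, N_S)
    a      = torch.tensor([b[1] for b in batch], dtype=torch.long)      # (m,)
    r      = torch.tensor([b[2] for b in batch], dtype=torch.float32)   # (m,)
    s_next = torch.tensor([b[3] for b in batch], dtype=torch.float32)   # (m, N_S)

    # Estimated action-value distribution samples: keep the row of the action a_i taken.
    g_s = G(s)[torch.arange(len(batch)), a]                             # (m, N)

    with torch.no_grad():
        all_next = G_target(s_next)                                     # (m, N_A, N)
        best = all_next.mean(dim=-1).argmax(dim=-1)                     # action with largest mean
        g_next = all_next[torch.arange(len(batch)), best]               # (m, N)
        y = r.unsqueeze(1) + gamma * g_next                             # target samples y_i

    return g_s, y
```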
m samples are drawn at random from the uniform distribution on (0, 1) and denoted ε_1, ε_2, …, ε_m, and the interpolated samples ŷ_i = ε_i·y_i + (1 − ε_i)·G(s_i) are formed, where ŷ_i is the weighted sum of the sampled values of the target action-value distribution and the sampled values of the estimated action-value distribution, and i indexes the i-th of the m samples.
The loss function L_D of the discriminator network D is
L_D = (1/m) Σ_{i=1}^{m} [ D(G(s_i)) − D(y_i) + λ( ‖∇_{ŷ_i} D(ŷ_i)‖_2 − 1 )² ],
where D(G(s_i)) is the output of the discriminator network D when the input is G(s_i); D(y_i) is the output of the discriminator network D when the input is y_i; D(ŷ_i) is the output of the discriminator network D when the input is ŷ_i; ∇_{ŷ_i} D(ŷ_i) is the gradient of D(ŷ_i) with respect to ŷ_i; and λ is the penalty factor, with a value of 10, 20 or 30. If λ is too small, the effect of the penalty term is weakened; if λ is too large, the discriminator network D converges prematurely, which is detrimental to the training of the generator network G. The weights of the discriminator network D are then trained with a gradient descent algorithm, completing one round of training of the discriminator network D; a sketch of such a training step follows.
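A sketch of one discriminator update under the gradient-penalty loss written above. Because the published formula is an image, the standard WGAN-GP form is assumed here; the function signature and optimizer handling are illustrative.

```python
import torch

def train_discriminator_step(D, g_s, y, opt_D, lam=10.0):
    """One gradient step on L_D = mean[D(G(s_i)) - D(y_i) + lam*(||grad D(y_hat_i)||_2 - 1)^2]."""
    g_fake = g_s.detach()                                   # generator output, frozen for this step
    eps = torch.rand(y.size(0), 1)                          # epsilon_i ~ U(0, 1)
    y_hat = (eps * y + (1.0 - eps) * g_fake).requires_grad_(True)   # interpolated samples

    d_fake, d_real, d_hat = D(g_fake), D(y), D(y_hat)
    grad = torch.autograd.grad(d_hat.sum(), y_hat, create_graph=True)[0]
    penalty = lam * (grad.norm(2, dim=1) - 1.0).pow(2).unsqueeze(1)  # gradient penalty term

    loss_D = (d_fake - d_real + penalty).mean()
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()
    return loss_D.item()
```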
(3.2) After the discriminator network D has been trained n_d times, the latest weights of the discriminator network D are obtained and used in training the generator network G. The value of n_d is 1 to 10; if n_d is too large, the discriminator network D converges prematurely, which is detrimental to the training of the generator network G.
The loss function L_G of the generator network G is L_G = −(1/m) Σ_{i=1}^{m} D(G(s_i)), and a gradient descent algorithm is then applied to train the weights of the generator network G.
The gradient descent algorithm used to train both the generator network G and the discriminator network D is Adam, with a learning rate of 0.001; too small a learning rate slows convergence, while too large a learning rate makes the training process unstable. A sketch of the generator update follows.
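An illustrative generator update matching the loss form written above (again a reconstruction of an un-reproduced formula, so an assumption); Adam with learning rate 0.001 is taken from the embodiment.

```python
import torch

opt_G = torch.optim.Adam(G.parameters(), lr=0.001)
opt_D = torch.optim.Adam(D.parameters(), lr=0.001)

def train_generator_step(G, D, s, a, opt_G):
    """One gradient step on L_G = -mean_i D(G(s_i)) for the actions a_i actually taken."""
    g_s = G(s)[torch.arange(s.size(0)), a]    # samples of the estimated distribution, (m, N)
    loss_G = -D(g_s).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_G.item()
```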
(3.3) Each time the above training procedures (3.1) and (3.2) have been completed C times, the weights of the generator network G are copied to the target network Ĝ, thereby updating the weights of the target network Ĝ. The value of C is 50 to 200; too small a C makes the training process unstable, while too large a C slows convergence. A sketch of this periodic update follows.
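The periodic copy of the generator weights to the target network Ĝ could look like the following in PyTorch; the step counter argument is an illustrative assumption.

```python
def maybe_sync_target(G, G_target, train_step, C=200):
    """Copy the generator weights to the target network every C completed training rounds."""
    if train_step % C == 0:
        G_target.load_state_dict(G.state_dict())
```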
(4) After step (3) has been executed N_train times, where N_train is 2000 to 3000 (too small an N_train prevents the radio resource manager from taking the optimal action in any network state, while too large an N_train increases the amount of computation), the training of the discriminator network D and the generator network G is complete. The radio resource manager inputs the current network state vector s_t into the generator network G, which outputs samples of the cumulative return distribution corresponding to each resource allocation strategy; the mean of the return samples of each strategy is then computed, and the action corresponding to the largest mean is taken as the resource allocation strategy adopted by the radio resource manager, as sketched below.
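Once training is finished, acting reduces to the greedy case of the selection rule sketched earlier; the helper below is an assumption for clarity.

```python
def allocate(G, s_t):
    """Greedy allocation after training: the action whose mean sampled return is largest."""
    return select_action(G, s_t, epsilon=0.0)
```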
A simulation environment was written in Python on a host configured as shown in Table 1, and three different service types (voice call, video, and ultra-reliable low-latency service) were used as an example for testing. The resource to be allocated is the wireless bandwidth: the total bandwidth is 10 M and the allocation granularity is 1 M, so there are 36 allocation strategies in total, i.e., the number of valid actions is 36. The discount factor γ is set to 0.9, the number of samples N drawn from the overall return distribution is 50, and the initial value of ∈ is 0.9, decreasing by 0.05 every 100 runs of the algorithm and remaining unchanged once it reaches 0.05. The size N_B of the buffer B is 10000. The input layer of the G network has 3 neurons, its first hidden layer has 512 neurons, its second hidden layer has 512 neurons, and its output layer has 1800 neurons. The input layer of the D network has 50 neurons, its first hidden layer has 256 neurons, its second hidden layer has 256 neurons, and its output layer has 1 neuron. The penalty factor λ in the loss function of the D network is 30. The gradient descent algorithm used to train both the G network and the D network is Adam, with a learning rate of 0.001. The other parameters are ξ = 1.5, K = 50, n_d = 5, m = 64, and C = 200. The enumeration of the 36 valid actions is sketched after this paragraph.
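With a total bandwidth of 10 M, a granularity of 1 M and three slices each receiving at least 1 M, there are C(9, 2) = 36 allocations, which matches the 36 valid actions stated above; the sketch below enumerates them (the at-least-1 M condition is inferred from the count of 36 and is therefore an assumption).

```python
from itertools import product

def enumerate_actions(total=10, granularity=1, n_slices=3):
    """All ways to split the total bandwidth among the slices, each slice getting >= 1 unit."""
    units = total // granularity
    return [alloc for alloc in product(range(1, units + 1), repeat=n_slices)
            if sum(alloc) == units]

actions = enumerate_actions()
assert len(actions) == 36       # matches the number of valid actions in the embodiment
```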
Table 1: System test platform parameters
Processor: Intel i7-6900K 3.2 GHz
Memory: 16 GB DDR
Graphics card: NVIDIA Titan X
Software platform: PyTorch 1.0
The method of the present invention is compared with the DQN-based resource allocation algorithm and the even-allocation method. Figure 2 shows the changes in the system return value obtained by the three methods during wireless resource allocation; it can be seen that, as the number of iterations increases, the method proposed by the present invention has better stability. Note that in this simulation the data packet size of the ultra-reliable low-latency service is drawn uniformly from {6.4, 12.8, 19.2, 25.6, 32} KByte; because the packets are small, the performance requirements of the ultra-reliable low-latency service are easily met, so both the method proposed by the present invention and the DQN-based method achieve a high system return value. Figure 3 shows the case in which the data packet size of the ultra-reliable low-latency service is drawn uniformly from {0.3, 0.4, 0.5, 0.6, 0.7} MByte. Because the packets of the ultra-reliable low-latency service are large, the system return values obtained by all three methods decrease, but the system return value obtained by the method proposed by the present invention remains higher than that of the DQN-based method.
Subsequently, the discount factor γ is reset to 0.75, the number of samples N drawn from the overall return distribution is 30, and the initial value of ∈ is 0.9, decreasing by 0.05 every 100 runs of the algorithm and remaining unchanged once it reaches 0.05. The size N_B of the buffer B is 3000. The input layer of the G network has 3 neurons, its first hidden layer has 512 neurons, its second hidden layer has 512 neurons, and its output layer has 1080 neurons. The input layer of the D network has 50 neurons, its first hidden layer has 256 neurons, its second hidden layer has 256 neurons, and its output layer has 1 neuron. The penalty factor λ in the loss function of the D network is 10. The gradient descent algorithm used to train both the G network and the D network is Adam, with a learning rate of 0.001. The other parameters are ξ = 0.8, K = 10, n_d = 1, m = 32, and C = 50. With these parameter settings, the method of the present invention still provides good stability and a high system return value for wireless network resource allocation.
Specific examples have been used herein to explain the principles and implementations of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may make changes to the specific implementations and the scope of application in accordance with the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

  1. A wireless network resource allocation method based on generative adversarial reinforcement learning, characterized in that the method comprises:
    (1) Initialization of a generator network G and a discriminator network D, specifically comprising the following sub-steps:
    (1.1) The generative adversarial reinforcement learning algorithm comprises two neural networks, namely the generator network G and the discriminator network D, whose weights are randomly initialized from a Gaussian distribution; at the same time, a target network Ĝ is set up, the structure of the target network Ĝ being the same as that of the generator network G, and the weights of the target network Ĝ being initialized by copying the weights of the generator network G;
    (1.2) The generator network G takes the network state s as input and outputs an N_a×N-dimensional vector, which is split sequentially into N_a vectors of dimension N; the discriminator network D takes an N-dimensional vector as input, the N-dimensional vector input to the discriminator network D being taken from the output of the generator network G or computed from the output of the target network Ĝ and the instant reward; the discriminator network D outputs a scalar indicating the authenticity of the input: if the absolute value of the difference between the scalar and 0 is smaller than the absolute value of the difference between the scalar and 1, the discriminator network D judges that the input vector was taken from the output of the generator network G, and if the absolute value of the difference between the scalar and 1 is smaller than the absolute value of the difference between the scalar and 0, the discriminator network D judges that the input vector was computed from the output of the target network Ĝ and the instant reward;
    wherein N denotes the number of samples drawn from Z(s, a); the p-th of the N_a N-dimensional vectors represents sampled values of the overall return distribution of the p-th action; Z(s, a) denotes the distribution of the cumulative return obtained by taking action a in network state s; the network state s is the number of requests of each service type within a time interval; the action a represents the bandwidth allocated to each service type; and N_a is the number of valid actions;
    (2) Performing resource allocation, specifically comprising the following sub-steps:
    (2.1) A radio resource manager obtains the observed value s_t of the network state s at the current time t; the radio resource manager selects an action a_t using the ∈-greedy strategy; after action a_t is executed, the radio resource manager receives a system return value J and observes the observed value s_{t+1} of the network state s at time t+1;
    the radio resource manager selecting action a_t using the ∈-greedy strategy specifically comprises: the radio resource manager obtains a random number from the uniform distribution on (0, 1); if the random number is smaller than ∈, the radio resource manager randomly selects a valid action; if the random number is greater than or equal to ∈, the radio resource manager feeds s_t into the generator network G, obtains sampled values of the cumulative return distribution of each of the N_a actions, then computes the mean of the sampled values of each action's cumulative return distribution, and selects the action corresponding to the largest mean;
    (2.2) The radio resource manager sets two thresholds c_1 and c_2, where c_1 > c_2, and the absolute value ξ of a fixed instant reward, and specifies that when J > c_1, the instant reward at time t is r_t = ξ; when c_2 < J < c_1, the instant reward at time t is r_t = 0; and when J < c_2, the instant reward at time t is r_t = −ξ;
    (2.3) The radio resource manager stores the quadruple (s_t, a_t, r_t, s_{t+1}) in a buffer B of size N_B; if the space of B is full, the quadruple stored first in B is deleted, and the newest quadruple is then stored;
    (3) Each time the resource allocation of step (2) has been executed K times, the quadruples stored in B are used to train the weights of the generator network G and the discriminator network D;
    (3.1) The discriminator network D is trained first, the specific procedure being:
    m quadruples (s_t, a_t, r_t, s_{t+1}) are randomly selected from B as training data;
    the observed values s_t of the network state at the m times t in the quadruples are combined into an m×N_s matrix [s_1, s_2, …, s_m]^T, where s_m denotes the observed network state s_t at the m-th time t; the combined matrix is input to the generator network G to obtain sampled values of the cumulative return distribution of every action under each of the m observed network states s_t; the sampled values corresponding to a_1, a_2, …, a_m are retained and denoted G(s_1), G(s_2), …, G(s_m); N_s is the number of service types, G(s_m) denotes the N return samples obtained by taking action a_m under the observed network state s_t at the m-th time t, and G(s_m) is recorded as the sampled values of the estimated action-value distribution;
    the observed values s_{t+1} of the network state at the m times t+1 in the training data are combined into an m×N_s matrix [s′_1, s′_2, …, s′_m]^T and input to the target network Ĝ to obtain sampled values of the cumulative return distribution of every action under each observed network state s_{t+1}; for each observed network state s_{t+1}, the sampled values producing the largest mean cumulative return are retained and denoted Ĝ(s′_1), Ĝ(s′_2), …, Ĝ(s′_m), where s′_m denotes the observed network state s_{t+1} at the m-th time t+1;
    the sampled values of the target action-value distribution are computed as y_i = r_i + γ·Ĝ(s′_i), where y_i is the sampled value of the target action-value distribution, r_i is the instant reward, and γ is the discount factor;
    m samples are randomly drawn from the uniform distribution on (0, 1) and denoted ε_1, ε_2, …, ε_m, and ŷ_i = ε_i·y_i + (1 − ε_i)·G(s_i) is formed, where ŷ_i is the weighted sum of the sampled values of the target action-value distribution and the sampled values of the estimated action-value distribution;
    the loss function L_D of the discriminator network D is:
    L_D = (1/m) Σ_{i=1}^{m} [ D(G(s_i)) − D(y_i) + λ( ‖∇_{ŷ_i} D(ŷ_i)‖_2 − 1 )² ],
    where D(G(s_i)) denotes the output of the discriminator network D when the input is G(s_i); D(y_i) denotes the output of the discriminator network D when the input is y_i; D(ŷ_i) denotes the output of the discriminator network D when the input is ŷ_i; ∇_{ŷ_i} D(ŷ_i) denotes the gradient of D(ŷ_i) with respect to ŷ_i; and λ is the penalty factor; the weights of the discriminator network D are then trained with a gradient descent algorithm, completing one round of training of the discriminator network D;
    (3.2) After the discriminator network D has been trained n_d times, the latest weights of the discriminator network D are obtained and used in training the generator network G;
    the loss function L_G of the generator network G is L_G = −(1/m) Σ_{i=1}^{m} D(G(s_i));
    a gradient descent algorithm is then applied to train the weights of the generator network G;
    (3.3) Each time the above training procedures (3.1) and (3.2) have been completed C times, the weights of the generator network G are copied to the target network Ĝ, thereby updating the weights of the target network Ĝ;
    (4) After step (3) has been executed N_train times, the training of the discriminator network D and the generator network G is complete; the radio resource manager inputs the current network state into the generator network G, the generator network G outputs samples of the cumulative return distribution corresponding to each resource allocation strategy, the mean of the return samples of each resource allocation strategy is then computed, and the action corresponding to the largest mean is taken as the resource allocation strategy adopted by the radio resource manager.
  2. The wireless network resource allocation method based on generative adversarial reinforcement learning according to claim 1, characterized in that the discount factor γ is 0.75 to 0.9.
  3. The wireless network resource allocation method based on generative adversarial reinforcement learning according to claim 1, characterized in that the value of N is 30 to 55.
  4. The wireless network resource allocation method based on generative adversarial reinforcement learning according to claim 1, characterized in that the initial value of ∈ is 0.9; every 100 executions of the resource allocation step (2), ∈ is reduced by 0.05, and it remains unchanged once it reaches 0.05; and ξ is 0.8 to 1.5.
  5. The wireless network resource allocation method based on generative adversarial reinforcement learning according to claim 1, characterized in that the size N_B of the buffer B is 3000 to 10000.
  6. The wireless network resource allocation method based on generative adversarial reinforcement learning according to claim 1, characterized in that the value of n_d is 1 to 10, and the number m of quadruples is 32 or 64.
  7. The wireless network resource allocation method based on generative adversarial reinforcement learning according to claim 1, characterized in that the penalty factor λ is 10, 20 or 30.
  8. The wireless network resource allocation method based on generative adversarial reinforcement learning according to claim 1, characterized in that the gradient descent algorithm used to train both the generator network G and the discriminator network D is Adam, with a learning rate of 0.001.
  9. The wireless network resource allocation method based on generative adversarial reinforcement learning according to claim 1, characterized in that the number K of resource allocation executions is 10 to 50.
  10. The wireless network resource allocation method based on generative adversarial reinforcement learning according to claim 1, characterized in that the value of N_train is 2000 to 3000.
PCT/CN2020/100753 2019-12-24 2020-07-08 A wireless network resource allocation method based on generative adversarial reinforcement learning WO2021128805A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/708,059 US11452077B2 (en) 2019-12-24 2022-03-30 Wireless network resource allocation method employing generative adversarial reinforcement learning

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911347500.1 2019-12-24
CN201911347500.1A CN111182637B (zh) 2019-12-24 A wireless network resource allocation method based on generative adversarial reinforcement learning

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/708,059 Continuation US11452077B2 (en) 2019-12-24 2022-03-30 Wireless network resource allocation method employing generative adversarial reinforcement learning

Publications (1)

Publication Number Publication Date
WO2021128805A1 true WO2021128805A1 (zh) 2021-07-01

Family

ID=70657430

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/100753 WO2021128805A1 (zh) 2019-12-24 2020-07-08 一种基于生成对抗强化学习的无线网络资源分配方法

Country Status (3)

Country Link
US (1) US11452077B2 (zh)
CN (1) CN111182637B (zh)
WO (1) WO2021128805A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114885426A (zh) * 2022-05-05 2022-08-09 南京航空航天大学 一种基于联邦学习和深度q网络的5g车联网资源分配方法
CN115022231A (zh) * 2022-06-30 2022-09-06 武汉烽火技术服务有限公司 一种基于深度强化学习的最优路径规划的方法和系统

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111182637B (zh) * 2019-12-24 2022-06-21 浙江大学 一种基于生成对抗强化学习的无线网络资源分配方法
US20210350223A1 (en) * 2020-05-07 2021-11-11 International Business Machines Corporation Digital content variations via external reaction
CN111795700A (zh) * 2020-06-30 2020-10-20 浙江大学 一种无人车强化学习训练环境构建方法及其训练系统
US20220051106A1 (en) * 2020-08-12 2022-02-17 Inventec (Pudong) Technology Corporation Method for training virtual animal to move based on control parameters
CN112702760B (zh) * 2020-12-16 2022-03-15 西安电子科技大学 一种估计小区负载方法、系统、介质、设备、终端及应用
CN112512070B (zh) * 2021-02-05 2021-05-11 之江实验室 一种基于图注意力机制强化学习的多基站协同无线网络资源分配方法
CN113473498B (zh) * 2021-06-15 2023-05-19 中国联合网络通信集团有限公司 网络切片资源编排方法、切片编排器及编排系统
US20230102494A1 (en) * 2021-09-24 2023-03-30 Hexagon Technology Center Gmbh Ai training to produce task schedules
CN113811009B (zh) * 2021-09-24 2022-04-12 之江实验室 一种基于时空特征提取的多基站网络资源智能分配方法
CN115118780B (zh) * 2022-06-06 2023-12-01 支付宝(杭州)信息技术有限公司 获取资源分配模型的方法、资源分配方法及对应装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130253A1 (en) * 2017-10-31 2019-05-02 Levi Strauss & Co. Using Neural Networks in Laser Finishing of Apparel
US20190130266A1 (en) * 2017-10-27 2019-05-02 Royal Bank Of Canada System and method for improved neural network training
CN110046712A (zh) * 2019-04-04 2019-07-23 天津科技大学 基于生成模型的隐空间模型化策略搜索学习方法
CN110533221A (zh) * 2019-07-29 2019-12-03 西安电子科技大学 基于生成式对抗网络的多目标优化方法
WO2019237860A1 (zh) * 2018-06-15 2019-12-19 腾讯科技(深圳)有限公司 一种图像标注方法和装置
CN111182637A (zh) * 2019-12-24 2020-05-19 浙江大学 一种基于生成对抗强化学习的无线网络资源分配方法

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102403494B1 (ko) * 2017-04-27 2022-05-27 에스케이텔레콤 주식회사 생성적 대립 네트워크에 기반한 도메인 간 관계를 학습하는 방법
US11038769B2 (en) * 2017-11-16 2021-06-15 Verizon Patent And Licensing Inc. Method and system for virtual network emulation and self-organizing network control using deep generative models
US11475607B2 (en) * 2017-12-19 2022-10-18 Telefonaktiebolaget Lm Ericsson (Publ) Radio coverage map generation
CN108401254A (zh) * 2018-02-27 2018-08-14 苏州经贸职业技术学院 一种基于强化学习的无线网络资源分配方法
US11048974B2 (en) * 2019-05-06 2021-06-29 Agora Lab, Inc. Effective structure keeping for generative adversarial networks for single image super resolution
CN110493826B (zh) * 2019-08-28 2022-04-12 重庆邮电大学 一种基于深度强化学习的异构云无线接入网资源分配方法
US11152785B1 (en) * 2019-09-17 2021-10-19 X Development Llc Power grid assets prediction using generative adversarial networks
TWI753325B (zh) * 2019-11-25 2022-01-21 國立中央大學 產生機器翻譯模型的計算裝置及方法及機器翻譯裝置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130266A1 (en) * 2017-10-27 2019-05-02 Royal Bank Of Canada System and method for improved neural network training
US20190130253A1 (en) * 2017-10-31 2019-05-02 Levi Strauss & Co. Using Neural Networks in Laser Finishing of Apparel
WO2019237860A1 (zh) * 2018-06-15 2019-12-19 腾讯科技(深圳)有限公司 一种图像标注方法和装置
CN110046712A (zh) * 2019-04-04 2019-07-23 天津科技大学 基于生成模型的隐空间模型化策略搜索学习方法
CN110533221A (zh) * 2019-07-29 2019-12-03 西安电子科技大学 基于生成式对抗网络的多目标优化方法
CN111182637A (zh) * 2019-12-24 2020-05-19 浙江大学 一种基于生成对抗强化学习的无线网络资源分配方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU HONGJIE, DAI DADONG; FU QIMING; CHEN JIANPING; LU WEIZHONG: "Research on Combination of Reinforcement Learning and Generative Adversarial Networks", COMPUTER ENGINEERING AND APPLICATIONS, HUABEI JISUAN JISHU YANJIUSUO, CN, vol. 55, no. 10, 1 January 2019 (2019-01-01), CN, pages 36 - 44, XP055824262, ISSN: 1002-8331, DOI: 10.3778/j.issn.1002-8331.1812-0268 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114885426A (zh) * 2022-05-05 2022-08-09 南京航空航天大学 一种基于联邦学习和深度q网络的5g车联网资源分配方法
CN114885426B (zh) * 2022-05-05 2024-04-16 南京航空航天大学 一种基于联邦学习和深度q网络的5g车联网资源分配方法
CN115022231A (zh) * 2022-06-30 2022-09-06 武汉烽火技术服务有限公司 一种基于深度强化学习的最优路径规划的方法和系统
CN115022231B (zh) * 2022-06-30 2023-11-03 武汉烽火技术服务有限公司 一种基于深度强化学习的最优路径规划的方法和系统

Also Published As

Publication number Publication date
US20220232531A1 (en) 2022-07-21
CN111182637A (zh) 2020-05-19
CN111182637B (zh) 2022-06-21
US11452077B2 (en) 2022-09-20

Similar Documents

Publication Publication Date Title
WO2021128805A1 (zh) 一种基于生成对抗强化学习的无线网络资源分配方法
Ma et al. FedSA: A semi-asynchronous federated learning mechanism in heterogeneous edge computing
Wei et al. Joint optimization of caching, computing, and radio resources for fog-enabled IoT using natural actor–critic deep reinforcement learning
CN109947545B (zh) 一种基于用户移动性的任务卸载及迁移的决策方法
Guo et al. Cloud resource scheduling with deep reinforcement learning and imitation learning
Zhang et al. A multi-agent reinforcement learning approach for efficient client selection in federated learning
CN111754000A (zh) 质量感知的边缘智能联邦学习方法及系统
CN111461226A (zh) 对抗样本生成方法、装置、终端及可读存储介质
CN111629380B (zh) 面向高并发多业务工业5g网络的动态资源分配方法
CN110213097B (zh) 一种基于资源动态分配的边缘服务供应优化方法
CN113784410B (zh) 基于强化学习td3算法的异构无线网络垂直切换方法
Wang et al. Dual-driven resource management for sustainable computing in the blockchain-supported digital twin IoT
CN114340016A (zh) 一种电网边缘计算卸载分配方法及系统
CN113407249B (zh) 一种面向位置隐私保护的任务卸载方法
Sun et al. Edge learning with timeliness constraints: Challenges and solutions
CN114546608A (zh) 一种基于边缘计算的任务调度方法
CN115879542A (zh) 一种面向非独立同分布异构数据的联邦学习方法
Hu et al. Dynamic task offloading in MEC-enabled IoT networks: A hybrid DDPG-D3QN approach
Wang et al. Joint service caching, resource allocation and computation offloading in three-tier cooperative mobile edge computing system
Cui et al. Multiagent reinforcement learning-based cooperative multitype task offloading strategy for internet of vehicles in B5G/6G network
Cheng et al. GFL: Federated learning on non-IID data via privacy-preserving synthetic data
Singhal et al. Greedy Shapley Client Selection for Communication-Efficient Federated Learning
CN113543160A (zh) 5g切片资源配置方法、装置、计算设备及计算机存储介质
CN110392377A (zh) 一种5g超密集组网资源分配方法及装置
CN115499876A (zh) Msde场景下基于dqn算法的计算卸载策略

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20905987

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20905987

Country of ref document: EP

Kind code of ref document: A1