TWI812371B - Resource allocation method in downlink pattern division multiple access system based on artificial intelligence


Info

Publication number
TWI812371B
TWI812371B (Application number TW111128417A)
Authority
TW
Taiwan
Prior art keywords
allocation
subcarrier
current
allocated
power
Prior art date
Application number
TW111128417A
Other languages
Chinese (zh)
Other versions
TW202406390A (en)
Inventor
陳曉華
周廣誌
Original Assignee
國立成功大學
Priority date
Filing date
Publication date
Application filed by 國立成功大學 filed Critical 國立成功大學
Priority to TW111128417A priority Critical patent/TWI812371B/en
Application granted granted Critical
Publication of TWI812371B publication Critical patent/TWI812371B/en
Publication of TW202406390A publication Critical patent/TW202406390A/en

Landscapes

  • Mobile Radio Communication Systems (AREA)
  • Small-Scale Networks (AREA)

Abstract

A resource allocation method: a base station allocates N subcarriers to K UEs to obtain N×K current subcarrier allocation results, then obtains N×K current allocated powers and obtains multiple action values from an action reinforcement learning network. The base station determines whether the action values are all less than or equal to 0. When the determination result is no, the base station selects a target allocation action to obtain a plurality of updated allocated powers, generates and stores a piece of training data, and then trains at least one reinforcement learning network, including the action reinforcement learning network, according to a plurality of pieces of target training data. These operations are repeated until the action values are all less than or equal to 0. When the determination result is yes, the base station calculates a candidate spectral efficiency according to the currently allocated powers. The above actions are repeated to obtain P candidate spectral efficiencies, from which a target spectral efficiency is selected.

Description

Resource allocation method for a downlink pattern division multiple access system based on an artificial intelligence algorithm

The present invention relates to a resource allocation method, and in particular to a resource allocation method for a downlink pattern division multiple access (PDMA) system based on an artificial intelligence algorithm.

In existing orthogonal multiple access (OMA), each user can only use one specific resource block, such as a frequency band, a time slot, or an orthogonal spreading code. With the rapid development of mobile communications, however, the demand for spectral efficiency keeps growing, and OMA can clearly no longer satisfy the needs of today's users.

To meet the demand for higher spectral efficiency, non-orthogonal multiple access (NOMA) technologies have been proposed, such as Multi-User Superposition Transmission (MUST) and Pattern Division Multiple Access (PDMA).

MUST is a single-carrier NOMA technology. In MUST, multiple users are allowed to reuse the same resource block through superposition in the power domain, code domain, or constellation domain, which improves spectral efficiency and the number of admitted users. When a MUST system transmits information, superposition coding combines the signals of multiple users with different power allocations, and the receiver uses successive interference cancellation (SIC) to separate the superimposed signals. The larger the power difference between user signals, the easier it is to distinguish them and the better the error rate, so a reasonable allocation of signal power is particularly important for a MUST system.

Unlike MUST, PDMA is a multi-carrier NOMA technology. When a PDMA system transmits information, in addition to superposition coding that combines the signals of multiple users with different power allocations, a pattern matrix design maps the same coded bits of a user onto different subcarriers, thereby achieving diversity and multiplexing. A reasonable allocation of both signal power and subcarriers is therefore particularly important for a PDMA system.

However, existing PDMA systems cannot perform optimal power and subcarrier allocation according to the dynamic scenarios of the system.

Therefore, an object of the present invention is to provide a resource allocation method for a downlink pattern division multiple access system based on an artificial intelligence algorithm that performs optimal power and subcarrier allocation according to the dynamic scenarios of the system.

Accordingly, the resource allocation method for a downlink pattern division multiple access system based on an artificial intelligence algorithm of the present invention is implemented by a base station. The base station is communicatively connected with K user equipments (UEs) via a wireless channel and stores a plurality of subcarrier allocation actions, a plurality of power allocation actions, and channel state information including N×K channel strengths of the UEs on N subcarriers respectively, where K>1 and N>1. The method comprises a step (A), a step (B), a step (C), a step (D), a step (E), a step (F), a step (G), a step (H), a step (I), a step (J), a step (K), a step (L), and a step (M).

In step (A), the base station allocates the subcarriers to the UEs to obtain N×K current subcarrier allocation results indicating whether the UEs are allocated the subcarriers.

In step (B), the base station obtains, according to the current subcarrier allocation results and the channel state information, N×K current allocated powers respectively corresponding to the current subcarrier allocation results.

In step (C), the base station inputs the subcarrier allocation actions, the power allocation actions, the current subcarrier allocation results, and the current allocated powers into an action reinforcement learning network, so that the action reinforcement learning network outputs a plurality of action values respectively corresponding to the power allocation actions and the subcarrier allocation actions.

In step (D), the base station determines whether the action values are all less than or equal to 0.

In step (E), when it is determined that one of the action values is greater than 0, the base station selects a target allocation action from the subcarrier allocation actions and the power allocation actions.

In step (F), the base station obtains, according to the current subcarrier allocation results, the current allocated powers, and the target allocation action, a plurality of updated subcarrier allocation results respectively corresponding to the current subcarrier allocation results and a plurality of updated allocated powers respectively corresponding to the current allocated powers.

In step (G), the base station calculates a reward value according to the current allocated powers and the updated allocated powers.

In step (H), the base station generates and stores a piece of training data including the current subcarrier allocation results, the current allocated powers, the target allocation action, the reward value, the updated subcarrier allocation results, and the updated allocated powers.

In step (I), the base station selects a plurality of pieces of target training data from the stored training data and trains at least one reinforcement learning network according to the target training data, the at least one reinforcement learning network including the action reinforcement learning network.

In step (J), the base station takes the updated subcarrier allocation results and the updated allocated powers as the current subcarrier allocation results and the current allocated powers, respectively, and repeats steps (C) to (I) until the action values are all less than or equal to 0.

In step (K), when it is determined that the action values are all less than or equal to 0, the base station calculates a candidate spectral efficiency according to the current allocated powers and stores the current subcarrier allocation results, the current allocated powers, and the candidate spectral efficiency.

In step (L), steps (A) to (K) are repeated P times to obtain P candidate spectral efficiencies, where P>1.

In step (M), the base station obtains a highest target spectral efficiency from the candidate spectral efficiencies.

The effect of the present invention is that the base station uses the action reinforcement learning network to record and learn in different scenarios in order to obtain the best allocation action with the largest reward value, further obtains the candidate spectral efficiencies, and then obtains the highest target spectral efficiency from the candidate spectral efficiencies; the subcarrier allocation and power allocation corresponding to the target spectral efficiency are optimal.

11: Base station

12: User equipment (UE)

100: Wireless channel

21~34: Steps

241, 242: Sub-steps

281~289: Sub-steps

301~303: Sub-steps

321~323: Sub-steps

Other features and effects of the present invention will become apparent from the embodiments described with reference to the drawings, in which: Fig. 1 is a block diagram illustrating a base station used to implement an embodiment of the resource allocation method for a downlink pattern division multiple access system based on an artificial intelligence algorithm of the present invention; Fig. 2 is a flow chart illustrating the embodiment of the method; Fig. 3 is a flow chart illustrating the sub-steps of step 24 in Fig. 2; Fig. 4 is a flow chart illustrating the sub-steps of step 28 in Fig. 2; Fig. 5 is a flow chart illustrating the sub-steps of step 30 in Fig. 2; and Fig. 6 is a flow chart illustrating the sub-steps of step 32 in Fig. 2.

Before the present invention is described in detail, it should be noted that in the following description similar elements are designated by the same reference numerals.

Referring to Fig. 1, an embodiment of the resource allocation method for a downlink pattern division multiple access system based on an artificial intelligence algorithm of the present invention is executed by a base station 11. The base station 11 supports downlink power-domain pattern division multiple access, is communicatively connected with K user equipments (UEs) 12 via a wireless channel 100, and superimposes the signals of the UEs 12 on N subcarriers by using a different power level for each UE 12, where K>1 and N>1. It is worth noting that in this embodiment the base station 11 is, for example, a single-antenna base station (BS) and the UEs 12 are, for example, smartphones, but the invention is not limited thereto.

The base station 11 stores a plurality of subcarrier allocation actions, a plurality of power allocation actions, and channel state information including the channel strengths corresponding to the K UEs 12, wherein the channel state information is estimated by the base station 11 from uplink pilots.

Figs. 1 and 2 illustrate this embodiment of the resource allocation method for a downlink pattern division multiple access system based on an artificial intelligence algorithm of the present invention; the steps shown in Fig. 2 are described in detail below.

In step 21, the base station 11 initializes a plurality of reinforcement learning networks.

It is worth noting that in this embodiment the reinforcement learning networks are, for example, two Q-learning networks, namely an update network and a target network. Each reinforcement learning network includes, for example, a fully connected layer with fifty nodes, the activation function is, for example, the rectified linear unit (ReLU), the learning algorithm is set, for example, to adaptive moment estimation (Adam), and the loss function is set, for example, to the mean-square error (MSE). In other embodiments the reinforcement learning networks may instead include a look-up table (Q table), the learning algorithm may be stochastic gradient descent (SGD), momentum gradient descent (Momentum), or the Adagrad algorithm, and the loss function may be a squared loss or an absolute-value loss. Moreover, the type of reinforcement learning network is not limited to a Q-learning network, and the base station 11 may also initialize only one reinforcement learning network, but the invention is not limited thereto.
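For illustration only, the following is a minimal sketch of the two Q-networks described above (an update network and a target network, each with one fifty-node fully connected hidden layer, ReLU activation, the Adam optimizer, and an MSE loss), written in Python with PyTorch. The flattened state layout, the input dimension, and the learning rate are assumptions for the example, not values specified by this embodiment.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 50):
        super().__init__()
        # One fully connected hidden layer with 50 nodes and ReLU activation.
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # one Q value per allocation action

# Example dimensions for K = 6 UEs and N = 4 subcarriers (see below):
N, K = 4, 6
state_dim = 2 * N * K                    # flattened allocation results + power coefficients (assumed layout)
num_actions = 2 * N * K + 3 * N * K      # subcarrier actions + power actions
update_net = QNetwork(state_dim, num_actions)
target_net = QNetwork(state_dim, num_actions)
target_net.load_state_dict(update_net.state_dict())   # start with identical parameters
optimizer = torch.optim.Adam(update_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
```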

In step 22, the base station 11 determines whether P iterations have been completed. When the base station 11 determines that P iterations have not yet been completed, the flow proceeds to step 23; when the base station 11 determines that P iterations have been completed, the flow proceeds to step 34. It is worth noting that in this embodiment the base station counts the iterations with a loop counter (not shown), where P=20000, but the invention is not limited thereto.

In step 23, the base station 11 allocates the subcarriers to the UEs 12 to obtain N×K current subcarrier allocation results indicating whether the UEs 12 are allocated the subcarriers.

It is worth noting that the base station 11 allocates the subcarriers to the UEs 12 according to a characteristic pattern matrix $M_{PDMA}=(m_{n,k,t})_{N\times K}$, and the current subcarrier allocation results $m_{n,k,t}$ satisfy the following conditions:

$\sum_{n=1}^{N} m_{n,k,t} \ge 1$, and $1 \le \sum_{k=1}^{K} m_{n,k,t} \le N_{max}$,

where $m_{n,k,t}$ is the current subcarrier allocation result indicating whether the k-th UE 12 is allocated the n-th subcarrier at the current time t, $m_{n,k,t}\in\{0,1\}$, $m_{n,k,t}=1$ means that the k-th UE 12 is allocated the n-th subcarrier at the current time t, $m_{n,k,t}=0$ means that the k-th UE 12 is not allocated the n-th subcarrier at the current time t, and $N_{max}$ is the maximum number of UEs on each subcarrier. The characteristic pattern matrix $M_{PDMA}$ can be written as:

$M_{PDMA}=\begin{bmatrix} m_{1,1,t} & \cdots & m_{1,K,t} \\ \vdots & \ddots & \vdots \\ m_{N,1,t} & \cdots & m_{N,K,t} \end{bmatrix}$

It should further be noted that the number of subcarriers allocated to each UE 12 and the number of UEs on each subcarrier must take into account correct detection at the receiver and inter-UE interference. Taking K=6 and N=4 as an example, the number L of subcarriers allocated to each UE 12 satisfies $1 \le L \le 4$, and the number U of UEs on each subcarrier satisfies $2 \le U \le 5$.
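As a concrete aid, the sketch below checks these constraints on a candidate characteristic pattern matrix; the helper name and the sample matrix are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def is_valid_pattern(M: np.ndarray, n_max: int) -> bool:
    """M is the N x K characteristic pattern matrix with entries m[n, k] in {0, 1}."""
    per_ue = M.sum(axis=0)           # subcarriers allocated to each UE
    per_subcarrier = M.sum(axis=1)   # UEs superimposed on each subcarrier
    return bool((per_ue >= 1).all()
                and (per_subcarrier >= 1).all()
                and (per_subcarrier <= n_max).all())

# K = 6, N = 4 example: every UE gets at least one subcarrier and every
# subcarrier carries between 1 and N_max UEs.
M_pdma = np.array([[1, 0, 1, 0, 1, 1],
                   [0, 1, 1, 1, 0, 1],
                   [1, 1, 0, 1, 1, 0],
                   [0, 1, 1, 0, 1, 1]])
print(is_valid_pattern(M_pdma, n_max=5))   # True
```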

In step 24, the base station 11 obtains, according to the current subcarrier allocation results and the channel state information, N×K current allocated powers respectively corresponding to the current subcarrier allocation results.

Referring also to Fig. 3, step 24 includes sub-steps 241 and 242, which are described below.

In sub-step 241, for each subcarrier, the base station 11 sorts the UEs 12 in descending order of their channel strengths on that subcarrier in the channel state information. The ordering used by the base station 11 is expressed as $|h_{n,k,t}|^2 > |h_{n,k+1,t}|^2$, where $k\in\{1,2,...,K\}$, $n\in\{1,2,...,N\}$, $|h_{n,k,t}|^2$ is the channel strength at the current time t of the UE 12 of the k-th order on the n-th subcarrier, and $|h_{n,k+1,t}|^2$ is the channel strength at the current time t of the UE 12 of the (k+1)-th order on the n-th subcarrier.

It is worth noting that in this embodiment, because SIC requires, for maximization of the signal-to-interference-plus-noise ratio (SINR), that the allocated powers of the UEs 12 be inversely proportional to their channel strengths, and because during decoding the base station 11 decodes in descending order of the power allocation coefficients of the UEs 12, the UEs 12 are first sorted in descending order of channel strength so that the base station 11 can subsequently allocate power and decode in ascending order of that ranking; the invention is not limited thereto.

In sub-step 242, for each subcarrier, the base station 11 allocates power sequentially according to the order of the UEs 12 allocated to that subcarrier, and allocates no power (i.e., zero power) to the UEs 12 not allocated to that subcarrier, so as to obtain the current allocated powers.

The current allocated powers $\nu_{n,k,t}$ satisfy the following conditions:

$\sum_{n=1}^{N}\sum_{k=1}^{K}\nu_{n,k,t} \le 1$, $0 \le \nu_{n,k,t} \le 1$, $\nu_{n,k,t} > \nu_{n,k',t}$ if $m_{n,k,t}=m_{n,k',t}=1$, and $\nu_{n,k,t}=0$ if $m_{n,k,t}=0$,

where $n\in\{1,2,...,N\}$, $k,k'\in\{1,2,...,K\}$, $k>k'$, $m_{n,k,t}$ is the subcarrier allocation result indicating whether the k-th UE 12 is allocated the n-th subcarrier at the current time t, $m_{n,k,t}\in\{0,1\}$, $\nu_{n,k',t}$ is the coefficient of the current allocated power allocated at the current time t to the UE 12 of the k'-th order on the n-th subcarrier, and $\nu_{n,k,t}$ is the coefficient of the current allocated power allocated at the current time t to the UE 12 of the k-th order on the n-th subcarrier.
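The following sketch illustrates sub-steps 241 and 242: on each subcarrier the allocated UEs are ordered by descending channel strength and later (weaker) UEs receive strictly larger power coefficients, with the coefficients kept non-negative, at most 1 each, zero where no subcarrier is allocated, and summing to at most 1 overall. The specific per-rank weighting is an assumption; the embodiment only states the ordering constraints.

```python
import numpy as np

def allocate_power(M: np.ndarray, H: np.ndarray) -> np.ndarray:
    """M: N x K allocation results m[n, k]; H: N x K channel strengths |h[n, k]|^2."""
    N, K = M.shape
    V = np.zeros((N, K))
    for n in range(N):
        ues = np.where(M[n] == 1)[0]
        if ues.size == 0:
            continue
        order = ues[np.argsort(-H[n, ues])]        # strongest channel first
        coeffs = np.arange(1, order.size + 1, dtype=float)
        V[n, order] = coeffs / coeffs.sum()        # weaker UEs get larger coefficients
    return V / N                                   # keep the overall sum <= 1

H = np.abs(np.random.randn(4, 6)) ** 2             # example channel strengths
M = np.ones((4, 6), dtype=int)
V = allocate_power(M, H)
print(round(V.sum(), 6), V.shape)                  # <= 1.0, (4, 6)
```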

In step 25, the base station 11 inputs the subcarrier allocation actions, the power allocation actions, the current subcarrier allocation results, and the current allocated powers into an action reinforcement learning network among the reinforcement learning networks, so that the action reinforcement learning network outputs a plurality of action values respectively corresponding to the power allocation actions and the subcarrier allocation actions.

It is worth noting that in this embodiment the action reinforcement learning network is the update network and the action values are Q values. Each subcarrier allocation action adjusts only one subcarrier of one UE 12 at a time, and the subcarrier allocation actions can be expressed by $n_{n,k,t}\in\{0,1\}$, where $n_{n,k,t}=1$ means that the n-th subcarrier is allocated to the k-th UE 12 at the current time t (if the n-th subcarrier was already allocated to the k-th UE 12 at the previous time, the subcarrier allocation remains unchanged) and $n_{n,k,t}=0$ means that the n-th subcarrier is not allocated to the k-th UE 12 at the current time t (if the n-th subcarrier was already not allocated to the k-th UE 12 at the previous time, the subcarrier allocation remains unchanged). The number of subcarrier allocation actions is 2×N×K. In addition, each power allocation action adjusts only one power coefficient at a time, and the power allocation actions can be expressed by $\delta_{n,k,t}\in\{\delta,0,-\delta\}$ with $0<\delta<1$, where $\delta_{n,k,t}=\delta$ means increasing the power coefficient $\nu_{n,k,t}$ by $\delta$, $\delta_{n,k,t}=0$ means leaving the power coefficient $\nu_{n,k,t}$ unchanged, and $\delta_{n,k,t}=-\delta$ means decreasing the power coefficient $\nu_{n,k,t}$ by $\delta$. The number of power allocation actions is 3×N×K, but the invention is not limited thereto.
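As a small illustration of this action space, the sketch below enumerates the 2×N×K subcarrier allocation actions and 3×N×K power allocation actions as tuples; the tuple encoding and the value of δ are assumptions made for the example.

```python
from itertools import product

N, K, delta = 4, 6, 0.05   # delta is an assumed step size with 0 < delta < 1

# 2*N*K subcarrier allocation actions: set m[n, k] to 0 or to 1.
subcarrier_actions = [("subcarrier", n, k, b)
                      for n, k, b in product(range(N), range(K), (0, 1))]
# 3*N*K power allocation actions: change v[n, k] by +delta, 0 or -delta.
power_actions = [("power", n, k, d)
                 for n, k, d in product(range(N), range(K), (delta, 0.0, -delta))]
actions = subcarrier_actions + power_actions
print(len(actions))        # 2*4*6 + 3*4*6 = 120
```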

In step 26, the base station 11 determines whether the action values are all less than or equal to 0. When the base station 11 determines that one of the action values is greater than 0, the flow proceeds to step 27; when the base station 11 determines that all of the action values are less than or equal to 0, the flow proceeds to step 33.

It should be noted in particular that in step 26 of this embodiment, when determining whether the action values are all less than or equal to 0, only the action values output by the update network applicable to the current overload ratio are observed, and the output values of the target network under the current overload ratio are not consulted. Therefore, in step 25 the base station 11 inputs the subcarrier allocation actions, the power allocation actions, and the current allocated powers only into the update network.

It should further be noted that if the action values are all less than or equal to 0, taking any power allocation action in the current state is considered to lower the long-term expected reward; since the reward should be as high as possible, the power allocation at this point is judged to be the optimal result, no further power allocation action is taken, and the flow proceeds to step 33.

In step 27, the base station 11 selects a target allocation action from the subcarrier allocation actions and the power allocation actions, wherein the probability that the target allocation action is selected randomly is $P_1$ and the probability that the target allocation action is the action with the highest of the action values is $P_2$, with $P_1+P_2=1$ and $P_1<P_2$. It is worth noting that in this embodiment $P_1$ is 10% and $P_2$ is 90%, but the invention is not limited thereto; in other embodiments the target allocation action may also be selected solely according to the current state, or be the action corresponding to the highest of the action values.
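A minimal sketch of the selection rule of step 27, with P1 = 10% random exploration and P2 = 90% choosing the action with the highest Q value; the function name is an assumption.

```python
import random

def choose_target_action(q_values, actions, p_random=0.1):
    """q_values[i] is the Q value output by the update network for actions[i]."""
    if random.random() < p_random:          # P1: explore with a random action
        return random.choice(actions)
    best = max(range(len(actions)), key=lambda i: q_values[i])
    return actions[best]                    # P2: exploit the highest Q value
```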

In step 28, the base station 11 obtains, according to the current subcarrier allocation results, the current allocated powers, and the target allocation action, a plurality of updated subcarrier allocation results respectively corresponding to the current subcarrier allocation results and a plurality of updated allocated powers respectively corresponding to the current allocated powers.

Referring also to Fig. 4, step 28 includes sub-steps 281 to 289, which are described below.

In sub-step 281, the base station 11 determines whether the target allocation action is a subcarrier allocation action. When the base station 11 determines that the target allocation action is a subcarrier allocation action, the flow proceeds to sub-step 282; when the base station 11 determines that the target allocation action is not a subcarrier allocation action, meaning that it is a power allocation action, the flow proceeds to sub-step 286.

In sub-step 282, the base station 11 obtains, according to the target allocation action, N×K replacement subcarrier allocation results respectively corresponding to the current subcarrier allocation results.

In sub-step 283, the base station 11 determines whether the replacement subcarrier allocation results satisfy a plurality of subcarrier allocation conditions. When the base station 11 determines that the replacement subcarrier allocation results do not satisfy one of the subcarrier allocation conditions, the flow proceeds to sub-step 284; when the base station 11 determines that the replacement subcarrier allocation results satisfy the subcarrier allocation conditions, the flow proceeds to sub-step 285.

It is worth noting that the subcarrier allocation conditions include:

$\sum_{n=1}^{N} m_{n,k,t+1} \ge 1$, and $1 \le \sum_{k=1}^{K} m_{n,k,t+1} \le N_{max}$,

where $m_{n,k,t+1}$ is the replacement subcarrier allocation result at the next time t+1 for the UE 12 of the k-th order on the n-th subcarrier, $m_{n,k,t+1}\in\{0,1\}$, $m_{n,k,t+1}=1$ means that the k-th UE 12 is allocated the n-th subcarrier at the next time t+1, $m_{n,k,t+1}=0$ means that the k-th UE 12 is not allocated the n-th subcarrier at the next time t+1, and $N_{max}$ is the maximum number of UEs on each subcarrier, but the invention is not limited thereto.

In sub-step 284, the base station 11 takes the current subcarrier allocation results and the current allocated powers as the updated subcarrier allocation results and the updated allocated powers, respectively; that is, the subcarrier allocation results and the allocated powers remain unchanged.

In sub-step 285, the base station 11 takes the replacement subcarrier allocation results as the updated subcarrier allocation results and obtains the updated allocated powers according to the updated subcarrier allocation results and the channel state information.

It should be noted that the way the base station 11 obtains the updated allocated powers in sub-step 285 is similar to the way the current allocated powers are obtained in sub-steps 241 and 242, and is therefore not repeated here.

In sub-step 286, the base station 11 performs the target allocation action on the current allocated powers to obtain a plurality of replacement allocated powers respectively corresponding to the current allocated powers.

It should be noted in particular that if the target allocation action is a power allocation action, whether it is an action selected according to the current state or the action $\delta_{n,k,t}$ corresponding to the highest of the action values, the corresponding current subcarrier allocation result satisfies $m_{n,k,t}=1$, and the current subcarrier allocation results corresponding to the coefficient $\nu_{n,k'',t}$ to be increased by $\delta$ and the coefficient $\nu_{n,k',t}$ to be decreased by $\delta$ satisfy $m_{n,k'',t}=m_{n,k',t}=1$.

In sub-step 287, the base station 11 determines whether the replacement allocated powers satisfy a plurality of power allocation conditions. When the base station 11 determines that the replacement allocated powers do not satisfy one of the power allocation conditions, the flow proceeds to sub-step 288; when the base station 11 determines that the replacement allocated powers satisfy the power allocation conditions, the flow proceeds to sub-step 289.

It is worth noting that the power allocation conditions include:

$\sum_{n=1}^{N}\sum_{k=1}^{K}\nu_{n,k,t+1} \le 1$, $0 \le \nu_{n,k,t+1} \le 1$, $\nu_{n,k',t+1} < \nu_{n,k,t+1}$ if $m_{n,k,t+1}=m_{n,k',t+1}=1$, and $\nu_{n,k,t+1}=0$ if $m_{n,k,t+1}=0$,

where $k,k'\in\{1,2,...,K\}$, $k>k'$, $\nu_{n,k',t+1}$ is the coefficient of the replacement allocated power allocated at the next time t+1 to the UE 12 of the k'-th order on the n-th subcarrier, and $\nu_{n,k,t+1}$ is the coefficient of the replacement allocated power allocated at the next time t+1 to the UE 12 of the k-th order on the n-th subcarrier.

In sub-step 288, the base station 11 takes the current subcarrier allocation results and the current allocated powers as the updated subcarrier allocation results and the updated allocated powers, respectively; that is, the subcarrier allocation results and the allocated powers remain unchanged.

In sub-step 289, the base station 11 takes the current subcarrier allocation results and the replacement allocated powers as the updated subcarrier allocation results and the updated allocated powers, respectively.
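The sketch below condenses sub-steps 281 to 289: the chosen target action is applied tentatively, and if the resulting allocation violates the subcarrier or power conditions the current state is kept unchanged. It simplifies the embodiment in two ways: the per-subcarrier power-ordering check and the re-derivation of powers for a newly allocated subcarrier (sub-step 285) are omitted, and the helper names are assumptions.

```python
import numpy as np

def constraints_hold(M, V, n_max):
    """Combined check: each UE keeps >= 1 subcarrier, each subcarrier carries
    1..n_max UEs, power coefficients lie in [0, 1], sum to at most 1, and are
    zero where no subcarrier is allocated."""
    ok_pattern = ((M.sum(axis=0) >= 1).all()
                  and (M.sum(axis=1) >= 1).all()
                  and (M.sum(axis=1) <= n_max).all())
    ok_power = (V.min() >= 0 and V.max() <= 1 and V.sum() <= 1
                and not np.any((M == 0) & (V != 0)))
    return bool(ok_pattern and ok_power)

def apply_target_action(M, V, action, n_max):
    """Tentatively apply the action; fall back to the unchanged state on violation."""
    kind, n, k, value = action
    M_new, V_new = M.copy(), V.copy().astype(float)
    if kind == "subcarrier":
        M_new[n, k] = value
        if value == 0:
            V_new[n, k] = 0.0            # no power on a de-allocated subcarrier
    else:                                 # power action: +delta / 0 / -delta
        V_new[n, k] += value
    if constraints_hold(M_new, V_new, n_max):
        return M_new, V_new               # updated allocation (sub-steps 285/289)
    return M, V                           # keep the current allocation (284/288)
```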

In step 29, the base station 11 determines whether an overload ratio, relating the number of UEs 12 currently communicatively connected to the base station 11 to the number of subcarriers onto which the UEs' 12 signals are superimposed, equals K/N. When the base station 11 determines that the overload ratio is K/N, the flow proceeds to step 30; when the base station 11 determines that the overload ratio is not K/N, the flow returns to step 21.

It should be noted in particular that the overload ratio is the number of UEs 12 currently communicatively connected to the base station 11 divided by the number of subcarriers onto which the UEs' 12 signals are superimposed. In this embodiment the number and positions of the UEs communicating with the base station 11 are not fixed, and the base station 11 adjusts the resource allocation scheme according to the number of UEs 12. Therefore, when uplink pilot estimation reveals that the overload ratio is no longer K/N, i.e., the overload ratio has changed (letting the changed overload ratio be K'/N, where K'>1 and K'≠K), the base station 11 stores historical reinforcement-learning-network information that includes the reinforcement learning networks and corresponds to the overload ratio K/N, and determines whether target historical reinforcement-learning-network information corresponding to the overload ratio K'/N has been stored. If the target historical reinforcement-learning-network information has been stored, it is loaded and the flow proceeds to step 22; otherwise the flow returns to step 21, and the base station 11 initializes the reinforcement learning networks to serve as the reinforcement learning networks applicable to the overload ratio K'/N.
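One possible way to keep the per-overload-ratio bookkeeping described above is sketched below: the networks trained for a given K/N are cached and reloaded when that overload ratio recurs, and fresh networks are initialised otherwise. The dictionary cache and function names are assumptions, not the patent's own data structures.

```python
network_cache = {}   # (K', N) -> (update_net, target_net) trained for that overload ratio

def networks_for(k_users, n_subcarriers, make_networks):
    key = (k_users, n_subcarriers)
    if key not in network_cache:
        network_cache[key] = make_networks()   # step 21: initialise new networks
    return network_cache[key]                  # otherwise reload the stored networks
```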

In step 30, the base station 11 calculates a reward value according to the current subcarrier allocation results, the current allocated powers, the updated subcarrier allocation results, and the updated allocated powers.

Referring also to Fig. 5, step 30 includes sub-steps 301 to 303, which are described below.

In sub-step 301, the base station 11 calculates a first spectral efficiency $f_t$ according to the current subcarrier allocation results and the current allocated powers, where the first spectral efficiency $f_t$ is expressed as:

$f_t = f(s_t) = \sum_{n=1}^{N}\sum_{k=1}^{K} R_{n,k,t}$, with $R_{n,k,t} = B_n \log_2(1+\rho_{n,k,t})$ and

$\rho_{n,k,t} = \dfrac{P_T\,\nu_{n,k,t}\,|h_{n,k,t}|^2}{|h_{n,k,t}|^2\sum_{j=1}^{k-1} P_T\,\nu_{n,j,t}\,m_{n,j,t} + \beta\,|h_{n,k,t}|^2\sum_{j=k+1}^{K} P_T\,\nu_{n,j,t}\,m_{n,j,t} + \sigma^2}$,

where $s_t=\{M_t,V_t\}$ is the set of the current subcarrier allocation results and the current allocated powers, $M_t=\{m_{1,1,t},...,m_{n,k,t},...,m_{N,K,t}\}$ are the current subcarrier allocation results, $V_t=\{\nu_{1,1,t},...,\nu_{n,k,t},...,\nu_{N,K,t}\}$ are the current allocated powers, $R_{n,k,t}$ is the channel capacity of the k-th UE 12 on the n-th subcarrier at the current time t, $B_n$ is the bandwidth of the n-th subcarrier, $\rho_{n,k,t}$ is the signal-to-interference-plus-noise ratio of the k-th UE 12 on the n-th subcarrier at the current time t, $\nu_{n,j,t}$ is the coefficient of the current allocated power allocated at the current time t to the UE 12 of the j-th order on the n-th subcarrier, $m_{n,j,t}$ is the current subcarrier allocation result indicating whether the j-th UE 12 is allocated the n-th subcarrier at the current time t, $P_T$ is the total power allocated by the base station 11, $\beta$ is the SIC residual coefficient, and $\sigma^2$ is the additive white Gaussian noise (AWGN) power.

It should further be noted that, because a UE 12 not allocated a subcarrier is allocated no power on it, in step 30 the base station 11 can in fact calculate the reward value from the current allocated powers and the updated allocated powers alone, and the signal-to-interference-plus-noise ratio $\rho_{n,k,t}$ of the k-th UE 12 on the n-th subcarrier at the current time t can also be expressed as:

$\rho_{n,k,t} = \dfrac{P_T\,\nu_{n,k,t}\,|h_{n,k,t}|^2}{|h_{n,k,t}|^2\sum_{j=1}^{k-1} P_T\,\nu_{n,j,t} + \beta\,|h_{n,k,t}|^2\sum_{j=k+1}^{K} P_T\,\nu_{n,j,t} + \sigma^2}$.

In sub-step 302, the base station 11 calculates a second spectral efficiency $f_{t+1}$ according to the updated subcarrier allocation results and the updated allocated powers. The formula of the second spectral efficiency $f_{t+1}$ is the same as that of the first spectral efficiency $f_t$ and is therefore not repeated here.

In sub-step 303, the base station 11 calculates a reward value $r(s_t,a_t)$ according to the first spectral efficiency $f_t$ and the second spectral efficiency $f_{t+1}$, where $a_t$ is the target allocation action selected at the current time t.

It is worth noting that in this embodiment the reward value is the second spectral efficiency minus the first spectral efficiency, i.e., $r(s_t,a_t)=f_{t+1}-f_t$, but the invention is not limited thereto.
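The sketch below computes the summed rate and the reward r = f_{t+1} - f_t under one possible SIC interference model consistent with the definitions above: on each subcarrier the UEs are decoded in descending order of channel strength, interference from stronger (earlier) UEs is untouched, and interference from weaker (later) UEs is attenuated by the residual factor β after cancellation. The exact interference grouping and the function names are assumptions made for illustration.

```python
import numpy as np

def spectral_efficiency(M, V, H, P_T, B, beta, sigma2):
    """M, V, H: N x K arrays of allocation results, power coefficients and |h|^2;
    B: per-subcarrier bandwidths; returns the summed capacity over all UEs."""
    N, K = M.shape
    total = 0.0
    for n in range(N):
        ues = np.where(M[n] == 1)[0]
        order = ues[np.argsort(-H[n, ues])]            # decoding order
        for rank, k in enumerate(order):
            stronger = order[:rank]                     # not cancelled
            weaker = order[rank + 1:]                   # residual after SIC
            interference = H[n, k] * P_T * (V[n, stronger].sum()
                                            + beta * V[n, weaker].sum())
            sinr = P_T * V[n, k] * H[n, k] / (interference + sigma2)
            total += B[n] * np.log2(1.0 + sinr)
    return total

def reward(M_t, V_t, M_next, V_next, H, P_T, B, beta, sigma2):
    """Step 30: r(s_t, a_t) = f_{t+1} - f_t."""
    return (spectral_efficiency(M_next, V_next, H, P_T, B, beta, sigma2)
            - spectral_efficiency(M_t, V_t, H, P_T, B, beta, sigma2))
```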

In step 31, the base station 11 generates and stores a piece of training data including the current subcarrier allocation results, the current allocated powers, the target allocation action, the reward value, the updated subcarrier allocation results, and the updated allocated powers.
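A minimal sketch of the training-data store of step 31 as a bounded replay buffer; the tuple layout follows the text, while the buffer capacity and the batch size of 32 (sub-step 321) are assumed values for the example.

```python
from collections import deque
import random

replay_buffer = deque(maxlen=10_000)   # assumed capacity

def store_transition(M_t, V_t, action, reward_value, M_next, V_next):
    replay_buffer.append((M_t, V_t, action, reward_value, M_next, V_next))

def sample_batch(batch_size=32):
    # Sub-step 321: randomly pick the target training data from what is stored.
    return random.sample(list(replay_buffer), min(batch_size, len(replay_buffer)))
```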

In step 32, the base station 11 selects a plurality of pieces of target training data from the stored training data, trains the reinforcement learning networks according to the target training data, and repeats step 25.

It is worth mentioning that before repeating step 25, the base station 11 first takes the updated subcarrier allocation results and the updated allocated powers obtained in step 28 as the current subcarrier allocation results and the current allocated powers, respectively, and then repeats step 25.

Referring also to Fig. 6, step 32 includes sub-steps 321 to 323, which are described below.

In sub-step 321, the base station 11 selects the target training data from the stored training data.

It is worth noting that in this embodiment the base station 11, for example, randomly selects 32 pieces of target training data; at the beginning of the loop, because not enough training data has been stored, some of the 32 pieces of target training data will be empty, but the invention is not limited thereto.

In sub-step 322, the base station 11 inputs the current subcarrier allocation results, the current allocated powers, and the target allocation actions of the target training data into the action reinforcement learning network, so that the action reinforcement learning network outputs a plurality of training action values respectively corresponding to the target training data.

In sub-step 323, the base station 11 adjusts the reinforcement learning networks according to the target training data and the training action values.

It is worth noting that in this embodiment the base station 11 obtains a loss value from the reward values of the target training data and the training action values using the loss function, and updates the reinforcement learning networks using the learning algorithm according to the loss value, thereby adjusting the reinforcement learning networks. That is, for each piece of target training data, the base station 11 inputs the current subcarrier allocation results, the current allocated powers, and the target allocation action of that target training data into the update network so that the update network outputs $Q(s_t,a_t)$, then inputs the reward value, the updated subcarrier allocation results, and the updated allocated powers of that target training data into the target network so that the target network outputs $r(s_t,a_t)+\gamma\,\max Q(s_{t+1},a_{t+1})$, and takes the mean-square error between $r(s_t,a_t)+\gamma\,\max Q(s_{t+1},a_{t+1})$ and $Q(s_t,a_t)$ as the loss value, where $\gamma\in[0,1]$ is the discount factor that weighs the importance of the immediate reward against subsequent rewards, $Q(s_t,a_t)$ is the training action value corresponding to that target training data, and $\max Q(s_{t+1},a_{t+1})$ is the maximum action value obtainable by combining the updated subcarrier allocation results and updated allocated powers of that target training data with all of the subcarrier allocation actions and power allocation actions. The parameters of the update network are then updated according to the loss values of the target training data using the adaptive moment estimation method, and after several updates, for example 32, the parameters of the update network are copied to the target network to update the parameters of the target network, but the invention is not limited thereto; in other embodiments having only the update network, the parameters of the update network need not be copied to a target network.
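For illustration, one possible realisation of sub-steps 322 and 323 with the loss just described is sketched below in PyTorch. encode_state and action_index are assumed helpers that flatten (M, V) into the network's input vector and map an action to its output slot, and the gamma value is an assumed example; none of these are specified by the patent.

```python
import torch
import torch.nn.functional as F

def train_step(update_net, target_net, optimizer, batch, gamma=0.9):
    """batch entries are (M, V, action, reward, M_next, V_next) tuples."""
    states = torch.stack([encode_state(M, V) for M, V, _, _, _, _ in batch])
    next_states = torch.stack([encode_state(Mn, Vn) for _, _, _, _, Mn, Vn in batch])
    action_idx = torch.tensor([action_index(a) for _, _, a, _, _, _ in batch])
    rewards = torch.tensor([r for _, _, _, r, _, _ in batch], dtype=torch.float32)

    q_sa = update_net(states).gather(1, action_idx.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # target network is held fixed here
        target = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_sa, target)                         # the MSE loss described above

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                        # Adam update of the update network
    return loss.item()

# After several updates (e.g. every 32 steps, as in the embodiment), synchronise:
# target_net.load_state_dict(update_net.state_dict())
```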

It should be noted in particular that in other embodiments in which the reinforcement learning networks include, for example, the look-up table (Q table), the look-up table has a plurality of table action values, each corresponding to a subcarrier allocation result, a power allocation result, and an allocation action. In step 32, the base station 11 updates the look-up table according to the target training data so as to train the reinforcement learning networks. In detail, the base station 11 updates the look-up table according to the following formula:

$Q_{m+1}(s_i,a_i)=Q_m(s_i,a_i)+\alpha\left[r(s_i,a_i)+\gamma\,\max_{a'} Q(s_i',a')-Q_m(s_i,a_i)\right]$,

where $s_i$ denotes the subcarrier allocation set and allocated power set of the i-th piece of target training data, $a_i$ denotes the target allocation action of the i-th piece of target training data, $r(s_i,a_i)$ denotes the reward value of the i-th piece of target training data, $s_i'$ denotes the updated subcarrier allocation results and updated allocated power set of the i-th piece of target training data, $m$ denotes the number of times $Q(s_i,a_i)$ has been updated, $Q_m(s_i,a_i)$ is a target table action value in the look-up table corresponding to the subcarrier allocation results, allocated powers, and target allocation action of the i-th piece of target training data, $Q_{m+1}(s_i,a_i)$ denotes the value of that target table action value after the update, $\alpha$ denotes the learning rate of the update, and $\max_{a'} Q(s_i',a')$ denotes the maximum table action value in the look-up table obtainable by combining the updated subcarrier allocation results and updated allocated power set of the i-th piece of target training data with all of the subcarrier allocation actions and power allocation actions; $\max_{a'} Q(s_i',a')$ is computed by the target network among the reinforcement learning networks, and $Q_m(s_i,a_i)$ is computed by the update network among the reinforcement learning networks. Because the PDMA technique has many subcarrier allocation actions and power allocation actions, the Q table requires a large amount of storage space, so this embodiment fits the Q table with a Q network containing one hidden layer; that is, the input of the Q network corresponds to the state matrix in the Q table and the output of the Q network corresponds to the Q value of that state in the Q table. Because the number of parameters in the Q network is far smaller than the number of Q values in the Q table, the storage space of the base station is saved.
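For the table-based variant, the same update reads as a one-step tabular Q-learning rule; the dictionary representation and the α and γ values below are assumptions made for the sketch.

```python
def q_table_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q is a dict keyed by (state, action); states must be hashable (e.g. tuples).
    Implements Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
```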

In step 33, the base station 11 calculates a candidate spectral efficiency according to the current subcarrier allocation results and the current allocated powers, stores the current subcarrier allocation results, the current allocated powers, and the candidate spectral efficiency, and repeats step 22. It is worth noting that in this embodiment the loop counter is incremented by 1 each time step 33 is performed, but the invention is not limited thereto; in other embodiments the loop counter may also be incremented by 1 in step 23 or step 24.

In step 34, the base station 11 obtains the highest target spectral efficiency from the candidate spectral efficiencies, the loop counter is cleared, and step 22 is repeated, wherein the subcarrier allocation results and allocated powers corresponding to the target spectral efficiency are the best subcarrier allocation results and the best allocated powers.

In summary, the present invention is a resource allocation method for a downlink pattern division multiple access system based on an artificial intelligence algorithm, in which the base station 11 uses the reinforcement learning networks to record and learn in different scenarios in order to obtain the best allocation action with the largest reward value, further obtains the candidate spectral efficiencies, and then obtains the highest target spectral efficiency from the candidate spectral efficiencies, wherein the subcarrier allocation and power allocation corresponding to the target spectral efficiency are optimal; the object of the present invention is thus indeed achieved.

The foregoing, however, is merely an embodiment of the present invention and shall not be used to limit the scope of implementation of the present invention; all simple equivalent changes and modifications made according to the claims and the contents of the specification of the present invention remain within the scope covered by the patent of the present invention.

21~34: Steps

Claims (12)

1. A resource allocation method in a downlink pattern division multiple access system based on an artificial intelligence algorithm, implemented by a base station that is communicatively connected with K user equipments (UEs) through a wireless channel, the base station storing a plurality of subcarrier allocation actions, a plurality of power allocation actions, and a piece of channel state information including N×K channel strengths of the UEs on N subcarriers respectively, where K>1 and N>1, the method comprising the following steps:
(A) allocating the subcarriers to the UEs to obtain N×K current subcarrier allocation results indicating whether the UEs are allocated to the subcarriers;
(B) obtaining, according to the current subcarrier allocation results and the channel state information, N×K current allocated powers respectively corresponding to the current subcarrier allocation results;
(C) inputting the subcarrier allocation actions, the power allocation actions, the current subcarrier allocation results, and the current allocated powers into an action reinforcement learning network, so that the action reinforcement learning network outputs a plurality of action values respectively corresponding to the power allocation actions and the subcarrier allocation actions;
(D) determining whether the action values are all less than or equal to 0;
(E) when it is determined that one of the action values is greater than 0, selecting a target allocation action from the subcarrier allocation actions and the power allocation actions;
(F) obtaining, according to the current subcarrier allocation results, the current allocated powers, and the target allocation action, a plurality of updated subcarrier allocation results respectively corresponding to the current subcarrier allocation results and a plurality of updated allocated powers respectively corresponding to the current allocated powers;
(G) calculating a reward value according to the current allocated powers and the updated allocated powers;
(H) generating and storing a piece of training data including the current subcarrier allocation results, the current allocated powers, the target allocation action, the reward value, the updated subcarrier allocation results, and the updated allocated powers;
(I) selecting a plurality of pieces of target training data from the stored training data, and training at least one reinforcement learning network according to the target training data, the at least one reinforcement learning network including the action reinforcement learning network;
(J) taking the updated subcarrier allocation results and the updated allocated powers respectively as the current subcarrier allocation results and the current allocated powers, and repeating steps (C)~(I) until the action values are all less than or equal to 0;
(K) when it is determined that the action values are all less than or equal to 0, calculating a candidate spectral efficiency according to the current subcarrier allocation results and the current allocated powers, and storing the current subcarrier allocation results, the current allocated powers, and the candidate spectral efficiency;
(L) repeating steps (A)~(K) P times to obtain P candidate spectral efficiencies, where P>1; and
(M) obtaining a highest target spectral efficiency from the candidate spectral efficiencies.

2. The resource allocation method in a downlink pattern division multiple access system based on an artificial intelligence algorithm as claimed in Claim 1, wherein step (B) includes the following sub-steps:
(B-1) for each subcarrier, sorting the UEs from the largest to the smallest channel strength on the subcarrier according to the channel state information; and
(B-2) for each subcarrier, allocating power sequentially according to the order of the UEs allocated to the subcarrier, with no power allocated to the UEs not allocated to the subcarrier, so as to obtain the current allocated powers, the current allocated powers $\nu_{n,k,t}$ satisfying the following conditions:
$$\sum_{k=1}^{K}\nu_{n,k,t}\le 1,\qquad 0\le\nu_{n,k,t}\le 1,$$
if $m_{n,k,t}=m_{n,k',t}=1$ then $\nu_{n,k,t}>\nu_{n,k',t}$, and if $m_{n,k,t}=0$ then $\nu_{n,k,t}=0$,
where $n\in\{1,2,\ldots,N\}$, $k,k'\in\{1,2,\ldots,K\}$, $k>k'$, $m_{n,k,t}$ is the current subcarrier allocation result indicating whether the $k$-th UE is allocated to the $n$-th subcarrier at the current time $t$, $m_{n,k,t}\in\{0,1\}$, $m_{n,k,t}=1$ indicates that the $k$-th UE is allocated to the $n$-th subcarrier at the current time $t$, $m_{n,k,t}=0$ indicates that the $k$-th UE is not allocated to the $n$-th subcarrier at the current time $t$, $\nu_{n,k',t}$ is the coefficient of the current allocated power allocated at the current time $t$ to the UE of the $k'$-th order on the $n$-th subcarrier, and $\nu_{n,k,t}$ is the coefficient of the current allocated power allocated at the current time $t$ to the UE of the $k$-th order on the $n$-th subcarrier.
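The ordered power allocation of sub-steps (B-1) and (B-2) can be pictured with a short sketch. The Python fragment below is not part of the claims and not the disclosed implementation; the channel matrix `H`, the allocation matrix `m`, and the geometric splitting rule that makes the coefficients strictly increasing along the sorted order are all illustrative assumptions. Any splitting rule that keeps the coefficients within the claim-2 conditions would serve equally well.

```python
import numpy as np

def allocate_power(H, m):
    """Illustrative realization of sub-steps (B-1)/(B-2).

    H : (N, K) array of channel strengths of the K UEs on the N subcarriers
    m : (N, K) 0/1 array of subcarrier allocation results m_{n,k,t}
    Returns nu : (N, K) array of power coefficients satisfying the claim-2 conditions.
    """
    N, K = m.shape
    nu = np.zeros((N, K))
    for n in range(N):
        allocated = np.flatnonzero(m[n] == 1)
        if allocated.size == 0:
            continue
        # (B-1): sort the allocated UEs by channel strength, largest first
        order = allocated[np.argsort(-H[n, allocated])]
        # (B-2): later (weaker) UEs in the order receive strictly larger coefficients;
        # geometric weights are one simple way to enforce the strict ordering
        weights = 2.0 ** np.arange(1, order.size + 1)
        nu[n, order] = weights / weights.sum()   # per-subcarrier coefficients sum to 1 <= 1
    return nu
```

UEs not allocated to a subcarrier keep a zero coefficient, matching the condition that $m_{n,k,t}=0$ implies $\nu_{n,k,t}=0$.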
3. The resource allocation method in a downlink pattern division multiple access system based on an artificial intelligence algorithm as claimed in Claim 1, wherein step (F) includes the following sub-steps:
(F-1) determining whether the target allocation action is a subcarrier allocation action;
(F-2) when it is determined that the target allocation action is a subcarrier allocation action, obtaining, according to the target allocation action, N×K replacement subcarrier allocation results respectively corresponding to the current subcarrier allocation results;
(F-3) determining whether the replacement subcarrier allocation results satisfy a plurality of subcarrier allocation conditions;
(F-4) when it is determined that one of the subcarrier allocation conditions is not satisfied, taking the current subcarrier allocation results and the current allocated powers respectively as the updated subcarrier allocation results and the updated allocated powers; and
(F-5) when it is determined that the subcarrier allocation conditions are satisfied, taking the replacement subcarrier allocation results as the updated subcarrier allocation results, and obtaining the updated allocated powers according to the updated subcarrier allocation results and the channel state information.

4. The resource allocation method in a downlink pattern division multiple access system based on an artificial intelligence algorithm as claimed in Claim 3, wherein, in step (A), the current subcarrier allocation results $m_{n,k,t}$ satisfy the following conditions:
$$\sum_{n=1}^{N}m_{n,k,t}\ge 1,\qquad 1\le\sum_{k=1}^{K}m_{n,k,t}\le N_{\max},$$
where $m_{n,k,t}$ is the current subcarrier allocation result indicating whether the $k$-th UE is allocated to the $n$-th subcarrier at the current time $t$, $m_{n,k,t}\in\{0,1\}$, $n\in\{1,2,\ldots,N\}$, $k\in\{1,2,\ldots,K\}$, $m_{n,k,t}=1$ indicates that the $k$-th UE is allocated to the $n$-th subcarrier at the current time $t$, $m_{n,k,t}=0$ indicates that the $k$-th UE is not allocated to the $n$-th subcarrier at the current time $t$, and $N_{\max}$ is the maximum number of UEs on each subcarrier;
and wherein, in step (F-3), the subcarrier allocation conditions include:
$$\sum_{n=1}^{N}m_{n,k,t+1}\ge 1,\qquad 1\le\sum_{k=1}^{K}m_{n,k,t+1}\le N_{\max},$$
where $m_{n,k,t+1}$ is the replacement subcarrier allocation result indicating whether the $k$-th UE is allocated to the $n$-th subcarrier at the next time $t+1$, $m_{n,k,t+1}\in\{0,1\}$, $n\in\{1,2,\ldots,N\}$, $k\in\{1,2,\ldots,K\}$, $m_{n,k,t+1}=1$ indicates that the $k$-th UE is allocated to the $n$-th subcarrier at the next time $t+1$, $m_{n,k,t+1}=0$ indicates that the $k$-th UE is not allocated to the $n$-th subcarrier at the next time $t+1$, and $N_{\max}$ is the maximum number of UEs on each subcarrier.
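Read as a feasibility check on a binary N×K allocation matrix, the claim-4 conditions are easy to test mechanically. The sketch below is only an illustration and reflects the reconstruction used above, namely that the first condition requires every UE to occupy at least one subcarrier and the second bounds the number of UEs per subcarrier by N_max; the function name and array layout are assumptions.

```python
import numpy as np

def subcarrier_conditions_ok(m, n_max):
    """Check the claim-4 conditions on a candidate allocation matrix.

    m     : (N, K) 0/1 array of (current or replacement) subcarrier allocation results
    n_max : maximum number of UEs allowed on each subcarrier (N_max)
    """
    every_ue_served = np.all(m.sum(axis=0) >= 1)        # sum over n for each UE k
    load = m.sum(axis=1)                                # number of UEs on each subcarrier
    load_within_bounds = np.all((load >= 1) & (load <= n_max))
    return bool(every_ue_served and load_within_bounds)
```

In step (F-3) this check would be applied to the replacement allocation results $m_{n,k,t+1}$; a failed check falls back to the current allocation, as sub-step (F-4) prescribes.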
5. The resource allocation method in a downlink pattern division multiple access system based on an artificial intelligence algorithm as claimed in Claim 3, wherein the following sub-steps are included after sub-step (F-1):
(F-6) when it is determined that the target allocation action is not a subcarrier allocation action, performing the target allocation action on the current allocated powers to obtain a plurality of replacement allocated powers respectively corresponding to the current allocated powers;
(F-7) determining whether the replacement allocated powers satisfy a plurality of power allocation conditions;
(F-8) when it is determined that one of the power allocation conditions is not satisfied, taking the current subcarrier allocation results and the current allocated powers respectively as the updated subcarrier allocation results and the updated allocated powers; and
(F-9) when it is determined that the power allocation conditions are satisfied, taking the current subcarrier allocation results and the replacement allocated powers respectively as the updated subcarrier allocation results and the updated allocated powers.

6. The resource allocation method in a downlink pattern division multiple access system based on an artificial intelligence algorithm as claimed in Claim 5, wherein, in sub-step (F-7), the power allocation conditions include:
$$\sum_{k=1}^{K}\nu_{n,k,t+1}\le 1,\qquad 0\le\nu_{n,k,t+1}\le 1,$$
if $m_{n,k,t+1}=m_{n,k',t+1}=1$ then $\nu_{n,k',t+1}<\nu_{n,k,t+1}$, and if $m_{n,k,t+1}=0$ then $\nu_{n,k,t+1}=0$,
where $n\in\{1,2,\ldots,N\}$, $k,k'\in\{1,2,\ldots,K\}$, $k>k'$, $m_{n,k,t+1}$ is the subcarrier allocation result indicating whether the $k$-th UE is allocated to the $n$-th subcarrier at the next time $t+1$, $m_{n,k,t+1}\in\{0,1\}$, $m_{n,k,t+1}=1$ indicates that the $k$-th UE is allocated to the $n$-th subcarrier at the next time $t+1$, $m_{n,k,t+1}=0$ indicates that the $k$-th UE is not allocated to the $n$-th subcarrier at the next time $t+1$, $\nu_{n,k',t+1}$ is the coefficient of the replacement allocated power allocated at the next time $t+1$ to the UE of the $k'$-th order on the $n$-th subcarrier, and $\nu_{n,k,t+1}$ is the coefficient of the replacement allocated power allocated at the next time $t+1$ to the UE of the $k$-th order on the $n$-th subcarrier.
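The claim-6 power allocation conditions can be checked in the same mechanical way against a candidate power matrix. The sketch below is illustrative only: it assumes the coefficients are stored per UE index, recovers the claim-2 ordering (strongest channel first) from the channel strengths `H`, and uses a small tolerance `eps`; all names are placeholders rather than the disclosed implementation.

```python
import numpy as np

def power_conditions_ok(m, nu, H, eps=1e-9):
    """Check the claim-6 conditions used in sub-step (F-7).

    m  : (N, K) 0/1 array of subcarrier allocation results at the next time t+1
    nu : (N, K) array of candidate (replacement) power coefficients at t+1
    H  : (N, K) array of channel strengths, used only to recover the UE ordering
    """
    if np.any((nu < -eps) | (nu > 1 + eps)):        # 0 <= nu_{n,k,t+1} <= 1
        return False
    if np.any(nu.sum(axis=1) > 1 + eps):            # per-subcarrier coefficients sum to <= 1
        return False
    if np.any((m == 0) & (nu > eps)):               # no subcarrier => no power
        return False
    for n in range(m.shape[0]):
        allocated = np.flatnonzero(m[n] == 1)
        order = allocated[np.argsort(-H[n, allocated])]   # strongest channel first
        p = nu[n, order]
        if p.size > 1 and np.any(np.diff(p) <= 0):        # weaker UEs need strictly more power
            return False
    return True
```

As with the subcarrier check, sub-steps (F-8)/(F-9) decide whether the replacement coefficients are adopted or the current ones are kept.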
7. The resource allocation method in a downlink pattern division multiple access system based on an artificial intelligence algorithm as claimed in Claim 1, further comprising the following step between step (F) and step (G):
(M) determining whether an overload rate, which relates to the number of UEs currently communicatively connected with the base station and the number of subcarriers onto which the signals of the UEs are superimposed, is K/N; and, when it is determined that the overload rate is K/N, proceeding to step (G).

8. The resource allocation method in a downlink pattern division multiple access system based on an artificial intelligence algorithm as claimed in Claim 7, further comprising, before step (A), the following step:
(N) initializing a plurality of reinforcement learning networks;
and further comprising, after step (M), the following steps:
(O) when it is determined that the overload rate is not K/N, storing a piece of historical reinforcement learning network information that includes the reinforcement learning networks and corresponds to the overload rate K/N, and determining whether a piece of target historical reinforcement learning network information corresponding to an overload rate K'/N is stored, where K'>1 and K'≠K; and
(P) when it is determined that the target historical reinforcement learning network information is stored, loading the target historical reinforcement learning network information and repeating steps (A)~(F) and (M); and, when it is determined that the target historical reinforcement learning network information is not stored, repeating steps (N), (A)~(F), and (M).

9. The resource allocation method in a downlink pattern division multiple access system based on an artificial intelligence algorithm as claimed in Claim 1, wherein, in step (E), the probability that the target allocation action is randomly selected is $P_1$, the probability that the action value corresponding to the target allocation action is the highest among the action values is $P_2$, $P_1+P_2=1$, and $P_1<P_2$.

10. The resource allocation method in a downlink pattern division multiple access system based on an artificial intelligence algorithm as claimed in Claim 1, wherein step (G) includes the following sub-steps:
(G-1) calculating a first spectral efficiency according to the current allocated powers;
(G-2) calculating a second spectral efficiency according to the updated allocated powers; and
(G-3) calculating the reward value according to the first spectral efficiency and the second spectral efficiency.
11. The resource allocation method in a downlink pattern division multiple access system based on an artificial intelligence algorithm as claimed in Claim 1, wherein step (I) includes the following sub-steps:
(I-1) selecting the target training data from the stored training data;
(I-2) inputting the subcarrier allocation results, the allocated powers, and the target allocation actions of the target training data into the action target reinforcement learning network, so that the action target reinforcement learning network outputs a plurality of training action values respectively corresponding to the target training data; and
(I-3) adjusting the at least one reinforcement learning network according to the target training data and the training action values.

12. The resource allocation method in a downlink pattern division multiple access system based on an artificial intelligence algorithm as claimed in Claim 11, wherein, in step (I-3), a loss value is obtained by using a loss function according to the reward values of the target training data and the training action values, and the at least one reinforcement learning network is updated by using a learning algorithm according to the loss value, so as to adjust the at least one reinforcement learning network.
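Steps (D) and (E) of Claim 1, combined with the selection rule of Claim 9, behave like an ε-greedy choice over the action values with an early-stop test. The sketch below is a minimal illustration under that reading; `p_random` plays the role of P1 (so the exploiting branch has probability P2 = 1 − P1 > P1), and the function name and return convention are assumptions, not part of the claims.

```python
import numpy as np

def select_target_action(action_values, p_random, rng=None):
    """Step (D)/(E) sketch with the Claim-9 selection rule.

    action_values : iterable of action values output by the action reinforcement
                    learning network for the subcarrier and power allocation actions
    p_random      : probability P1 of picking a random action (P1 < P2 = 1 - P1)
    Returns the index of the target allocation action, or None when every action
    value is <= 0 (the step-(D) condition that ends the refinement loop).
    """
    rng = np.random.default_rng() if rng is None else rng
    action_values = np.asarray(action_values, dtype=float)
    if np.all(action_values <= 0):            # step (D): stop and go to step (K)
        return None
    if rng.random() < p_random:               # exploration branch, probability P1
        return int(rng.integers(action_values.size))
    return int(np.argmax(action_values))      # exploitation branch, probability P2
```

A `None` return corresponds to the branch of Claim 1 in which the episode's current allocation is scored as a candidate spectral efficiency in step (K); any positive action value keeps the loop of steps (C)~(I) running.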

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW111128417A TWI812371B (en) 2022-07-28 2022-07-28 Resource allocation method in downlink pattern division multiple access system based on artificial intelligence


Publications (2)

Publication Number Publication Date
TWI812371B true TWI812371B (en) 2023-08-11
TW202406390A TW202406390A (en) 2024-02-01

Family

ID=88585914

Family Applications (1)

Application Number Title Priority Date Filing Date
TW111128417A TWI812371B (en) 2022-07-28 2022-07-28 Resource allocation method in downlink pattern division multiple access system based on artificial intelligence

Country Status (1)

Country Link
TW (1) TWI812371B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018126972A1 (en) * 2017-01-04 2018-07-12 华为技术有限公司 Communication method, and terminal device and network device implementing same
CN111212438A (en) * 2020-02-24 2020-05-29 西北工业大学 Resource allocation method of wireless energy-carrying communication technology
US11115160B2 (en) * 2019-05-26 2021-09-07 Genghiscomm Holdings, LLC Non-orthogonal multiple access
US20210281460A1 (en) * 2018-11-19 2021-09-09 Huawei Technologies Co., Ltd. Data Transmission Method And Apparatus
CN113923767A (en) * 2021-09-23 2022-01-11 重庆邮电大学 Energy efficiency maximization method for multi-carrier cooperation non-orthogonal multiple access system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
S. Prabha Kumaresan, Chee Keong Tan, Yin Hoe Ng, "Deep Neural Network (DNN) for Efficient User Clustering and Power Allocation in Downlink Non-Orthogonal Multiple Access (NOMA) 5G Networks," Symmetry 13(8):1507, August 2021. *

Also Published As

Publication number Publication date
TW202406390A (en) 2024-02-01

Similar Documents

Publication Publication Date Title
Han et al. Optimal power allocation for SCMA downlink systems based on maximum capacity
Hieu et al. Optimal power allocation for rate splitting communications with deep reinforcement learning
Yin et al. An efficient multiuser loading algorithm for OFDM-based broadband wireless systems
JP5496203B2 (en) Method for changing transmit power pattern in a multi-cell environment
US8174959B2 (en) Auction based resource allocation in wireless systems
CN103139924B (en) A kind of method and device of scheduling resource
CN103095398B (en) Method and user equipment and base station for transmission and control information
WO2015144094A1 (en) Multi-user, multiple access, systems, methods, and devices
CN101557611B (en) Downlink multiuser selection method used for multi-aerial system
CN103220024B (en) A kind of multi-user matches the beam form-endowing method of virtual MIMO system
JP5600795B2 (en) Multi-user MIMO system, base station, user equipment, and CQI feedback method
US8964867B2 (en) LTE scheduling
CN102316597A (en) Resource scheduling method and device for multiple input multiple output (MIMO) system
JP4971173B2 (en) Communications system
Wu et al. Joint user grouping and resource allocation for multi-user dual layer beamforming in LTE-A
EP2499870A1 (en) An improved method and apparatus for co-scheduling transmissions in a wireless network
Lin et al. Fdof: Enhancing channel utilization for 802.11 ac
WO2009101838A1 (en) Multi-carrier communication base station device and sub carrier allocation method
WO2018157365A1 (en) Method and apparatus for user equipment and base station used for power regulation
Gong et al. Dynamic user scheduling with user satisfaction rate in cell-free massive mimo
TWI812371B (en) Resource allocation method in downlink pattern division multiple access system based on artificial intelligence
CN104901732A (en) Pilot frequency multiplexing method in dense node configuration system
Zehavi et al. Weighted max-min resource allocation for frequency selective channels
Huang et al. System level simulation for 5G ultra-reliable low-latency communication
TWI830235B (en) Resource allocation method in downlink multi-user superposition transmission based on artificial intelligence