CN114615744A - Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method - Google Patents
- Publication number: CN114615744A
- Application number: CN202210185185.2A
- Authority: CN (China)
- Prior art keywords: base station, network, time, resource, student
- Prior art date: 2022-02-28
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- H — ELECTRICITY
  - H04 — ELECTRIC COMMUNICATION TECHNIQUE
    - H04W — WIRELESS COMMUNICATION NETWORKS
      - H04W 4/00 — Services specially adapted for wireless communication networks; Facilities therefor
        - H04W 4/70 — Services for machine-to-machine communication [M2M] or machine type communication [MTC]
      - H04W 72/00 — Local resource management
        - H04W 72/50 — Allocation or scheduling criteria for wireless resources
          - H04W 72/53 — Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
          - H04W 72/54 — Allocation or scheduling criteria for wireless resources based on quality criteria
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Mobile Radio Communication Systems (AREA)
Abstract
The invention discloses a knowledge-migration reinforcement learning method for collaborative optimization of network slice communication-sensing-computing resources. According to the differentiated service requirements of mobile edge network slice users, and under constraints on slice communication, sensing and computing resources, user equipment delay, energy consumption and the like, a collaborative optimization problem model is established with the maximization of the total throughput of the user equipment as the optimization objective. On this basis, the optimization problem is modeled as a multi-agent stochastic game, and a knowledge-transfer reinforcement learning algorithm among the agents is studied, improving the exploration efficiency and scalability of the collaborative optimization strategy and realizing collaborative optimization of network slice communication-sensing-computing resources under diversified service scenarios.
Description
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a knowledge migration reinforcement learning network slice general sensing calculation resource collaborative optimization method.
Background
With the rapid development of 5G mobile communication and the Internet of Things, massive IoT devices and ubiquitous connectivity requirements are driving mobile communication toward intelligent evolution. As a new network paradigm, the mobile edge network migrates communication and computing resources to the network edge, significantly reducing transmission energy consumption, network congestion and processing delay, and promoting the deep fusion of communication, sensing and computation.
In a mobile edge network, network slices meet diversified low-delay, high-reliability requirements by sharing communication, sensing, computing and infrastructure resources. Existing network slice resource allocation methods can be roughly grouped along two dimensions. The first is the type of resource optimized: compared with single-resource optimization strategies, multi-dimensional resource collaborative optimization is more complex, is mainly centralized, and incurs more communication and control overhead. The second is the allocation mode: compared with static allocation, dynamic allocation strategies adapt to the network environment and can realize real-time optimization of resources. In a practical mobile edge network, slice performance depends on several types of resources whose complex interdependence is difficult to capture with an accurate mathematical model; high-dimensional dynamic network states, such as the randomness of wireless channel states and the time-varying traffic of slice users, further restrict improvement of slice service quality.
Reinforcement learning, as a model-free method with powerful decision-making capability in high-dimensional spaces, is considered one of the promising solutions to the above problems. However, existing work pays little attention to the safe and efficient optimization of multi-dimensional resources under communication-sensing-computing fusion.
Disclosure of Invention
In order to overcome the problem that high-dimensional dynamic network states restrict the service quality of network slices, the invention aims to provide a safe and efficient method for optimizing multi-dimensional network slice resources under communication-sensing-computing fusion.
In order to achieve this purpose, the invention adopts the following technical scheme: a knowledge-migration reinforcement learning method for collaborative optimization of network slice communication-sensing-computing resources, characterized by comprising the following steps:
Step 1, constructing a network slice communication-sensing-computing fusion resource collaborative optimization model, specifically as follows:
Assume that M base stations share K resource blocks and F computing resources, and that the access of N edge sensing devices can be supported. At time t, the ith base station owns $X_i^l(t)$ resource blocks, $Y_i^l(t)$ computing resources and $Z_i^l(t)$ Edge Devices (EDs), where $l \in \{e, u, m\}$: $l = e$ denotes the enhanced Mobile BroadBand (eMBB) slice type, $l = u$ denotes the massive Machine-Type Communication (mMTC) slice type, and $l = m$ denotes the ultra-Reliable Low-Latency Communication (uRLLC) slice type.
At time t, define the binary resource block allocation variable with which the ith base station serves the jth edge device as $x_{i,j}^k(t)$. If $x_{i,j}^k(t) = 1$, base station i allocates the kth resource block to ED j; if $x_{i,j}^k(t) = 0$, base station i does not allocate the kth resource block to ED j. Considering that each resource block is allocated to at most one edge device, $\sum_{j=1}^{N} x_{i,j}^k(t) \le 1$.
Binary computation for defining that the ith base station accesses the jth edge device at the time tResource allocation variableIf it isIndicating that the base station i distributes the calculation resource f for the EDj; if it isIt indicates that the base station i allocates the calculation resource f for EDj. Consider that each computing resource is allocated to at most one edge device, of
At time t, consider the performance differences of the three network slice types $l \in \{e, u, m\}$: the eMBB slice focuses on the sum of the throughputs of all EDs, $R^e(t)$; the uRLLC slice emphasizes the sum of the delays of all EDs, $T^m(t)$; and, since most mMTC slice devices are dormant at any given time, the mMTC slice only concerns the sum of the throughputs of all EDs, $R^u(t)$. Therefore, in order to balance the differentiated slice requirements, under the constraints of communication resources, sensing resources, computing resources, total user delay, total energy consumption and the like, the sum of the throughputs of all users is maximized as the optimization objective, and the network slice communication-sensing-computing fusion multi-dimensional resource collaborative optimization model of step 1 is as follows:
$$\max_{\{X_i^l(t),\,Y_i^l(t),\,Z_i^l(t),\,x_{i,j}^k(t),\,y_{i,j}^f(t)\}} R(t)$$
$$\text{s.t.}\quad \mathrm{C1}: \sum_{l}\sum_{i=1}^{M} X_i^l(t) \le K,\quad \mathrm{C2}: \sum_{l}\sum_{i=1}^{M} Z_i^l(t) \le N,\quad \mathrm{C3}: \sum_{l}\sum_{i=1}^{M} Y_i^l(t) \le F,$$
$$\mathrm{C4}: T(t) \le T,\quad \mathrm{C5}: E(t) \le E$$
The constraints C1, C2, C3, C4 and C5 bound, respectively, the communication resources K, the sensing resources N, the computing resources F, the total user delay T and the total energy consumption E. $X_i^l(t)$ denotes the number of resource blocks owned by base station i at time t, $Y_i^l(t)$ the computing resources owned by base station i at time t, and $Z_i^l(t)$ the edge sensing devices owned by base station i at time t, where $l \in \{e, u, m\}$ ($l = e$: eMBB slice type, $l = u$: mMTC slice type, $l = m$: uRLLC slice type). $x_{i,j}^k(t)$ and $y_{i,j}^f(t)$ denote, respectively, the binary variables with which the ith base station allocates the kth resource block and the computing resource f to the jth edge device at time t. $R(t)$ denotes the sum of the throughputs of all EDs at time t, $T(t)$ the sum of the delays of all EDs at time t, and $E(t)$ the sum of the energy consumption of all EDs at time t; M is the total number of base stations.
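For illustration only, the following minimal Python sketch checks the C1-C5 budgets and the at-most-one-ED allocation conditions of the model above and evaluates a throughput proxy as the objective $R(t)$; the array shapes, budget values and log-rate proxy are assumptions, not part of the patent.

```python
import numpy as np

M, N, K, F = 5, 75, 100, 100      # base stations, edge devices, resource blocks, computing units
T_MAX, E_MAX = 50.0, 200.0        # C4/C5 delay and energy budgets (assumed units)

rng = np.random.default_rng(0)
x = np.zeros((M, N, K), dtype=int)   # x[i, j, k] = 1: BS i allocates block k to ED j
y = np.zeros((M, N, F), dtype=int)   # y[i, j, f] = 1: BS i allocates computing unit f to ED j
for k in range(K):                   # each resource block goes to at most one (BS, ED) pair
    x[rng.integers(M), rng.integers(N), k] = 1
for f in range(F):
    y[rng.integers(M), rng.integers(N), f] = 1

def feasible(x, y, delay_sum, energy_sum):
    """C1-C3 resource budgets, C4/C5 delay and energy limits, uniqueness conditions."""
    block_once = (x.sum(axis=(0, 1)) <= 1).all()   # sum_j x_{i,j}^k(t) <= 1
    unit_once = (y.sum(axis=(0, 1)) <= 1).all()    # sum_j y_{i,j}^f(t) <= 1
    c1 = x.sum() <= K                              # resource blocks in use
    c3 = y.sum() <= F                              # computing units in use
    c2 = (x.any(axis=(0, 2)) | y.any(axis=(0, 2))).sum() <= N   # EDs admitted
    return bool(block_once and unit_once and c1 and c2 and c3
                and delay_sum <= T_MAX and energy_sum <= E_MAX)

snr = rng.uniform(1.0, 10.0, size=(M, N))
rate = np.log2(1.0 + snr)                          # per-block spectral-efficiency proxy
R_t = (x.sum(axis=2) * rate).sum()                 # objective: total ED throughput R(t)
print(feasible(x, y, delay_sum=42.0, energy_sum=180.0), round(float(R_t), 2))
```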
Step 2, according to the network slice communication-sensing-computing fusion resource collaborative optimization model of step 1, the number of resource blocks $X_i^l(t)$, the computing resources $Y_i^l(t)$ and the number of edge devices $Z_i^l(t)$ owned by base station i at time t, together with the binary resource block allocation variables $x_{i,j}^k(t)$ and the binary computing resource allocation variables $y_{i,j}^f(t)$, are optimized and solved by a knowledge-migration-based multi-agent reinforcement learning method, obtaining the optimized sum of the throughputs of all EDs, $R(t)$.
Step 2.1, modeling the multi-agent random game process: the optimization problem is modeled into a multi-agent random game process, and each base station is equivalent to an agent.
The state of each base station is defined as:
wherein,indicating the state of the ith base station at time t,and representing user perception calculation tasks in three network slices when the ith base station is accessed to the jth edge device at the moment t.
The action of each base station is defined as:
wherein,showing the action of the ith base station at the time t, including the number of three network slice resource blocks owned by the base station iCounting of resourcesNumber of edge sensing devicesAnd the user resource block allocation strategy when the ith base station allocates the kth resource block for the jth edge deviceAnd a computing resource allocation policy when allocating computing resources f
The reward function of each base station is defined as:
wherein,and (4) representing the reward function of the ith base station at the time t, and reflecting the sum of the throughputs of all EDs in the ith base station.
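As an illustrative sketch (not the patent's interface), the stochastic-game quantities above can be carried by simple data structures; the field names and encodings below are assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BSState:                      # s_i^t: per-slice user sensing-computation tasks
    tasks: dict                     # {"e" | "u" | "m": np.ndarray of task sizes per ED j}

@dataclass
class BSAction:                     # a_i^t
    blocks: dict                    # X_i^l(t): resource blocks owned per slice l
    compute: dict                   # Y_i^l(t): computing resources owned per slice l
    devices: dict                   # Z_i^l(t): admitted edge devices per slice l
    x: np.ndarray                   # x[j, k]: binary resource-block allocation
    y: np.ndarray                   # y[j, f]: binary computing-resource allocation

def reward(throughput_per_ed: np.ndarray) -> float:
    """r_i^t: sum of throughputs of all EDs served by base station i."""
    return float(throughput_per_ed.sum())

print(reward(np.array([1.2, 0.8, 2.1])))   # toy per-ED throughputs
```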
Step 2.2, modeling according to the multi-agent random game process in the step 2.1, and designing a multi-agent knowledge migration reinforcement learning model, as shown in fig. 1, wherein a base station i observes a state from a network environment at a time tUnder the framework of the Actor-Critic algorithm, an Actor network takes actionSelecting a student or self-learning behavior mode before, then training the network model in each behavior mode, and updating the network parameters, thereby obtaining the optimal user resource and computing resource allocation strategy.
At time t, base station i uses long-short term memory network unit biWill continue for z statesAnd actionsWait for historical knowledgeAs its hidden state
The student mode is based on a depth determination strategy gradient model, and the Actor network outputs the student modeProbability P'ssWhen the probability exceeds a threshold G, i.e. P'ssIf the number of the base stations is more than G, the base station i selects a student mode and sends suggestion requests to other base stations; conversely, when P'ssAnd when G is less than or equal to G, the base station i selects a self-learning mode.
Designing a multi-head attention mechanism model, wherein each base station takes other base stations as teacher base stations and receives the state of the teacher base station n (n is more than or equal to 1 and less than or equal to M) at time tAnd actionsWaiting for history informationAnd a policy network parameter θn. Learning parameter P in consideration of multi-head attention mechanism1、P2And P3The assigned attention weight can be obtained:
$$W_n = \mathrm{softmax}\!\left(\frac{(P_1 h_i^t)(P_2 e_n^t)^{\top}}{\sqrt{D}}\right)$$
where D is the dimension of the historical information vector of teacher base station n. The final strategy proposal is the linearly transformed weighted sum:
$$\mu_s = P_s \sum_{n} W_n \,(P_3 \theta_n)$$
where $P_s$ is the learnable parameter for decoding the strategy parameters.
Thus, the student base station i utilizes the hidden stateThe motion at this time is obtained using parameters from a multi-head attention mechanism model:
The gain in the learning performance of the base station brought by the student mode is defined here as the student reward $r_{s,i}^t$. The student Actor-Critic network is trained using the trained attention selection model. The student Critic network parameters $\phi_i$ of base station i are updated by minimizing the student loss function $L(\phi_i)$:
Wherein,is given by the parameterThe student goal Critic network of (1),andrespectively representing the hidden state at the current time t and the hidden state at the next time t,andrespectively representing the student strategy of the current time t base station i and the student strategy of the next time t base station i,andrepresenting the state-behavior value functions, E [. cndot. ] of the student Critic network and the student target Critic network, respectively]In the interest of expectation,γ is the discount factor for the student reward function.
Student Actor network passing strategy with mu parameterA policy gradient update is performed as follows:
In the self-learning mode, if the student Actor of base station i selects the self-learning mode, it sends the hidden state $h_i^t$ of base station i to the self-learning network module, in which each base station independently makes resource optimization action decisions using a Deep Q-Network (DQN) method.
The DQN algorithm framework consists of a current value network and a target value network, wherein the current value network is used with weightsState-action value ofFunction to approximate an optimal state-action value function, whereinFor the hidden state of base station i at time t,is as followsAn action of previous value network generation; targeted value network usage with weightingState-action value ofFunction to improve the performance of the whole network, whereinFor the hidden state of base station i at the next time t,an action generated for the target value network. After a certain number of rounds, the weights of the current value network are copiedTo update the weights of the target value networkWeighting of current value network by gradient descent methodThe update is performed to obtain the minimum loss function:
Meanwhile, in order to reduce the correlation of the empirical data, the algorithm adopts an empirical playback strategy. In a hidden stateNext, base station i acts by performingEarning rewardsThen hide the stateTransition to the hidden state at the next instant tThe deep neural network transfers the state to informationStored in the empirical playback memory B. In the learning process, the mini-batch transfer information sample is randomly extracted from the experience playback memory BTo train the neural network. By continuously reducing the correlation among training samples, the base station can be helped to learn and train better so as to avoid the problem that the optimal strategy falls into the local minimum. In addition, neural networks often overfit some empirical data, transferring information samples by randomly drawing mini-batchThe problem of overfitting can be effectively reduced.
The method has the advantages that the constructed network slice general sensing calculation fusion resource collaborative optimization model is complete, and the base station finds the optimal allocation strategy of user resources and calculation resources through the continuous interaction between the base station and the environment due to the unknown network environment of the model, so that the method has high practical application value; the transfer learning and the deep reinforcement learning are combined, the existing knowledge domain data of a trained base station can be reused, existing large amount of work is reserved, data are transferred and applied to the training process of other base stations rapidly, the timeliness advantage is embodied, the decision efficiency of the deep reinforcement learning in a high-dimensional space is improved, the learning capacity and the generalization capacity of the base stations are also improved effectively, the complexity and the sparseness when resources are allocated for operating the base stations manually in an uncertain environment are avoided, and the base stations can complete resource collaborative optimization allocation more safely and efficiently.
Drawings
FIG. 1: multi-agent knowledge migration reinforcement learning framework
Detailed Description
The present invention will be described in further detail below with reference to an example, to facilitate understanding and practice of the invention by those of ordinary skill in the art; it is to be understood that the embodiment described here is illustrative and the invention is not to be construed as limited thereto.
Aiming at the differentiated service requirements of mobile edge network slice users, on the basis of considering the constraint limits of network slice general sensing resources, user equipment time delay, energy consumption and the like, a network slice general sensing resource collaborative optimization problem model is established by taking the total throughput of the user equipment as an optimization target to be maximized. On the basis, the optimization problem is modeled into a multi-agent random game process, a knowledge transfer reinforcement learning algorithm among the multi-agents is researched, the exploration efficiency and expandability of a collaborative optimization strategy are improved, and collaborative optimization of network slicing general-purpose computing resources under diversified service scenes is realized.
Step 1, constructing a network slice general sensing calculation fusion resource collaborative optimization model;
step 1, constructing a network slice general sensing calculation fusion resource collaborative optimization model, specifically as follows:
Assume that M = 5 base stations share K = 100 resource blocks and F = 100 computing resources, and that the access of N = 75 edge sensing devices can be supported. At time t, the ith base station owns $X_i^l(t)$ resource blocks, $Y_i^l(t)$ computing resources and $Z_i^l(t)$ Edge Devices (EDs), where $l \in \{e, u, m\}$: $l = e$ denotes the enhanced Mobile BroadBand (eMBB) slice type, $l = u$ denotes the massive Machine-Type Communication (mMTC) slice type, and $l = m$ denotes the ultra-Reliable Low-Latency Communication (uRLLC) slice type.
At time t, defining the binary resource block allocation variable of the ith base station accessing the jth edge device asIf it isIndicating that the base station i allocates the kth resource block for the EDj; if it isIt indicates that the base station i does not allocate the kth resource block for EDj. Consider that each resource block is allocated to at most one edge device, with
Defining a binary computing resource allocation variable for an ith base station to access a jth edge device at time tIf it isIndicating that the base station i distributes the calculation resource f for the EDj; if it isIt indicates that the base station i allocates the calculation resource f for EDj. Consider that each computing resource is allocated to at most one edge device, there
At time t, consider l ∈ [ e, u, m]Performance differences of three network slice types, sum of throughput of all EDs of attention by eMBB sliceuRLLC section emphasizes the sum of time delays of all EDsConsidering that most mMTC slice devices are in a dormant state at the same time, mMTC slices only pay attention to the sum of throughputs of all EDsThen, in order to balance the slice differentiation requirements, under the restrictions of communication resources, sensing resources, computing resources, total user time delay, total energy consumption, and the like, the sum of the throughputs of all users is maximized as an optimization target, and the network slice general sensing fusion multidimensional resource collaborative optimization model in step 1 is as follows:
$$\max_{\{X_i^l(t),\,Y_i^l(t),\,Z_i^l(t),\,x_{i,j}^k(t),\,y_{i,j}^f(t)\}} R(t)$$
$$\text{s.t.}\quad \mathrm{C1}: \sum_{l}\sum_{i=1}^{M} X_i^l(t) \le K,\quad \mathrm{C2}: \sum_{l}\sum_{i=1}^{M} Z_i^l(t) \le N,\quad \mathrm{C3}: \sum_{l}\sum_{i=1}^{M} Y_i^l(t) \le F,$$
$$\mathrm{C4}: T(t) \le T,\quad \mathrm{C5}: E(t) \le E$$
The constraints C1, C2, C3, C4 and C5 bound, respectively, the communication resources K, the sensing resources N, the computing resources F, the total user delay T and the total energy consumption E; the total user delay T and the total energy consumption E are determined according to actual engineering conditions. $X_i^l(t)$ denotes the number of resource blocks owned by base station i at time t, $Y_i^l(t)$ the computing resources owned by base station i at time t, and $Z_i^l(t)$ the edge sensing devices owned by base station i at time t, where $l \in \{e, u, m\}$ ($l = e$: eMBB slice type, $l = u$: mMTC slice type, $l = m$: uRLLC slice type). $x_{i,j}^k(t)$ and $y_{i,j}^f(t)$ denote, respectively, the binary variables with which the ith base station allocates the kth resource block and the computing resource f to the jth edge device at time t. $R(t)$ denotes the sum of the throughputs of all EDs at time t, $T(t)$ the sum of the delays of all EDs at time t, and $E(t)$ the sum of the energy consumption of all EDs at time t; M is the total number of base stations.
Step 2, the network slice general sensing calculation fusion according to the step 1A resource collaborative optimization model, which is used for optimizing the number of resource blocks owned by a base station i at the time t by a knowledge migration-based multi-agent deep reinforcement learning optimization methodComputing resourcesAnd number of edge devicesAnd binary resource block allocation variableAnd binary computing resource allocation variablesCarrying out optimization solution to obtain the sum of all optimized EDs throughput
Step 2.1, modeling the multi-agent random game process: the optimization problem is modeled into a multi-agent random game process, and each base station is equivalent to an agent.
The state of each base station is defined as:
wherein,indicating the state of the ith base station at time t,and representing user perception calculation tasks in three network slices when the ith base station is accessed to the jth edge device at the moment t.
The action of each base station is defined as:
wherein,showing the action of the ith base station at the time t, including the number of three network slice resource blocks owned by the base station iCounting of resourcesNumber of edge sensing devicesAnd the user resource block allocation strategy when the ith base station allocates the kth resource block for the jth edge deviceAnd a computing resource allocation policy when allocating computing resources f
The reward function of each base station is defined as:
wherein,and (4) representing the reward function of the ith base station at the time t, and reflecting the sum of the throughputs of all EDs in the ith base station.
Step 2.2, modeling according to the multi-agent random game process in the step 2.1, and designing a multi-agent knowledge migration reinforcement learning model, as shown in the figure1, base station i observes the state from the network environment at time tUnder the framework of the Actor-Critic algorithm, an Actor network takes actionSelecting a student or self-learning behavior mode before, then training the network model in each behavior mode, and updating the network parameters, thereby obtaining the optimal user resource and computing resource allocation strategy.
At time t, base station i uses long-short term memory network unit biWill continue for z statesAnd actionsWait for historical knowledgeAs its hidden stateIn actual engineering, too few z state settings can lead to insufficient data volume of network training, and too many settings can lead to low system training efficiency.
The student mode is based on a depth determination strategy gradient model, and the Actor network outputs a probability P 'of selecting the student mode'ssWhen the probability exceeds a threshold G, i.e. P'ssIf the number of the base stations is more than G, the base station i selects a student mode and sends suggestion requests to other base stations; conversely, when P'ssWhen G is less than or equal to G, the base station i selects a self-learning mode; when the threshold value G is less than 0.5, the base station i is more inclined to select the student mode; on the contrary, when the threshold value G is greater than 0.5, the base station i is more inclined to select the self-learning mode.
Designing a multi-head attention mechanism model, wherein each base station takes other base stations as teacher base stations and receives the state of the teacher base station n (n is more than or equal to 1 and less than or equal to M) at time tAnd actionsWaiting for history informationAnd a policy network parameter θn. Considering a multi-head attention mechanism learning parameter P1、P2And P3An assigned attention weight can be obtained:
$$W_n = \mathrm{softmax}\!\left(\frac{(P_1 h_i^t)(P_2 e_n^t)^{\top}}{\sqrt{D}}\right)$$
where D is the dimension of the historical information vector of teacher base station n. The final strategy proposal is the linearly transformed weighted sum:
$$\mu_s = P_s \sum_{n} W_n \,(P_3 \theta_n)$$
where $P_s$ is the learnable parameter for decoding the strategy parameters.
Thus, the student base station i utilizes the hidden stateThe motion at this time is obtained using parameters from a multi-head attention mechanism model:
The gain in the learning performance of the base station brought by the student mode is defined here as the student reward $r_{s,i}^t$. The student Actor-Critic network is trained using the trained attention selection model. The student Critic network parameters $\phi_i$ of base station i are updated by minimizing the student loss function $L(\phi_i)$:
Wherein,is given by the parameterThe student goal Critic network of (1),andrespectively representing the hidden state at the current time t and the hidden state at the next time t,andrespectively representing the student strategy of the current time t base station i and the student strategy of the next time t base station i,andrepresenting the state-behavior value functions, E [. cndot. ] of the student Critic network and the student target Critic network, respectively]In the interest of expectation,γ is the discount factor for the student reward function.
Student Actor network passing strategy with mu parameterA policy gradient update is performed as follows:
In the self-learning mode, if the student Actor of base station i selects the self-learning mode, it sends the hidden state $h_i^t$ of base station i to the self-learning network module, in which each base station independently makes resource optimization action decisions using a Deep Q-Network (DQN) method.
The DQN algorithm framework consists of a current value network and a target value network, wherein the current value network is used with weightsState-action value ofFunction to approximate an optimal state-action value function, whereinFor the hidden state of base station i at time t,an action generated for the nonce network; targeted value network usage with weightingState-action value ofFunction to improve the performance of the whole network, whereinFor the hidden state of base station i at the next time t,an action generated for the target value network. After a certain number of rounds, the weights of the current value network are copiedTo update the weights of the target value networkWeighting of current value network by gradient descent methodThe update is performed to obtain the minimum loss function:
Meanwhile, in order to reduce the correlation of the empirical data, the algorithm adopts an empirical playback strategy. In a hidden stateNext, base station i acts by performingEarning rewardsThen hide the stateTransition to the hidden state at the next instant tThe deep neural network transfers the state to informationStored in the empirical replay memory B. In the learning process, the mini-batch transfer information sample is randomly extracted from the experience playback memory BTo train the neural network. By continuously reducing the correlation among training samples, the base station can be helped to learn and train better so as to avoid the problem that the optimal strategy falls into the local minimum. In addition, neural networks often overfit some empirical data, transferring information samples by randomly drawing mini-batchThe problem of overfitting can be effectively reduced.
Claims (1)
1. A knowledge-migration reinforcement learning method for collaborative optimization of network slice communication-sensing-computing resources, characterized by comprising the following steps:
step 1, constructing a network slice communication-sensing-computing fusion resource collaborative optimization model, specifically as follows:
assuming that M base stations share K resource blocks and F computing resources, and that the access of N edge sensing devices can be supported; at time t, the ith base station owns $X_i^l(t)$ resource blocks, $Y_i^l(t)$ computing resources and $Z_i^l(t)$ Edge Devices (EDs); wherein $l \in \{e, u, m\}$: $l = e$ denotes the enhanced Mobile BroadBand (eMBB) slice type, $l = u$ denotes the massive Machine-Type Communication (mMTC) slice type, and $l = m$ denotes the ultra-Reliable Low-Latency Communication (uRLLC) slice type;
at time t, defining the binary resource block allocation variable with which the ith base station serves the jth edge device as $x_{i,j}^k(t)$, $1 \le i \le M$, $1 \le j \le N$, $1 \le k \le K$; if $x_{i,j}^k(t) = 1$, base station i allocates the kth resource block to ED j; if $x_{i,j}^k(t) = 0$, base station i does not allocate the kth resource block to ED j; considering that each resource block is allocated to at most one edge device, $\sum_{j=1}^{N} x_{i,j}^k(t) \le 1$;
Defining a binary computing resource allocation variable for an ith base station to access a jth edge device at time tF is more than or equal to 1 and less than or equal to F; if it isIndicating that the base station i distributes the calculation resource f for the EDj; if it isIndicating that the base station i distributes the calculation resource f for the EDj; consider that each computing resource is allocated to at most one edge device, there
at time t, considering the performance differences of the three network slice types $l \in \{e, u, m\}$: the eMBB slice focuses on the sum of the throughputs of all EDs, $R^e(t)$; the uRLLC slice emphasizes the sum of the delays of all EDs, $T^m(t)$; and, since most mMTC slice devices are dormant at any given time, the mMTC slice only concerns the sum of the throughputs of all EDs, $R^u(t)$; then, in order to balance the differentiated slice requirements, under the constraints of communication resources, sensing resources, computing resources, total user delay, total energy consumption and the like, the sum of the throughputs of all users is maximized as the optimization objective, and the network slice communication-sensing-computing fusion multi-dimensional resource collaborative optimization model of step 1 is as follows:
$$\max_{\{X_i^l(t),\,Y_i^l(t),\,Z_i^l(t),\,x_{i,j}^k(t),\,y_{i,j}^f(t)\}} R(t)$$
$$\text{s.t.}\quad \mathrm{C1}: \sum_{l}\sum_{i=1}^{M} X_i^l(t) \le K,\quad \mathrm{C2}: \sum_{l}\sum_{i=1}^{M} Z_i^l(t) \le N,\quad \mathrm{C3}: \sum_{l}\sum_{i=1}^{M} Y_i^l(t) \le F,$$
$$\mathrm{C4}: T(t) \le T,\quad \mathrm{C5}: E(t) \le E$$
wherein the constraints C1, C2, C3, C4 and C5 bound, respectively, the communication resources K, the sensing resources N, the computing resources F, the total user delay T and the total energy consumption E; $X_i^l(t)$ denotes the number of resource blocks owned by base station i at time t, $Y_i^l(t)$ the computing resources owned by base station i at time t, and $Z_i^l(t)$ the edge sensing devices owned by base station i at time t, where $l \in \{e, u, m\}$ ($l = e$: eMBB slice type, $l = u$: mMTC slice type, $l = m$: uRLLC slice type); $x_{i,j}^k(t)$ and $y_{i,j}^f(t)$ denote, respectively, the binary variables with which the ith base station allocates the kth resource block and the computing resource f to the jth edge device at time t; $R(t)$ denotes the sum of the throughputs of all EDs at time t, $T(t)$ the sum of the delays of all EDs at time t, and $E(t)$ the sum of the energy consumption of all EDs at time t; M is the total number of base stations;
step 2, according to the network slice communication-sensing-computing fusion resource collaborative optimization model of step 1, optimizing and solving the number of resource blocks $X_i^l(t)$, the computing resources $Y_i^l(t)$ and the number of edge devices $Z_i^l(t)$ owned by base station i at time t, together with the binary resource block allocation variables $x_{i,j}^k(t)$ and the binary computing resource allocation variables $y_{i,j}^f(t)$, by a knowledge-migration-based multi-agent reinforcement learning method, to obtain the optimized sum of the throughputs of all EDs, $R(t)$;
Step 2.1, modeling the multi-agent random game process: modeling the optimization problem into a multi-agent random game process, and equivalently using each base station as an agent;
the state of each base station is defined as:
$$s_i^t = \left\{\lambda_{i,j}^e(t), \lambda_{i,j}^u(t), \lambda_{i,j}^m(t)\right\}$$
wherein $s_i^t$ denotes the state of the ith base station at time t, and $\lambda_{i,j}^l(t)$ denotes the user sensing-computation tasks of the three network slices when the ith base station serves the jth edge device at time t;
the action of each base station is defined as:
$$a_i^t = \left\{X_i^l(t), Y_i^l(t), Z_i^l(t), x_{i,j}^k(t), y_{i,j}^f(t)\right\}$$
wherein $a_i^t$ denotes the action of the ith base station at time t, comprising the numbers of resource blocks $X_i^l(t)$, computing resources $Y_i^l(t)$ and edge sensing devices $Z_i^l(t)$ owned by base station i for the three network slices, the user resource block allocation strategy $x_{i,j}^k(t)$ with which base station i allocates the kth resource block to the jth edge device, and the computing resource allocation strategy $y_{i,j}^f(t)$ with which it allocates computing resource f;
The reward function of each base station is defined as:
wherein r isi tThe reward function of the ith base station at the moment t is represented, and the sum of the throughputs of all EDs in the ith base station is reflected;
step 2.2, according to the multi-agent stochastic game model of step 2.1, designing a multi-agent knowledge migration reinforcement learning model, as shown in FIG. 1: at time t, base station i observes the state $s_i^t$ from the network environment; under the Actor-Critic algorithm framework, before taking the action $a_i^t$ the Actor network selects either the student or the self-learning behavior mode, then trains the network model in the selected behavior mode and updates the network parameters, thereby obtaining the optimal user resource and computing resource allocation strategy;
at time t, base station i uses a long short-term memory (LSTM) network unit $b_i$ to encode the historical knowledge of the last z consecutive states $\{s_i^{t-z+1}, \ldots, s_i^t\}$ and actions $\{a_i^{t-z+1}, \ldots, a_i^t\}$ as its hidden state $h_i^t$;
The student mode is based on a depth determination strategy gradient model, and the Actor network outputs a probability P 'of selecting the student mode'ssWhen the probability exceeds a threshold G, i.e. P'ssIf the number of the base stations is more than G, the base station i selects a student mode and sends suggestion requests to other base stations; conversely, when P'ssWhen G is less than or equal to G, the base station i selects a self-learning mode;
designing a multi-head attention mechanism model, in which each base station treats the other base stations as teacher base stations and, at time t, receives the historical information $e_n^t$ (states $s_n^t$, actions $a_n^t$ and the like) and the policy network parameters $\theta_n$ of teacher base station n ($1 \le n \le M$); with the learnable multi-head attention parameters $P_1$, $P_2$ and $P_3$, the assigned attention weight can be obtained:
$$W_n = \mathrm{softmax}\!\left(\frac{(P_1 h_i^t)(P_2 e_n^t)^{\top}}{\sqrt{D}}\right)$$
wherein D is the dimension of the historical information vector of teacher base station n; the final strategy proposal is the linearly transformed weighted sum:
$$\mu_s = P_s \sum_{n} W_n \,(P_3 \theta_n)$$
wherein $P_s$ is the learnable parameter for decoding the strategy parameters;
thus, the student base station i utilizes the hidden state $h_i^t$ and the parameters obtained from the multi-head attention mechanism model to obtain its action at this time:
$$a_i^t = \mu_s(h_i^t)$$
the gain in the learning performance of the base station brought by the student mode is defined here as the student reward $r_{s,i}^t$; the student Actor-Critic network is trained using the trained attention selection model; the student Critic network parameters $\phi_i$ of base station i are updated by minimizing the student loss function $L(\phi_i)$:
Wherein,is given by the parameterThe student goal Critic network of (1),andrespectively representing the hidden state at the current time t and the hidden state at the next time t,andrespectively representing the student strategy of the current time t base station i and the student strategy of the next time t base station i,andrepresenting the state-behavior value functions, E [. cndot. ] of the student Critic network and the student target Critic network, respectively]In the interest of expectation,reward function for students, gammaIs a discount factor;
the student Actor network with the $\mu$-parameterized strategy performs a policy gradient update as follows:
$$\nabla_{\mu} J \approx \mathbb{E}\!\left[\nabla_{\mu}\,\mu(h_i^t)\,\nabla_{a} Q(h_i^t, a; \phi_i)\big|_{a=\mu(h_i^t)}\right]$$
in the self-learning mode, if the student Actor of base station i selects the self-learning mode, it sends the hidden state $h_i^t$ of base station i to the self-learning network module, in which each base station independently makes resource optimization action decisions using a Deep Q-Network (DQN) method;
the DQN algorithm framework consists of a current value network and a target value network; the current value network uses the state-action value function $Q(h_i^t, a_i^t; w)$ with weights w to approximate the optimal state-action value function, wherein $h_i^t$ is the hidden state of base station i at time t and $a_i^t$ is the action generated by the current value network; the target value network uses the state-action value function $Q(h_i^{t+1}, a'; w^-)$ with weights $w^-$ to improve the performance of the whole network, wherein $h_i^{t+1}$ is the hidden state of base station i at the next time instant and $a'$ is the action generated by the target value network; after a certain number of rounds, the weights w of the current value network are copied to update the weights $w^-$ of the target value network; the weights w of the current value network are updated by the gradient descent method so as to minimize the loss function:
$$L(w) = \mathbb{E}\!\left[\left(r_i^t + \gamma \max_{a'} Q\big(h_i^{t+1}, a'; w^-\big) - Q\big(h_i^t, a_i^t; w\big)\right)^{2}\right]$$
wherein $r_i^t$ is the self-learning reward function and $\gamma$ is the discount factor;
meanwhile, in order to reduce the correlation of the experience data, the algorithm adopts an experience replay strategy; in hidden state $h_i^t$, base station i obtains the reward $r_i^t$ by performing the action $a_i^t$, and the hidden state then transitions to $h_i^{t+1}$ at the next time instant; the deep neural network stores the transition information $(h_i^t, a_i^t, r_i^t, h_i^{t+1})$ in the experience replay memory B; during learning, mini-batches of transition samples are drawn at random from B to train the neural network; continuously reducing the correlation among training samples helps the base station learn and train better, so that the optimal strategy avoids falling into a local minimum; in addition, neural networks often overfit some experience data, and randomly drawing mini-batch transition samples effectively reduces the overfitting problem.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210185185.2A | 2022-02-28 | 2022-02-28 | Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210185185.2A | 2022-02-28 | 2022-02-28 | Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114615744A (en) | 2022-06-10 |
Family
ID=81858654
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210185185.2A (Withdrawn, published as CN114615744A) | Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method | 2022-02-28 | 2022-02-28 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114615744A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115190633A (en) * | 2022-07-14 | 2022-10-14 | 南京航空航天大学 | Intelligent frequency spectrum on-line anti-interference frequency point allocation method for communication between devices |
CN115361301A (en) * | 2022-10-09 | 2022-11-18 | 之江实验室 | Distributed computing network cooperative traffic scheduling system and method based on DQN |
US12021751B2 (en) | 2022-10-09 | 2024-06-25 | Zhejiang Lab | DQN-based distributed computing network coordinate flow scheduling system and method |
CN115496208A (en) * | 2022-11-15 | 2022-12-20 | 清华大学 | Unsupervised multi-agent reinforcement learning method with collaborative mode diversity guidance |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | Smart resource allocation for mobile edge computing: A deep reinforcement learning approach | |
Yu et al. | Toward resource-efficient federated learning in mobile edge computing | |
Huang et al. | Deep reinforcement learning-based joint task offloading and bandwidth allocation for multi-user mobile edge computing | |
Chen et al. | iRAF: A deep reinforcement learning approach for collaborative mobile edge computing IoT networks | |
CN114615744A (en) | Knowledge migration reinforcement learning network slice general-purpose sensing calculation resource collaborative optimization method | |
CN111405569A (en) | Calculation unloading and resource allocation method and device based on deep reinforcement learning | |
Jiang et al. | Distributed resource scheduling for large-scale MEC systems: A multiagent ensemble deep reinforcement learning with imitation acceleration | |
Chen et al. | Multiuser computation offloading and resource allocation for cloud–edge heterogeneous network | |
Wang et al. | Joint resource allocation and power control for D2D communication with deep reinforcement learning in MCC | |
CN111611062B (en) | Cloud-edge collaborative hierarchical computing method and cloud-edge collaborative hierarchical computing system | |
CN113364859B (en) | MEC-oriented joint computing resource allocation and unloading decision optimization method in Internet of vehicles | |
Meng et al. | Deep reinforcement learning based task offloading algorithm for mobile-edge computing systems | |
CN113573363B (en) | MEC calculation unloading and resource allocation method based on deep reinforcement learning | |
Wang et al. | Optimization for computational offloading in multi-access edge computing: A deep reinforcement learning scheme | |
Fang et al. | Smart collaborative optimizations strategy for mobile edge computing based on deep reinforcement learning | |
Li et al. | DQN-enabled content caching and quantum ant colony-based computation offloading in MEC | |
KR20230007941A (en) | Edge computational task offloading scheme using reinforcement learning for IIoT scenario | |
Li et al. | Task computation offloading for multi-access edge computing via attention communication deep reinforcement learning | |
Li et al. | Computation offloading with reinforcement learning in d2d-mec network | |
Iqbal et al. | Convolutional neural network-based deep Q-network (CNN-DQN) resource management in cloud radio access network | |
Zhu et al. | Computing offloading decision based on multi-objective immune algorithm in mobile edge computing scenario | |
CN110392377A (en) | A kind of 5G super-intensive networking resources distribution method and device | |
Cui et al. | Resource-Efficient DNN Training and Inference for Heterogeneous Edge Intelligence in 6G | |
CN115250156A (en) | Wireless network multichannel frequency spectrum access method based on federal learning | |
Guo et al. | MADRLOM: A Computation offloading mechanism for software-defined cloud-edge computing power network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20220610 |