CN108966325A

CN108966325A - Uplink transmission time optimization method with optimal decoding order for non-orthogonal access based on deep deterministic policy gradient

Info

Publication number: CN108966325A
Application number: CN201810668879.5A
Authority: CN
Inventors: 吴远; 张�成; 倪克杰; 石佳俊; 钱丽萍; 黄亮
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2018-06-25
Filing date: 2018-06-25
Publication date: 2018-12-07
Anticipated expiration: 2038-06-25
Also published as: CN108966325B

Abstract

An uplink transmission time optimization method with optimal decoding order for non-orthogonal access, based on deep deterministic policy gradient, comprising the following steps: (1) under a given decoding order π^m, the optimization problem is formulated as a non-convex optimization problem (P1-m); problem (P1-m) seeks the minimum overall radio resource consumption for given smart terminal upload volumes, and inspection of (P1-m) shows that its objective function has only one variable; (2) and (3) the optimal uplink transmission time is found by the deep deterministic policy gradient method, yielding the minimum overall radio resource consumption; (4) the algorithm OptOrder-Algorithm is used to find the optimal decoding order and, combined with the deep reinforcement learning algorithm, finally outputs the global minimum overall radio resource consumption and the globally optimal uplink transmission time. The invention improves system transmission efficiency and achieves better wireless network quality of experience with minimum overall radio resource consumption.

Description

Uplink transmission time optimization method with optimal decoding order for non-orthogonal access based on deep deterministic policy gradient

Technical field

The invention belongs to the field of communications and relates to an uplink transmission time optimization method with optimal decoding order for non-orthogonal access based on deep deterministic policy gradient.

Background technique

Supporting the massive connectivity required by Internet of Things (IoT) applications is regarded as an important goal of future 5G cellular systems. Non-orthogonal multiple access (NOMA) lets a group of smart terminals (STs) share the same spectrum channel and transmit simultaneously, which provides an effective way to achieve spectrum-efficient data transmission. We consider uplink transmission in a wireless network in which smart terminals (for example, smart watches) use NOMA to send their data to an access hotspot. Our aim is to minimize the overall radio resource consumption, which comprises the uplink transmission time and the total uplink energy.

Summary of the invention

To overcome the long uplink transmission time and high smart terminal energy consumption of the prior art, the present invention provides an uplink transmission time optimization method with optimal decoding order for non-orthogonal access, based on deep deterministic policy gradient, that minimizes the uplink transmission time and the total energy consumption of all smart terminals. Addressing the difficulty of excessive uplink transmission time, the invention focuses on transmitting data by non-orthogonal access.

The technical solution adopted by the present invention to solve the technical problems is:

An uplink transmission time optimization method with optimal decoding order for non-orthogonal access based on deep deterministic policy gradient, comprising the following steps:

(1) There are I smart terminals in total within the coverage of the access hotspot, and together they form the smart terminal set. A given group of I smart terminals has I! possible decoding orders. The smart terminals use non-orthogonal access to send data to the access hotspot simultaneously, and each smart terminal i has a required data volume to send;

Subject to completing the transmission of the data volumes of all smart terminals, and under a given decoding order π^m, where m = 1, 2, …, I!, the problem of minimizing the uplink transmission time and the total energy consumption of all smart terminals is formulated as the optimization problem (P1-m) shown below:

0≤t^m≤T^max (1-3)

Variables: t^m

The variables in the problem are explained as follows:

π^m(i): the decoding position of smart terminal i under the given decoding order π^m;

α: the weight factor of the uplink transmission time;

β: the weight factor of the total uplink energy consumption;

t^m: the uplink transmission time for the smart terminals to send data to the access hotspot, in seconds;

Minimum transmit power of smart terminal i: a function of t^m giving the minimum transmit power, in watts, that smart terminal i requires to finish sending its data volume within the given uplink transmission time t^m under the m-th decoding order π^m;

W: the channel bandwidth from the smart terminals to the access hotspot, in hertz;

n₀: the power spectral density of the channel background noise;

g_iA: the channel power gain from smart terminal i to the access hotspot;

Data volume that smart terminal i needs to send to the access hotspot, in megabits;

Maximum upload energy consumption of smart terminal i, in joules;

T^max: the maximum uplink transmission time for the smart terminals to send data to the access hotspot, in seconds;

Problem (P1-m) seeks the minimum overall radio resource consumption (comprising the uplink transmission time and the total energy consumption of all smart terminals) for given smart terminal upload volumes; inspection of (P1-m) shows that its objective function has only one variable, t^m;
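The closed-form expression for the minimum transmit power did not survive in the text above, so the following is only a sketch under stated assumptions. It uses the textbook uplink NOMA model with successive interference cancellation, in which a terminal decoded earlier sees the received power of every terminal decoded later as interference, and it assumes the objective is α·t^m plus β times the total transmit energy. The function names `min_powers` and `overall_cost` and the example numbers are hypothetical, data volumes are taken in bits rather than megabits, and neither the per-terminal energy cap nor bound (1-3) is checked.

```python
def min_powers(order, s_bits, g, W, n0, t):
    """Minimum transmit power (watts) of each terminal for a given decoding order.

    order  : terminal indices, first-decoded terminal first (assumed convention)
    s_bits : data volume of each terminal, in bits
    g      : channel power gain of each terminal to the access hotspot
    W, n0  : bandwidth (Hz) and noise power spectral density (W/Hz)
    t      : uplink transmission time (s)
    """
    powers = {}
    interference = 0.0  # received power of terminals decoded later
    # Walk the order backwards: the last-decoded terminal sees only noise.
    for i in reversed(order):
        sinr_needed = 2.0 ** (s_bits[i] / (t * W)) - 1.0
        powers[i] = sinr_needed * (interference + W * n0) / g[i]
        interference += powers[i] * g[i]
    return powers


def overall_cost(order, s_bits, g, W, n0, t, alpha, beta):
    """Assumed objective: alpha * time + beta * total transmit energy."""
    powers = min_powers(order, s_bits, g, W, n0, t)
    energy = sum(p * t for p in powers.values())
    return alpha * t + beta * energy


# Hypothetical example: two terminals, each sending 1e6 bits in 1 s over 1 MHz.
s_bits = [1e6, 1e6]
g = [1e-3, 1e-3]
powers = min_powers([0, 1], s_bits, g, W=1e6, n0=1e-9, t=1.0)
cost = overall_cost([0, 1], s_bits, g, W=1e6, n0=1e-9, t=1.0, alpha=1.0, beta=1.0)
```

Under these assumptions the terminal decoded first needs more power than the one decoded last, which is why the decoding order affects the overall cost even when channel gains are equal.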

(2) An optimal uplink transmission time, denoted t^{*,m}, is found by the deep deterministic policy gradient method, which consists of an execution unit (the actor), a scoring unit (the critic) and an environment. The uplink transmission time t^m of all smart terminals and the minimum transmit power of each smart terminal are encoded into the state x_T required by the execution unit. In the current state, the execution unit takes an action a that modifies the uplink transmission time t^m and moves the system to the next state x_{T+1}, and at the same time receives the reward r(x_T, a) returned by the environment. The scoring unit combines the state x_T, the action a and the reward r(x_T, a) to score the execution unit, that is, to indicate how good or bad it was for the execution unit to take action a in state x_T. The goal of the execution unit is to make the score given by the scoring unit as high as possible, while the goal of the scoring unit is to make each score it gives as close to the true value as possible, which it adjusts by means of the reward r(x_T, a). Through continuous interaction and updating among the execution unit, the scoring unit and the environment, t^m is optimized until the minimum overall radio resource consumption is found. The scoring unit is updated as follows:

S(x_T, a) = r(x_T, a) + γS′(x_{T+1}, a′) (2-1)

where the parameters are defined as follows:

x_T: the system state at time T;

x_{T+1}: the system state at time T+1;

a: the action taken by the execution unit in the current state;

a′: the action taken by the execution unit in the next state;

S(x_T, a): the score obtained when the evaluation network of the execution unit takes action a in state x_T;

S′(x_{T+1}, a′): the score obtained when the target network of the execution unit takes action a′ in state x_{T+1};

r(x_T, a): the reward obtained by taking action a in state x_T;

γ: the reward discount factor;
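As a small worked illustration of update rule (2-1), the right-hand side can be computed as a target and compared with the current score. The squared-error loss is an assumption here (it is a common choice for training a critic and is not stated in the text), and the function names and numbers are hypothetical.

```python
def td_target(reward, gamma, next_score):
    """Right-hand side of (2-1): r(x_T, a) + gamma * S'(x_{T+1}, a')."""
    return reward + gamma * next_score


def critic_loss(score, reward, gamma, next_score):
    """Assumed squared error between the current score and the target."""
    return (td_target(reward, gamma, next_score) - score) ** 2


# Hypothetical numbers: reward 1, discount 0.9, target-network score 10.
target = td_target(1.0, 0.9, 10.0)
loss = critic_loss(8.0, 1.0, 0.9, 10.0)
```

With these numbers the target is 10 and the current score of 8 is off by 2, so the assumed loss is 4; the scoring unit would adjust its parameters to shrink that gap.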

(3) The uplink transmission time t^m of all smart terminals and the minimum transmit power of each smart terminal serve as the state x_T of the deep deterministic policy gradient method, and the action a is a change to the state x_T. After the change, the total consumption of the system is compared with a preset reference value: if it is greater than the reference value, the current reward r(x_T, a) is set to a negative value; otherwise it is set to a positive value. The system then enters the next state x_{T+1};
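The reward rule in step (3) can be sketched as a one-line comparison. The text only fixes the sign of the reward, so the magnitude of 1 and the names `reward_for` and `reference` are assumptions made for illustration.

```python
def reward_for(total_consumption, reference, magnitude=1.0):
    """Negative reward if consumption exceeds the reference value, positive otherwise."""
    # The magnitude is an assumed constant; the source specifies only the sign.
    return -magnitude if total_consumption > reference else magnitude


worse = reward_for(5.0, reference=4.0)   # consumption above the reference
better = reward_for(3.0, reference=4.0)  # consumption below the reference
```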

The iterative process of the deep deterministic policy gradient method is as follows:

Step 3.1: initialize the execution unit, the scoring unit and the replay memory of the deep deterministic policy gradient method; the current system state is x_T, T is initialized to 1, and the iteration counter k is initialized to 1;

Step 3.2: while k is less than or equal to the given number of iterations K, the execution unit predicts an action a in state x_T;

Step 3.3: the action a modifies the state x_T so that it becomes the next state x_{T+1}, and the reward r(x_T, a) fed back by the environment is obtained;

Step 3.4: the experience is stored in the replay memory in the format (x_T, a, r(x_T, a), x_{T+1});

Step 3.5: the scoring unit receives the action a, the state x_T and the reward r(x_T, a), and gives the execution unit the score S(x_T, a);

Step 3.6: the execution unit updates its own parameters to maximize the score S(x_T, a), so that it is as likely as possible to take a high-scoring action next time;

Step 3.7: the scoring unit samples past experience from the replay memory, keeps learning, and updates its parameters so that its scores are as accurate as possible; meanwhile k = k + 1, and the process returns to Step 3.2;

Step 3.8: when k is greater than the given number of iterations K, the learning process ends, yielding the optimal uplink transmission time t^{*,m} and the corresponding minimum overall radio resource consumption;
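The experience store used in Steps 3.4 and 3.7 can be sketched as a fixed-size buffer with random sampling. This is a minimal sketch, not the patent's implementation: the class name `ReplayMemory`, the capacity, uniform sampling and the placeholder states are all assumptions.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (x_T, a, r, x_{T+1}) tuples (hypothetical helper)."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest experience once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        """Step 3.4: save one experience tuple."""
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Step 3.7: draw a uniform random batch without replacement."""
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=2)
memory.store("x1", 0.1, -1.0, "x2")
memory.store("x2", -0.2, 1.0, "x3")
memory.store("x3", 0.0, 1.0, "x4")  # evicts the oldest tuple
batch = memory.sample(5)
```

Sampling past experience at random, rather than always learning from the latest step, is the usual reason for keeping such a memory in this family of methods.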

(4) Once the optimal uplink transmission time under a given decoding order π^m has been obtained, the algorithm OptOrder-Algorithm is used to find the optimal decoding order, that is, to find the globally optimal uplink transmission time that gives the global minimum overall radio resource consumption;

The procedure of OptOrder-Algorithm is as follows. Let the smart terminal set be I^all = {g_1A, g_2A, …, g_IA}, where |I^all| denotes the cardinality of I^all. Initialize the current candidate set I^cur = {g_1A, g_2A, …, g_IA}, where |I^cur| denotes the cardinality of I^cur, the current best decoding order CBS as empty, the current best value CBV as a sufficiently large number, and the current test set I^{cur,test} as empty. In the first iteration, the elements of I^cur are placed into I^{cur,test} one at a time, and the algorithm P2-Algorithm is called to find the current best I^{cur,test}, that is, the I^{cur,test} with the current minimum overall radio resource consumption. I^cur is then updated to I^all with the elements of I^{cur,test} removed, and CBS is updated to the current best I^{cur,test}. In the second iteration, the elements of the current I^cur are again inserted into I^{cur,test} one at a time (at this point I^{cur,test} contains only one element, so the new element is inserted to its left or to its right), P2-Algorithm is called to find the current best I^{cur,test}, and I^cur and CBS are updated in the same way. Each time an element of the current I^cur is inserted into I^{cur,test}, the relative positions of the elements already fixed in I^{cur,test} must not be changed. The iteration continues in this way until the last iteration yields the globally optimal decoding order CBS, the global minimum overall radio resource consumption θ^*, and the globally optimal uplink transmission time t^*;

Finally, the θ^* output by OptOrder-Algorithm is the global minimum overall radio resource consumption sought in problem (P1-m), and t^* is the globally optimal uplink transmission time sought in problem (P1-m).

Further, in step (4), the procedure of OptOrder-Algorithm is as follows:

Step 4.1: set I^all = I^cur = {g_1A, g_2A, …, g_IA} and CBS = ∅;

Step 4.2: begin the while loop;

Step 4.3: set CBV to a sufficiently large number;

Step 4.4: begin the for loop m = 1:1:|I^cur|;

Step 4.5: begin the for loop h = 0:1:|CBS|;

Step 4.6: set I^{cur,test} = ∅;

Step 4.7: if h = 0, set I^{cur,test} = {I^cur(m), CBS};

Step 4.8: otherwise (h ≠ 0), set I^{cur,test} = {CBS(1:h), I^cur(m), CBS(h+1:|CBS|)};

Step 4.9: having obtained I^{cur,test}, compute θ^{*,cur,test} and t^{*,m} using the deep deterministic policy gradient method of steps (2) and (3);

Step 4.10: if θ^{*,cur,test} < CBV, set CBV = θ^{*,cur,test} and t^* = t^{*,m}, and at the same time set CBS = I^{cur,test};

Step 4.11: when h = |CBS|, end the for loop of Step 4.5;

Step 4.12: when m = |I^cur|, end the for loop of Step 4.4;

Step 4.13: set I^cur = I^all \ CBS;

Step 4.14: when I^cur = ∅, end the while loop of Step 4.2;

Step 4.15: output θ^* = CBV and t^*.
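Steps 4.1 to 4.15 can be sketched as a greedy insertion search. This is an illustrative reading, not the patent's code: `evaluate` is a hypothetical callable standing in for the deep deterministic policy gradient solver of steps (2) and (3), the toy cost function is invented for the example, and the best candidate of each round is committed to CBS only at the end of the round, which is one interpretation of the rule that already-fixed positions must not change.

```python
def opt_order(terminals, evaluate):
    """Greedy insertion search for a decoding order.

    terminals : list of terminal identifiers (plays the role of I^all)
    evaluate  : hypothetical callable mapping a candidate order to
                (cost, best_time); it stands in for the DDPG solver
    """
    cbs = []                     # current best sequence (CBS)
    remaining = list(terminals)  # I^cur
    cbv, t_star = float("inf"), None
    while remaining:                           # Step 4.2
        cbv = float("inf")                     # Step 4.3: reset CBV each round
        best = None
        for m in remaining:                    # Step 4.4
            for h in range(len(cbs) + 1):      # Step 4.5
                # Steps 4.7-4.8: insert m at position h, keeping CBS order.
                test = cbs[:h] + [m] + cbs[h:]
                cost, t = evaluate(test)       # Step 4.9
                if cost < cbv:                 # Step 4.10
                    cbv, t_star, best = cost, t, test
        if best is None:
            raise ValueError("no candidate order had a finite cost")
        cbs = best
        remaining = [x for x in terminals if x not in cbs]  # Step 4.13
    return cbs, cbv, t_star                    # Step 4.15


# Toy stand-in for the solver: position-weighted cost, fixed time of 1.0.
weights = {"a": 3, "b": 1, "c": 2}


def toy_evaluate(order):
    cost = sum((pos + 1) * weights[x] for pos, x in enumerate(order))
    return cost, 1.0


order, cost, t_star = opt_order(["a", "b", "c"], toy_evaluate)
```

In this sketch round k tries every remaining terminal at each of k positions, so the total number of calls to `evaluate` is I(I+1)(I+2)/6 for I terminals, which grows polynomially in I whereas the number of possible decoding orders is I!.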

The technical concept of the invention is as follows. First, in a cellular wireless network, mobile users transmit data by non-orthogonal access so as to minimize the uplink transmission time and the total energy consumption of all mobile users, thereby obtaining economic benefit and quality of service. The constraints considered are the limits on each mobile user's upload energy consumption and on the uplink transmission time. Subject to completing the transmission of all mobile users' data volumes, the overall radio resource consumption and the total energy consumption of all smart terminals are minimized. The algorithm OptOrder-Algorithm is then used to find the optimal decoding order and to compute the globally optimal uplink transmission time and the global minimum overall radio resource consumption.

The beneficial effects of the invention are mainly as follows: 1. for the uplink system as a whole, non-orthogonal access greatly improves system transmission efficiency; 2. for the uplink system as a whole, non-orthogonal access yields better wireless network quality of experience; 3. the optimal uplink transmission time is obtained by the deep deterministic policy gradient method, giving the minimum overall radio resource consumption (comprising the uplink transmission time and the total energy consumption of all smart terminals).

Description of the drawings

Fig. 1 is a schematic diagram of the uplink scenario with multiple smart terminals and an access hotspot in a wireless network;

Fig. 2 is a schematic diagram of all decoding orders for 3 STs;

Fig. 3 is a schematic illustration of OptOrder-Algorithm for 5 STs;

Fig. 4 is a flow chart of the method for finding the optimal uplink transmission time.

Specific embodiment

The present invention is described in further detail below with reference to the accompanying drawings.

Referring to Fig. 1, Fig. 2, Fig. 3 and Fig. 4, an uplink transmission time optimization method with optimal decoding order for non-orthogonal access based on deep deterministic policy gradient is provided. While guaranteeing that the data of all smart terminals is completely sent, the method minimizes the uplink transmission time and the total energy consumption of all smart terminals and improves the wireless network quality of experience of the whole system. The invention is applied to a wireless network in the scenario shown in Fig. 1. The optimization method designed for this problem comprises the following steps:

0≤t^m≤T^max (1-3)

Variables: t^m

The variables in the problem are explained as follows:

α: the weight factor of the uplink transmission time;

β: the weight factor of the total uplink energy consumption;

n₀: the power spectral density of the channel background noise;

g_iA: the channel power gain from smart terminal i to the access hotspot;

Maximum upload energy consumption of smart terminal i, in joules;

S(x_T, a) = r(x_T, a) + γS′(x_{T+1}, a′) (2-1)

where the parameters are defined as follows:

x_T: the system state at time T;

x_{T+1}: the system state at time T+1;

a: the action taken by the execution unit in the current state;

a′: the action taken by the execution unit in the next state;

r(x_T, a): the reward obtained by taking action a in state x_T;

γ: the reward discount factor;

(3) The uplink transmission time t^m of all smart terminals and the minimum transmit power of each smart terminal serve as the state x_T of the deep deterministic policy gradient method, and the action a is a change to the state x_T. After the change, the total consumption of the system is compared with a preset reference value: if it is greater than the reference value, the current reward r(x_T, a) is set to a negative value; otherwise it is set to a positive value. The system then enters the next state x_{T+1};

Step 3.4: the experience is stored in the replay memory in the format (x_T, a, r(x_T, a), x_{T+1});

The procedure of OptOrder-Algorithm is as follows. Let the smart terminal set be I^all = {g_1A, g_2A, …, g_IA}, where |I^all| denotes the cardinality of I^all. Initialize the current candidate set I^cur = {g_1A, g_2A, …, g_IA}, where |I^cur| denotes the cardinality of I^cur, the current best decoding order CBS as empty, the current best value CBV as a sufficiently large number, and the current test set I^{cur,test} as empty. In the first iteration, the elements of I^cur are placed into I^{cur,test} one at a time, and the algorithm P2-Algorithm is called to find the current best I^{cur,test}, that is, the I^{cur,test} with the current minimum overall radio resource consumption. I^cur is then updated to I^all with the elements of I^{cur,test} removed, and CBS is updated to the current best I^{cur,test}. In the second iteration, the elements of the current I^cur are again inserted into I^{cur,test} one at a time (at this point I^{cur,test} contains only one element, so the new element is inserted to its left or to its right), P2-Algorithm is called to find the current best I^{cur,test}, and I^cur and CBS are updated in the same way. Each time an element of the current I^cur is inserted into I^{cur,test}, the relative positions of the elements already fixed in I^{cur,test} must not be changed. The iteration continues in this way until the last iteration yields the globally optimal decoding order CBS, the global minimum overall radio resource consumption θ^*, and the globally optimal uplink transmission time t^*. The steps of OptOrder-Algorithm are as follows:

Step 4.1: set I^all = I^cur = {g_1A, g_2A, …, g_IA} and CBS = ∅;

Step 4.2: begin the while loop;

Step 4.3: set CBV to a sufficiently large number;

Step 4.4: begin the for loop m = 1:1:|I^cur|;

Step 4.5: begin the for loop h = 0:1:|CBS|;

Step 4.6: set I^{cur,test} = ∅;

Step 4.7: if h = 0, set I^{cur,test} = {I^cur(m), CBS};

Step 4.8: otherwise (h ≠ 0), set I^{cur,test} = {CBS(1:h), I^cur(m), CBS(h+1:|CBS|)};

Step 4.9: having obtained I^{cur,test}, compute θ^{*,cur,test} and t^{*,m} using the deep deterministic policy gradient method of steps (2) and (3);

Step 4.10: if θ^{*,cur,test} < CBV, set CBV = θ^{*,cur,test} and t^* = t^{*,m}, and at the same time set CBS = I^{cur,test};

Step 4.11: when h = |CBS|, end the for loop of Step 4.5;

Step 4.12: when m = |I^cur|, end the for loop of Step 4.4;

Step 4.13: set I^cur = I^all \ CBS;

Step 4.14: when I^cur = ∅, end the while loop of Step 4.2;

Step 4.15: output θ^* = CBV and t^*.

Claims

1. An uplink transmission time optimization method with optimal decoding order for non-orthogonal access based on deep deterministic policy gradient, characterized in that the method comprises the following steps:

(1) There are I smart terminals in total within the coverage of the access hotspot, and together they form the smart terminal set. A given group of I smart terminals has I! possible decoding orders. The smart terminals use non-orthogonal access to send data to the access hotspot simultaneously, and each smart terminal i has a required data volume to send;

Subject to completing the transmission of the data volumes of all smart terminals, and under a given decoding order π^m, where m = 1, 2, …, I!, the problem of minimizing the uplink transmission time and the total energy consumption of all smart terminals is formulated as the optimization problem (P1-m) shown below:

0≤t^m≤T^max (1-3)

Variables: t^m

The variables in the problem are explained as follows:

π^m(i): the decoding position of smart terminal i under the given decoding order π^m;

α: the weight factor of the uplink transmission time;

β: the weight factor of the total uplink energy consumption;

t^m: the uplink transmission time for the smart terminals to send data to the access hotspot, in seconds;

Minimum transmit power of smart terminal i: a function of t^m giving the minimum transmit power, in watts, that smart terminal i requires to finish sending its data volume within the given uplink transmission time t^m under the m-th decoding order π^m;

W: the channel bandwidth from the smart terminals to the access hotspot, in hertz;

n₀: the power spectral density of the channel background noise;

g_iA: the channel power gain from smart terminal i to the access hotspot;

Data volume that smart terminal i needs to send to the access hotspot, in megabits;

Maximum upload energy consumption of smart terminal i, in joules;

T^max: the maximum uplink transmission time for the smart terminals to send data to the access hotspot, in seconds;

Problem (P1-m) seeks the minimum overall radio resource consumption (comprising the uplink transmission time and the total energy consumption of all smart terminals) for given smart terminal upload volumes; inspection of (P1-m) shows that its objective function has only one variable, t^m;

(2) An optimal uplink transmission time, denoted t^{*,m}, is found by the deep deterministic policy gradient method, which consists of an execution unit (the actor), a scoring unit (the critic) and an environment. The uplink transmission time t^m of all smart terminals and the minimum transmit power of each smart terminal are encoded into the state x_T required by the execution unit. In the current state, the execution unit takes an action a that modifies the uplink transmission time t^m and moves the system to the next state x_{T+1}, and at the same time receives the reward r(x_T, a) returned by the environment. The scoring unit combines the state x_T, the action a and the reward r(x_T, a) to score the execution unit, that is, to indicate how good or bad it was for the execution unit to take action a in state x_T. The goal of the execution unit is to make the score given by the scoring unit as high as possible, while the goal of the scoring unit is to make each score it gives as close to the true value as possible, which it adjusts by means of the reward r(x_T, a). Through continuous interaction and updating among the execution unit, the scoring unit and the environment, t^m is optimized until the minimum overall radio resource consumption is found. The scoring unit is updated as follows:

S(x_T, a) = r(x_T, a) + γS′(x_{T+1}, a′) (2-1)

where the parameters are defined as follows:

x_T: the system state at time T;

x_{T+1}: the system state at time T+1;

a: the action taken by the execution unit in the current state;

a′: the action taken by the execution unit in the next state;

S(x_T, a): the score obtained when the evaluation network of the execution unit takes action a in state x_T;

S′(x_{T+1}, a′): the score obtained when the target network of the execution unit takes action a′ in state x_{T+1};

r(x_T, a): the reward obtained by taking action a in state x_T;

γ: the reward discount factor;

(3) The uplink transmission time t^m of all smart terminals and the minimum transmit power of each smart terminal serve as the state x_T of the deep deterministic policy gradient method, and the action a is a change to the state x_T. After the change, the total consumption of the system is compared with a preset reference value: if it is greater than the reference value, the current reward r(x_T, a) is set to a negative value; otherwise it is set to a positive value. The system then enters the next state x_{T+1};

The iterative process of the deep deterministic policy gradient method is as follows:

Step 3.1: initialize the execution unit, the scoring unit and the replay memory of the deep deterministic policy gradient method; the current system state is x_T, T is initialized to 1, and the iteration counter k is initialized to 1;

Step 3.2: while k is less than or equal to the given number of iterations K, the execution unit predicts an action a in state x_T;

Step 3.3: the action a modifies the state x_T so that it becomes the next state x_{T+1}, and the reward r(x_T, a) fed back by the environment is obtained;

Step 3.4: the experience is stored in the replay memory in the format (x_T, a, r(x_T, a), x_{T+1});

Step 3.5: the scoring unit receives the action a, the state x_T and the reward r(x_T, a), and gives the execution unit the score S(x_T, a);

Step 3.6: the execution unit updates its own parameters to maximize the score S(x_T, a), so that it is as likely as possible to take a high-scoring action next time;

Step 3.7: the scoring unit samples past experience from the replay memory, keeps learning, and updates its parameters so that its scores are as accurate as possible; meanwhile k = k + 1, and the process returns to Step 3.2;

Step 3.8: when k is greater than the given number of iterations K, the learning process ends, yielding the optimal uplink transmission time t^{*,m} and the corresponding minimum overall radio resource consumption;

(4) Once the optimal uplink transmission time under a given decoding order π^m has been obtained, the algorithm OptOrder-Algorithm is used to find the optimal decoding order, that is, to find the globally optimal uplink transmission time that gives the global minimum overall radio resource consumption;

The procedure of OptOrder-Algorithm is as follows. Let the smart terminal set be I^all = {g_1A, g_2A, …, g_IA}, where |I^all| denotes the cardinality of I^all. Initialize the current candidate set I^cur = {g_1A, g_2A, …, g_IA}, where |I^cur| denotes the cardinality of I^cur, the current best decoding order CBS as empty, the current best value CBV as a sufficiently large number, and the current test set I^{cur,test} as empty. In the first iteration, the elements of I^cur are placed into I^{cur,test} one at a time, and the algorithm P2-Algorithm is called to find the current best I^{cur,test}, that is, the I^{cur,test} with the current minimum overall radio resource consumption. I^cur is then updated to I^all with the elements of I^{cur,test} removed, and CBS is updated to the current best I^{cur,test}. In the second iteration, the elements of the current I^cur are again inserted into I^{cur,test} one at a time (at this point I^{cur,test} contains only one element, so the new element is inserted to its left or to its right), P2-Algorithm is called to find the current best I^{cur,test}, and I^cur and CBS are updated in the same way. Each time an element of the current I^cur is inserted into I^{cur,test}, the relative positions of the elements already fixed in I^{cur,test} must not be changed. The iteration continues in this way until the last iteration yields the globally optimal decoding order CBS, the global minimum overall radio resource consumption θ^*, and the globally optimal uplink transmission time t^*. The steps of OptOrder-Algorithm are as follows:

Step 4.1: set I^all = I^cur = {g_1A, g_2A, …, g_IA} and CBS = ∅;

Step 4.2: begin the while loop;

Step 4.3: set CBV to a sufficiently large number;

Step 4.4: begin the for loop m = 1:1:|I^cur|;

Step 4.5: begin the for loop h = 0:1:|CBS|;

Step 4.6: set I^{cur,test} = ∅;

Step 4.7: if h = 0, set I^{cur,test} = {I^cur(m), CBS};

Step 4.8: otherwise (h ≠ 0), set I^{cur,test} = {CBS(1:h), I^cur(m), CBS(h+1:|CBS|)};

Step 4.9: having obtained I^{cur,test}, compute θ^{*,cur,test} and t^{*,m} using the deep deterministic policy gradient method of steps (2) and (3);

Step 4.10: if θ^{*,cur,test} < CBV, set CBV = θ^{*,cur,test} and t^* = t^{*,m}, and at the same time set CBS = I^{cur,test};

Step 4.11: when h = |CBS|, end the for loop of Step 4.5;

Step 4.12: when m = |I^cur|, end the for loop of Step 4.4;

Step 4.13: set I^cur = I^all \ CBS;

Step 4.14: when I^cur = ∅, end the while loop of Step 4.2;

Step 4.15: output θ^* = CBV and t^*;

Finally, the θ^* output by OptOrder-Algorithm is the global minimum overall radio resource consumption sought in problem (P1-m), and t^* is the globally optimal uplink transmission time sought in problem (P1-m).