CN109412971B - Data distribution method based on action value function learning and electronic equipment - Google Patents

Data distribution method based on action value function learning and electronic equipment

Info

Publication number
CN109412971B
CN109412971B (application CN201811178951.2A; publication of application CN109412971A)
Authority
CN
China
Prior art keywords
time
cost
vector
user
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811178951.2A
Other languages
Chinese (zh)
Other versions
CN109412971A (en)
Inventor
张�成
张险峰
陈庆武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou City Zhilan E Commerce Co ltd
Original Assignee
Guangzhou City Zhilan E Commerce Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou City Zhilan E Commerce Co ltd filed Critical Guangzhou City Zhilan E Commerce Co ltd
Publication of CN109412971A publication Critical patent/CN109412971A/en
Application granted granted Critical
Publication of CN109412971B publication Critical patent/CN109412971B/en
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/10 Flow control; Congestion control
    • H04L 47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a data distribution method based on action value function learning, which comprises the following steps: acquiring the vector set of all data streams and the position corresponding to each data stream at any time; calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t; calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user; and updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating the target action value function. Through reinforcement learning, the invention selects the minimum-cost policy, thereby realizing data offloading and reducing overhead.

Description

Data distribution method based on action value function learning and electronic equipment
Technical Field
The present invention relates to data distribution technologies, and in particular, to a data distribution method and an electronic device based on action value function learning.
Background
Chinese patent CN102821424A discloses a method, a communication device, and a mobile device for assisting mobile data offloading, in which, in a mobile communication mode, a signal is verified by an auxiliary communication device, a first communication link is constructed, and data offloading is performed on the first communication link. Although offloading can be realized in this way, the construction is overly complex and energy-consuming. US patent US20110317571A1 discloses a method and apparatus for data offloading, which monitors the network environment by deploying a plurality of devices, compares the data usage of the devices in the network, and selects whether to offload to other networks. However, this method is impractical and incurs a large overhead. US patent US20120230191A1 discloses a method and system for data offloading in mobile communications, which forms a data offloading system from basic devices and a data offloading controller, and monitors data exchange through the basic devices to determine whether data offloading is required, thereby controlling the signal sent by the data offloading controller. This method does not consider energy consumption or efficiency, and it is highly complex and impractical. Most existing patents consider data offloading from the perspective of network operators, with policies aimed at the quality of service (QoS) of mobile users. Existing strategies considered from the mobile user's perspective place high demands on the energy consumption of the whole system and on network cost, and their efficiency gains are not significant.
Disclosure of Invention
In order to overcome the defects of the prior art, one object of the present invention is to provide a data offloading method based on action value function learning, which solves the problem that the prior art cannot offload data well.
Another object of the present invention is to provide an electronic device, which solves the same problem.
The first object of the invention is achieved by the following technical scheme:
A data distribution method based on action value function learning is applied to a network system and comprises the following steps:
S1: setting a random parameter, and initializing an action value function with the random parameter;
S2: acquiring the vector set of all data streams and the position corresponding to each data stream at any time;
S3: calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at any time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t;
S4: calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user; and updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating a target action value function.
Preferably, the method further comprises, before S1, S0: initializing a replay memory D with capacity N.
Preferably, the random parameter in S1 is defined as θ and the action value function as Q, and the following step is further included between S1 and S2: Sa: initializing a target action value function Q̂ with a random parameter θ⁻.
Preferably, in S3, the state of the network system at time t is defined as s_t = {l_t, b_t}, where l_t ∈ 𝓛 is the position at time t and b_t is the vector of the reserved file sizes of all data streams at time t. Setting t = 1, with b_1 the vector of initial file sizes and l_1 random, gives s_1 = (l_1, b_1), where 𝓜 is the vector set of all data streams and l_1 is the position of the corresponding data stream at time t = 1. The monetary cost and the energy consumption cost at time t are then calculated according to the state of the network system at time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t.
Preferably, S3 specifically comprises the following steps:
S31: set t = 1, with b_1 the vector of initial file sizes and l_1 random, so that s_1 = (l_1, b_1);
S32: while t ≤ T and b_t > 0, randomly select a random number rnd in [0, 1] and judge whether rnd < ε; if so, randomly select an action from the action vector of the user, otherwise obtain the action of the user from the formula a_t = argmin_a Q(s_t, a; θ), where Q(s_t, a; θ) is the action value function used in this step in place of the ideal action value function, and a_t is the action vector of the user;
S33: define s_{t+1} = (l_t, [b_t − a_{t,c} − a_{t,w}]^+), where l_t is the position of the corresponding data stream at time t, a_{t,c} is the vector of data rates allocated to the cellular network, and a_{t,w} is the vector of data rates allocated to the wireless local area network;
S34: calculate the sum of the monetary cost and the energy consumption cost at time t by the formula r_t(s_t, a_t) = c_t(s_t, a_t) + ε_t(s_t, a_t), where r_t(s_t, a_t) is the sum of the monetary and energy consumption costs, c_t(s_t, a_t) is the monetary cost, and ε_t(s_t, a_t) is the energy consumption cost.
Preferably, step S4 specifically comprises:
S41: store the experience (s_t, a_t, r_t, s_{t+1}) in the memory D;
S42: randomly sample a transition (s_j, a_j, r_j, s_{j+1}) from the memory D and judge whether the episode terminates at step j + 1; if so, set z_j = r_j, otherwise set z_j = r_j + γ min_{a′} Q̂(s_{j+1}, a′; θ⁻) and execute S43; where z_j is the target action value function and r_j is the sum of the monetary cost and the energy consumption cost of data stream j;
S43: perform gradient descent on (z_j − Q_t(s_t, a_t; θ))², set t := t + 1, and reset Q̂ = Q.
The second object of the invention is achieved by the following technical scheme:
An electronic device having a memory and a processor, the memory storing a computer program executable by the processor, the computer program, when executed by the processor, implementing the following steps:
S1: setting a random parameter, and initializing an action value function with the random parameter;
S2: acquiring the vector set of all data streams and the position corresponding to each data stream at any time;
S3: calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at any time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t; calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user;
S4: updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating the target action value function.
Preferably, the steps further comprise, before S1, S0: initializing a replay memory D with capacity N.
Preferably, the random parameter in S1 is defined as θ and the action value function as Q, and the following step is further included between S1 and S2: Sa: initializing a target action value function Q̂ with a random parameter θ⁻.
Compared with the prior art, the invention has the following beneficial effects:
through reinforcement learning, the user can respond in time in an unknown environment and select the minimum-cost policy, thereby realizing data offloading and reducing the overall overhead of the system.
Drawings
FIG. 1 is a flowchart of the data distribution method based on action value function learning according to the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and the detailed description below:
As shown in fig. 1, the present invention provides a data offloading method based on action value function learning, which comprises the following steps:
S1: initializing a replay memory D with capacity N;
S2: setting a random parameter θ, and initializing an action value function Q with the random parameter θ;
S3: initializing a target action value function Q̂ with a random parameter θ⁻;
S4: acquiring the vector set of all data streams and the position corresponding to each data stream at any time;
S5: calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at any time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t;
S6: calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user; and updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating the target action value function.
For S5, the state of the network system at time t is defined as s_t = {l_t, b_t}, where l_t ∈ 𝓛 is the position at time t and b_t is the vector of the reserved file sizes of all data streams at time t. Setting t = 1, with b_1 the vector of initial file sizes and l_1 random, gives s_1 = (l_1, b_1), where 𝓜 is the vector set of all data streams and l_1 is the position of the corresponding data stream at time t = 1. The monetary cost and the energy consumption cost at time t are then calculated according to the state of the network system at time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t.
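To make the state definition concrete, the following is a minimal Python sketch of the state initialization of S5; the stream count, location count, and initial file sizes are illustrative assumptions, not values taken from the invention.

```python
import random

# Illustrative sizes (assumptions): M = 3 data streams, L = 4 possible locations.
M, L = 3, 4
initial_sizes = [5.0e6, 2.0e6, 8.0e6]   # b_1: reserved file size of each stream, in bits

def initial_state():
    """t = 1, b_1 = vector of initial file sizes, l_1 random -> s_1 = (l_1, b_1)."""
    l1 = random.randrange(L)             # l_1 drawn at random from the location set
    b1 = list(initial_sizes)             # b_1: reserved file sizes of all streams at t = 1
    return (l1, b1)                      # s_1 = (l_1, b_1)

s1 = initial_state()
```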
Step S6 specifically comprises the following steps:
S61: store the experience (s_t, a_t, r_t, s_{t+1}) in the memory D;
S62: randomly sample a transition (s_j, a_j, r_j, s_{j+1}) from the memory D and judge whether the episode terminates at step j + 1; if so, set z_j = r_j, otherwise set z_j = r_j + γ min_{a′} Q̂(s_{j+1}, a′; θ⁻) and execute S63; where z_j is the target action value function and r_j is the sum of the monetary cost and the energy consumption cost of data stream j;
S63: perform gradient descent on (z_j − Q_t(s_t, a_t; θ))², set t := t + 1, and reset Q̂ = Q.
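Steps S1 to S6 together form one DQN training loop. Below is a minimal sketch of that loop in Python with PyTorch. The network shape, the 8-feature encoding of s_t = (l_t, b_t), the discrete action set, and the env_step() transition function are all assumptions made for illustration; action selection uses argmin because the Q value here estimates an expected cost to be minimized.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Assumed problem sizes: an 8-feature encoding of s_t = (l_t, b_t) and 5 discrete actions.
STATE_DIM, N_ACTIONS = 8, 5
GAMMA, EPS, ALPHA = 0.9, 0.1, 1e-3    # discount factor, exploration probability, learning rate

def make_qnet():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q = make_qnet()                        # S2: action value function Q with random parameter theta
q_hat = make_qnet()                    # S3/Sa: target action value function Q_hat
q_hat.load_state_dict(q.state_dict())  # theta_minus := theta
opt = torch.optim.SGD(q.parameters(), lr=ALPHA)
D = deque(maxlen=10_000)               # S1: replay memory D with capacity N

def train_step(s, env_step):
    # S32: epsilon-greedy selection; argmin because the Q value estimates an expected cost
    if random.random() < EPS:
        a = random.randrange(N_ACTIONS)
    else:
        a = int(q(torch.as_tensor(s)).argmin())
    s_next, r, done = env_step(s, a)   # S33/S34: next state and cost r_t = c_t + eps_t (assumed env)
    D.append((s, a, r, s_next, done))  # S61: store the experience in D

    ss, aa, rr, sn, dd = map(list, zip(*random.sample(D, min(32, len(D)))))  # S62: sample from D
    ss, sn = torch.as_tensor(ss), torch.as_tensor(sn)
    not_done = 1.0 - torch.as_tensor(dd, dtype=torch.float32)
    # z_j = r_j if the episode terminated, else r_j + gamma * min_a' Q_hat(s_{j+1}, a'; theta_minus)
    z = torch.as_tensor(rr) + GAMMA * q_hat(sn).min(dim=1).values * not_done
    pred = q(ss)[torch.arange(len(aa)), torch.as_tensor(aa)]
    loss = ((z.detach() - pred) ** 2).mean()   # S63: (z_j - Q(s_j, a_j; theta))^2
    opt.zero_grad()
    loss.backward()
    opt.step()                                 # gradient descent on theta
    q_hat.load_state_dict(q.state_dict())      # S63: reset Q_hat = Q (often done only every C steps)
    return s_next, done
```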
Due to the wide coverage of cellular networks, it is assumed that mobile users can always access the cellular network but cannot always access a wireless local area network. Access points of wireless local area networks are usually deployed at home, at stations, in supermarkets, and the like; we therefore assume that wireless local area network access depends on geographic location. We are mainly concerned with applications of large data size and high delay tolerance. The mobile user has M files to download from a remote server, each file constituting a stream, and the collection of data streams is denoted 𝓜 = {1, 2, …, M}. Each stream j ∈ 𝓜 has a time limit T_j, and T = (T_1, T_2, …, T_M) is the vector of deadlines of the mobile user's M streams. Without loss of generality, it is assumed that T_1 ≤ T_2 ≤ … ≤ T_M. We track time in discrete slots t ∈ {1, 2, …, T_M}. To simplify the formulation, we replace the infinitely many continuous positions with finitely many discrete positions: suppose the user can reach L possible locations, expressed as the set 𝓛 = {1, 2, …, L}. While the cellular network is reachable at all locations, the availability of the wireless local area network depends on the location l ∈ 𝓛.
By weighing the total price, energy consumption, and data transmission time, the user decides, at time t and position l, which network to select and how to allocate data rates to the M data streams. The user's decision process resembles a finite-horizon Markov decision process. We define the state of the system at time t as s_t = {l_t, b_t}. Here l_t ∈ 𝓛 is the position of the mobile user at time t, which can be obtained from GPS, and 𝓛 is the set of locations. b_t = (b_{t,1}, b_{t,2}, …, b_{t,M}) is the vector of the reserved file sizes of all M data streams at time t: for every stream j ∈ 𝓜, b_{t,j} is the total reserved (not yet downloaded) data size of stream j, and 𝓑 denotes the set of such reserved-data vectors.
The action a_t of the mobile user at decision time t is to select the wireless local area network or the cellular network for transmitting data, or to remain idle, and to decide how to allocate the network data rate to the M data streams. The action vector of the mobile user is thus expressed as a_t = (a_{t,c}, a_{t,w}), where a_{t,c} = (a_{t,c,1}, …, a_{t,c,M}) is the vector of data rates allocated on the cellular network, a_{t,c,j} is the cellular data rate assigned to stream j ∈ 𝓜, and a_{t,w} = (a_{t,w,1}, …, a_{t,w,M}) is the vector of data rates allocated on the wireless local area network. The symbols c and w denote the cellular network and the wireless local area network, respectively. If the user is not within range of a wireless local area network access point, all components of a_{t,w} are 0.
We assume that the user transmits over at most one network at a time, for two reasons: (1) by using only one network, the user equipment can run longer on its remaining battery; (2) today's smartphones can use only one type of network at a time. Under this assumption, the algorithm can be applied to existing devices without changing hardware or operating systems. When the mobile user selects the wireless local area network, a data rate a_{t,w,j} ≥ 0 is assigned to each stream j and the cellular network is unused, a_{t,c} = 0; when the mobile user selects the cellular network, a data rate a_{t,c,j} ≥ 0 is assigned to each stream j and the wireless local area network is unused, a_{t,w} = 0. The allocated rate a_{t,n,j}, n ∈ {c, w}, should not be larger than the reserved file size b_{t,j}.
The total data rates over all flows on the wireless local area network and the cellular network are Σ_{j∈𝓜} a_{t,w,j} and Σ_{j∈𝓜} a_{t,c,j}, respectively. a_{t,w} and a_{t,c} must satisfy the conditions Σ_{j∈𝓜} a_{t,w,j} ≤ r_w(l_t) and Σ_{j∈𝓜} a_{t,c,j} ≤ r_c(l_t), where r_w(l) and r_c(l) denote the maximum data rates of the wireless local area network and the cellular network, respectively, at location l.
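As an illustration of these constraints, the following Python sketch checks whether an action vector is feasible; the per-location rate caps r_c(l) and r_w(l) are assumed lookup tables, not values from the invention.

```python
# Feasibility of a_t = (a_c, a_w) under the constraints above (a sketch with assumed caps).
R_MAX = {"c": [10e6, 8e6, 12e6, 6e6],   # r_c(l): max cellular rate per location, bits/s
         "w": [0.0, 50e6, 0.0, 40e6]}   # r_w(l): max WLAN rate per location (0 = no AP in range)

def feasible(a_c, a_w, b, l):
    one_net = sum(a_c) == 0 or sum(a_w) == 0                          # at most one network at a time
    within_b = all(ac + aw <= bj for ac, aw, bj in zip(a_c, a_w, b))  # a_{t,n,j} <= b_{t,j}
    cap_c = sum(a_c) <= R_MAX["c"][l]                                 # sum_j a_{t,c,j} <= r_c(l)
    cap_w = sum(a_w) <= R_MAX["w"][l]                                 # sum_j a_{t,w,j} <= r_w(l)
    return one_net and within_b and cap_c and cap_w
```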
At each time t, three factors influence the action decision of the mobile user: monetary cost, energy consumption, and penalty.
The monetary cost is the fee paid by the user to the network operator. We assume the network operator employs usage-based pricing, which is widely used in many countries; the unit price of the cellular network operator is p_c, and the wireless local area network is free. The monetary cost c_t(s_t, a_t) is expressed as c_t(s_t, a_t) = p_c · Σ_{j∈𝓜} a_{t,c,j}.
the energy consumption penalty is the energy required to transmit data over a wireless local area network or a cellular network,
Figure BDA0001824501870000091
Figure BDA0001824501870000092
is the energy cost rate in joules or bits when using the cellular network at location l,
Figure BDA0001824501870000093
is the energy cost rate in joules or bits when using the wireless local area network at location i.
Figure BDA0001824501870000094
And
Figure BDA0001824501870000095
both as throughput decreases, lower data transfer speeds consume more energy when transmitting an equal amount of data. The energy consumption of the data transmission of uploading and downloading is different. Thus, energy consumption parameters
Figure BDA0001824501870000096
And
Figure BDA0001824501870000097
should be different for upload and download. In this study we only consider the case of downloading, so we ignore the case of uploading and downloading being different. Although, the algorithm we propose can also be applied to energy consumption in the upload scenario. Theta t Is the energy consumption preference of the mobile user at time t. Theta t Is an energy consumption weight defined by the user. Small theta t Meaning that the user is less interested in energy consumption. For example, if the user can charge the smart machine immediately, he can set θ t At a very small value, or if the user is unable to immediately charge the smart machine in an emergency, the user will give θ t A larger value is assigned. Theta.theta. t =0 means that the user does not consider the energy loss when shunting data. If the data transmission cannot be at the deadline T j The completion of the above-mentioned operation is completed,
Figure BDA0001824501870000098
the penalty is defined by the following equation:
Figure BDA0001824501870000099
g (-) is a non-negative and non-decreasing function. T is j +1 means that the penalty is at the cutoff time T j And (4) post-calculating.
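Combining the monetary cost, the energy consumption cost, and the deadline penalty gives the per-step loss r_t. A minimal Python sketch follows; the unit price p_c, the energy rates e_c(l) and e_w(l), and the linear choice of g(·) are assumptions made for illustration.

```python
P_C = 1e-8                             # p_c: assumed price per bit on the cellular network
E_C = [4e-7, 5e-7, 4e-7, 6e-7]         # e_c(l): assumed joules per bit, cellular, per location
E_W = [1e-7, 1e-7, 2e-7, 1e-7]         # e_w(l): assumed joules per bit, WLAN, per location

def step_cost(a_c, a_w, l, theta_t):
    c_t = P_C * sum(a_c)                                       # monetary cost: p_c * sum_j a_{t,c,j}
    eps_t = theta_t * (E_C[l] * sum(a_c) + E_W[l] * sum(a_w))  # energy cost weighted by theta_t
    return c_t + eps_t                                         # r_t = c_t + eps_t

def penalty(b_after_deadline, k=1e-6):
    # g(.): assumed linear example, non-negative and non-decreasing in the leftover data
    return k * max(0.0, b_after_deadline)
```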
The policy of the mobile user from time t = 0 to t = T_M is defined as π = (φ_0, φ_1, …, φ_{T_M}), where φ_t(l_t, b_t) is the function mapping the state s_t = (l_t, b_t) to the decided action, and Π denotes the set of policies π. If decisions are taken according to π, the resulting states are denoted s_t^π. The aim of the mobile user is to find the optimal policy π* that minimizes the expected total loss from t = 0 to t = T_M, plus the penalties counted at t = T_M + 1: π* = argmin_{π∈Π} E[Σ_{t=0}^{T_M} r_t(s_t^π, a_t) + Σ_{j∈𝓜} g(b_{T_j+1, j})], where r_t(s_t, a_t) = c_t(s_t, a_t) + ε_t(s_t, a_t) is the sum of the total monetary and energy consumption costs. In this problem, the action that is optimal at a single instant does not lead to the optimal overall solution: at each time t, not only the loss at the present time but also the expected later loss must be considered.
The ideal action value function is as follows: Q*(s_t, a_t) = E[r_t(s_t, a_t) + γ min_{a′} Q*(s_{t+1}, a′)], where γ ∈ (0, 1) is a discount factor. The value of the action value function is called the Q value. The optimal strategy can easily be derived from the optimal Q value: at state s_t, the action is computed from the formula a_t = argmin_a Q*(s_t, a).
In the above process, discretizing the state introduces error: the reserved-data state is continuous but is made discrete in the formulation. One way to reduce the error is to discretize the reserved data at a finer granularity, increasing the number of reserved-data states. An excessive state space then requires a two-dimensional table, indexed by state and action, to store the Q values; as the states and data grow, this approach becomes infeasible and the convergence rate becomes too low. If the user experiences many states, the agent cannot generalize its experience to unknown states, and the algorithm is slow to converge.
Therefore, the DQN algorithm is used in the present invention to solve the above problems. A deep neural network is used to generalize the mobile user's experience and predict the Q value for unknown states. Furthermore, the continuous reserved data can be fed into the deep neural network directly, without discretization error.
In DQN, the action value function is approximated by a function Q_t(s, a; θ) with parameter θ. The optimal strategy of the mobile user is obtained from the formula a_t = argmin_a Q_t(s_t, a; θ). A neural network function approximator with weights θ is called a Q-network. The Q-network can be trained by varying the parameter θ_i over iterations i to reduce the mean square error in the Bellman equation, where the ideal target value z = r + γ min_{a′} Q*(s′, a′) is replaced by the approximate target value z_i = r + γ min_{a′} Q(s′, a′; θ⁻), in which θ⁻ are the parameters of a past iteration. The mean square error is expressed by the formula L_i(θ_i) = E[(z_i − Q(s, a; θ_i))²]. By definition, the gradient of the loss function is obtained as ∇_{θ_i} L_i(θ_i) = E[(Q(s, a; θ_i) − z_i) ∇_{θ_i} Q(s, a; θ_i)], which gives the direction of steepest descent of the loss function. The parameters are updated by the rule θ_{i+1} = θ_i − α ∇_{θ_i} L_i(θ_i), where α is the learning rate.
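To make the update rule concrete, here is a minimal NumPy sketch of one gradient step for a linear Q approximator Q(s, a; θ) = θ[a]·s, an assumed simplification of the Q-network used only for illustration.

```python
import numpy as np

def q_update(theta, s, a, z, alpha=1e-3):
    """One step of theta_{i+1} = theta_i - alpha * grad L_i, with Q(s, a; theta) = theta[a] @ s."""
    pred = theta[a] @ s                  # Q(s, a; theta_i)
    grad = (pred - z) * s                # gradient of 0.5 * (z - Q)^2 with respect to theta[a]
    theta[a] -= alpha * grad             # gradient descent with learning rate alpha
    return theta

theta = np.zeros((5, 8))                 # assumed 5 actions, 8 state features
s = np.random.rand(8)
theta = q_update(theta, s, a=2, z=0.7)   # z: target value from the (frozen) past-iteration network
```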
The present invention also provides an electronic device having a memory and a processor, the memory storing a computer program executable by the processor, the computer program, when executed by the processor, implementing the following steps:
S1: initializing a replay memory D with capacity N;
S2: setting a random parameter θ, and initializing an action value function Q with the random parameter θ;
S3: initializing a target action value function Q̂ with a random parameter θ⁻;
S4: acquiring the vector set of all data streams and the position corresponding to each data stream at any time;
S5: calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at any time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t;
S6: calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user; and updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating a target action value function.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims (4)

1. A data distribution method based on action value function learning, applied to a network system, characterized by comprising the following steps:
S0: initializing a replay memory D with capacity N;
S1: setting a random parameter, and initializing an action value function with the random parameter; the random parameter being defined as θ and the action value function as Q;
Sa: initializing a target action value function Q̂ with a random parameter θ⁻;
S2: acquiring the vector set of all data streams and the position corresponding to each data stream at any time;
S3: calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at any time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t; the state of the network system at time t being defined as s_t = {l_t, b_t}, where l_t ∈ 𝓛; setting t = 1, with b_1 the vector of initial file sizes and l_1 random, gives s_1 = (l_1, b_1), where 𝓜 is the vector set of all data streams, b_t is the vector of the reserved file sizes of all data streams at time t, and l_1 is the position of the corresponding data stream at time t = 1; calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t;
S4: calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user; and updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating a target action value function.
2. The data distribution method of claim 1, wherein S3 specifically comprises the following steps:
S31: setting t = 1, with b_1 the vector of initial file sizes and l_1 random, so that s_1 = (l_1, b_1);
S32: while t ≤ T and b_t > 0, randomly selecting a random number rnd in [0, 1] and judging whether rnd < ε; if so, randomly selecting an action from the action vector of the user, otherwise obtaining the action of the user from the formula a_t = argmin_a Q(s_t, a; θ), where Q(s_t, a; θ) is the action value function used in this step in place of the ideal action value function, and a_t is the action vector of the user;
S33: defining s_{t+1} = (l_t, [b_t − a_{t,c} − a_{t,w}]^+), where l_t is the position of the corresponding data stream at time t, a_{t,c} is the vector of data rates allocated to the cellular network, and a_{t,w} is the vector of data rates allocated to the wireless local area network;
S34: calculating the sum of the monetary cost and the energy consumption cost at time t by the formula r_t(s_t, a_t) = c_t(s_t, a_t) + ε_t(s_t, a_t), where r_t(s_t, a_t) is the sum of the monetary and energy consumption costs, c_t(s_t, a_t) is the monetary cost, and ε_t(s_t, a_t) is the energy consumption cost.
3. The data distribution method of claim 2, wherein step S4 specifically comprises:
S41: storing the experience (s_t, a_t, r_t, s_{t+1}) in the memory D;
S42: randomly sampling a transition (s_j, a_j, r_j, s_{j+1}) from the memory D and judging whether the episode terminates at step j + 1; if so, setting z_j = r_j, otherwise setting z_j = r_j + γ min_{a′} Q̂(s_{j+1}, a′; θ⁻) and executing S43; where z_j is the target action value function and r_j is the sum of the monetary cost and the energy consumption cost of data stream j;
S43: performing gradient descent on (z_j − Q_t(s_t, a_t; θ))², setting t := t + 1, and resetting Q̂ = Q.
4. An electronic device having a memory and a processor, the memory storing a computer program executable by the processor, characterized in that the computer program, when executed by the processor, implements the following steps:
S0: initializing a replay memory D with capacity N;
S1: setting a random parameter, and initializing an action value function with the random parameter; the random parameter being defined as θ and the action value function as Q;
Sa: initializing a target action value function Q̂ with a random parameter θ⁻;
S2: acquiring the vector set of all data streams and the position corresponding to each data stream at any time;
S3: calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at any time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t; calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user; the state of the network system at time t being defined as s_t = {l_t, b_t}, where l_t ∈ 𝓛; setting t = 1, with b_1 the vector of initial file sizes and l_1 random, gives s_1 = (l_1, b_1), where 𝓜 is the vector set of all data streams, b_t is the vector of the reserved file sizes of all data streams at time t, and l_1 is the position of the corresponding data stream at time t = 1; calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t;
S4: updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating the target action value function.
CN201811178951.2A 2018-06-22 2018-10-10 Data distribution method based on action value function learning and electronic equipment Active CN109412971B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2018106513150 2018-06-22
CN201810651315 2018-06-22

Publications (2)

Publication Number Publication Date
CN109412971A CN109412971A (en) 2019-03-01
CN109412971B (en) 2023-01-20

Family

ID=65466935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811178951.2A Active CN109412971B (en) 2018-06-22 2018-10-10 Data distribution method based on action value function learning and electronic equipment

Country Status (1)

Country Link
CN (1) CN109412971B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114567356B (en) * 2022-03-08 2023-03-24 中电科思仪科技股份有限公司 MU-MIMO space-time data stream distribution method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108093425A (en) * 2017-12-19 2018-05-29 中山米来机器人科技有限公司 A kind of mobile data shunt method based on markov decision process

Also Published As

Publication number Publication date
CN109412971A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
CN111586696B (en) Resource allocation and unloading decision method based on multi-agent architecture reinforcement learning
CN108920279B (en) Mobile edge computing task unloading method under multi-user scene
Fadlullah et al. HCP: Heterogeneous computing platform for federated learning based collaborative content caching towards 6G networks
CN111414252B (en) Task unloading method based on deep reinforcement learning
Nassar et al. Reinforcement learning for adaptive resource allocation in fog RAN for IoT with heterogeneous latency requirements
CN109947545B (en) Task unloading and migration decision method based on user mobility
CN111182570B (en) User association and edge computing unloading method for improving utility of operator
CN110098969B (en) Fog computing task unloading method for Internet of things
EP2962427B1 (en) Method and system to represent the impact of load variation on service outage over multiple links
CN109802998B (en) Game-based fog network cooperative scheduling excitation method and system
US10819760B2 (en) Method and apparatus for streaming video applications in cellular networks
CN109951869A (en) A kind of car networking resource allocation methods calculated based on cloud and mist mixing
CN112422644B (en) Method and system for unloading computing tasks, electronic device and storage medium
CN107949025A (en) A kind of network selecting method based on non-cooperative game
CN110519849B (en) Communication and computing resource joint allocation method for mobile edge computing
CN104919830A (en) Service preferences for multiple-carrier-enabled devices
JP6724641B2 (en) Management device, communication system, and allocation method
Li et al. DQN-enabled content caching and quantum ant colony-based computation offloading in MEC
CN109412971B (en) Data distribution method based on action value function learning and electronic equipment
Ma et al. Socially aware distributed caching in device-to-device communication networks
WO2011087935A2 (en) Method of controlling resource usage in communication systems
CN110224861A (en) The implementation method of adaptive dynamic heterogeneous network selection policies based on study
CN101448322A (en) Communication apparatus, communication system, and method and program for judging reservation acceptance
Nassar et al. Reinforcement learning-based resource allocation in fog RAN for IoT with heterogeneous latency requirements
US11290917B2 (en) Apparatuses and methods for estimating throughput in accordance with quality of service prioritization and carrier aggregation to facilitate network resource dimensioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190416

Address after: 528400 No. 1 College Road, Shiqi District, Zhongshan City, Guangdong Province

Applicant after: University OF ELECTRONIC SCIENCE AND TECHNOLOGY OF CHINA, ZHONGSHAN INSTITUTE

Address before: 510100 No. 6 Number Trade Building, Xiangxing Road, Zhongshan Torch Development Zone, Guangdong Province, 504 and 505 cards on the 5th floor of South Hebei Province

Applicant before: ZHONGSHAN MILAI ROBOT TECHNOLOGY CO.,LTD.

TA01 Transfer of patent application right

Effective date of registration: 20221212

Address after: 510000 floor 26, No. 268, 270, 272, 274, Sanyuanli Avenue, Baiyun District, Guangzhou, Guangdong

Applicant after: GUANGZHOU CITY ZHILAN E-COMMERCE Co.,Ltd.

Address before: 528400, Xueyuan Road, 1, Shiqi District, Guangdong, Zhongshan

Applicant before: University OF ELECTRONIC SCIENCE AND TECHNOLOGY OF CHINA, ZHONGSHAN INSTITUTE

GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190301

Assignee: Guangzhou Aoxibin Trade Co.,Ltd.

Assignor: GUANGZHOU CITY ZHILAN E-COMMERCE Co.,Ltd.

Contract record no.: X2022980029987

Denomination of invention: Data diversion method and electronic equipment based on action value function learning

License type: Common License

Record date: 20230109

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190301

Assignee: Guangzhou jichaole Intelligent Logistics Co.,Ltd.

Assignor: GUANGZHOU CITY ZHILAN E-COMMERCE Co.,Ltd.

Contract record no.: X2023980031688

Denomination of invention: Data diversion method and electronic equipment based on action value function learning

Granted publication date: 20230120

License type: Common License

Record date: 20230207
