CN109412971B - Data distribution method based on action value function learning and electronic equipment - Google Patents

Data distribution method based on action value function learning and electronic equipment

Info

Publication number
CN109412971B
CN109412971B (application CN201811178951.2A; publication of application CN109412971A)
Authority
CN
China
Prior art keywords
time
cost
vector
user
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811178951.2A
Other languages
Chinese (zh)
Other versions
CN109412971A (en)
Inventor
张�成
张险峰
陈庆武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou City Zhilan E Commerce Co ltd
Original Assignee
Guangzhou City Zhilan E Commerce Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou City Zhilan E Commerce Co ltd filed Critical Guangzhou City Zhilan E Commerce Co ltd
Publication of CN109412971A publication Critical patent/CN109412971A/en
Application granted granted Critical
Publication of CN109412971B publication Critical patent/CN109412971B/en
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/10 Flow control; Congestion control
    • H04L 47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a data distribution method based on action value function learning, which comprises the following steps: acquiring the vector set of all data streams and the position corresponding to each data stream at any time; calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t; calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user; and updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating the target action value function. Through reinforcement learning, the invention selects the minimum-cost policy, thereby realizing data offloading and reducing overhead.

Description

Data distribution method based on action value function learning and electronic equipment
Technical Field
The present invention relates to data distribution technologies, and in particular, to a data distribution method and an electronic device based on action value function learning.
Background
Chinese patent CN102821424A discloses a method, a communication device, and a mobile device for assisting mobile data offloading, in which, in a mobile communication mode, a signal is verified by an auxiliary communication device, a first communication link is constructed, and data offloading is performed on the first communication link. Although offloading can be realized in this way, the construction is overly complex and energy-consuming. US patent US20110317571A1 discloses a method and apparatus for data offloading, which monitors the network environment by deploying a plurality of devices, compares the data usage of the devices in the network, and selects whether to offload to other networks. However, this method is impractical and incurs a large overhead. US patent US20120230191A1 discloses a method and system for data offloading in mobile communications, which forms a data offloading system from basic devices and a data offloading controller, and monitors data exchange through the basic devices to determine whether data offloading is required, thereby controlling the signal sent by the data offloading controller. This method does not consider energy consumption or efficiency, and it is highly complex and impractical. Most existing patents consider data offloading from the perspective of network operators, with policies aimed at the quality of service (QoS) of mobile users. Existing strategies considered from the mobile user's perspective place high demands on the energy consumption of the whole system and on network cost, and their efficiency gains are not significant.
Disclosure of Invention
In order to overcome the defects of the prior art, one object of the present invention is to provide a data offloading method based on action value function learning, which solves the problem that the prior art cannot offload data well.
Another object of the present invention is to provide an electronic device, which solves the same problem.
The first object of the invention is achieved by the following technical scheme:
A data distribution method based on action value function learning is applied to a network system and comprises the following steps:
S1: setting a random parameter, and initializing an action value function with the random parameter;
S2: acquiring the vector set of all data streams and the position corresponding to each data stream at any time;
S3: calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at any time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t;
S4: calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user; and updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating a target action value function.
Preferably, the method further comprises, before S1, S0: initializing a replay memory D with capacity N.
Preferably, the random parameter in S1 is defined as θ and the action value function as Q, and the following step is further included between S1 and S2: Sa: initializing a target action value function Q̂ with a random parameter θ⁻.
Preferably, in S3, the state of the network system at time t is defined as s_t = {l_t, b_t}, where l_t ∈ 𝓛 is the position at time t and b_t is the vector of the reserved file sizes of all data streams at time t. Setting t = 1, with b_1 the vector of initial file sizes and l_1 random, gives s_1 = (l_1, b_1), where 𝓜 is the vector set of all data streams and l_1 is the position of the corresponding data stream at time t = 1. The monetary cost and the energy consumption cost at time t are then calculated according to the state of the network system at time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t.
Preferably, S3 specifically comprises the following steps:
S31: set t = 1, with b_1 the vector of initial file sizes and l_1 random, so that s_1 = (l_1, b_1);
S32: while t ≤ T and b_t > 0, randomly select a random number rnd in [0, 1] and judge whether rnd < ε; if so, randomly select an action from the action vector of the user, otherwise obtain the action of the user from the formula a_t = argmin_a Q(s_t, a; θ), where Q(s_t, a; θ) is the action value function used in this step in place of the ideal action value function, and a_t is the action vector of the user;
S33: define s_{t+1} = (l_t, [b_t − a_{t,c} − a_{t,w}]^+), where l_t is the position of the corresponding data stream at time t, a_{t,c} is the vector of data rates allocated to the cellular network, and a_{t,w} is the vector of data rates allocated to the wireless local area network;
S34: calculate the sum of the monetary cost and the energy consumption cost at time t by the formula r_t(s_t, a_t) = c_t(s_t, a_t) + ε_t(s_t, a_t), where r_t(s_t, a_t) is the sum of the monetary and energy consumption costs, c_t(s_t, a_t) is the monetary cost, and ε_t(s_t, a_t) is the energy consumption cost.
Preferably, step S4 specifically comprises:
S41: store the experience (s_t, a_t, r_t, s_{t+1}) in the memory D;
S42: randomly sample a transition (s_j, a_j, r_j, s_{j+1}) from the memory D and judge whether the episode terminates at step j + 1; if so, set z_j = r_j, otherwise set z_j = r_j + γ min_{a′} Q̂(s_{j+1}, a′; θ⁻) and execute S43; where z_j is the target action value function and r_j is the sum of the monetary cost and the energy consumption cost of data stream j;
S43: perform gradient descent on (z_j − Q_t(s_t, a_t; θ))², set t := t + 1, and reset Q̂ = Q.
The second object of the invention is achieved by the following technical scheme:
An electronic device having a memory and a processor, the memory storing a computer program executable by the processor, the computer program, when executed by the processor, implementing the following steps:
S1: setting a random parameter, and initializing an action value function with the random parameter;
S2: acquiring the vector set of all data streams and the position corresponding to each data stream at any time;
S3: calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at any time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t; calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user;
S4: updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating the target action value function.
Preferably, the steps further comprise, before S1, S0: initializing a replay memory D with capacity N.
Preferably, the random parameter in S1 is defined as θ and the action value function as Q, and the following step is further included between S1 and S2: Sa: initializing a target action value function Q̂ with a random parameter θ⁻.
Compared with the prior art, the invention has the following beneficial effects:
through reinforcement learning, the user can respond in time in an unknown environment and select the minimum-cost policy, thereby realizing data offloading and reducing the overall overhead of the system.
Drawings
FIG. 1 is a flowchart of the data distribution method based on action value function learning according to the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and the detailed description below:
As shown in fig. 1, the present invention provides a data offloading method based on action value function learning, which comprises the following steps:
S1: initializing a replay memory D with capacity N;
S2: setting a random parameter θ, and initializing an action value function Q with the random parameter θ;
S3: initializing a target action value function Q̂ with a random parameter θ⁻;
S4: acquiring the vector set of all data streams and the position corresponding to each data stream at any time;
S5: calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at any time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t;
S6: calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user; and updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating the target action value function.
For S5, the state of the network system at time t is defined as s_t = {l_t, b_t}, where l_t ∈ 𝓛 is the position at time t and b_t is the vector of the reserved file sizes of all data streams at time t. Setting t = 1, with b_1 the vector of initial file sizes and l_1 random, gives s_1 = (l_1, b_1), where 𝓜 is the vector set of all data streams and l_1 is the position of the corresponding data stream at time t = 1. The monetary cost and the energy consumption cost at time t are then calculated according to the state of the network system at time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t.
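To make the state definition concrete, the following is a minimal Python sketch of the state initialization of S5; the stream count, location count, and initial file sizes are illustrative assumptions, not values taken from the invention.

```python
import random

# Illustrative sizes (assumptions): M = 3 data streams, L = 4 possible locations.
M, L = 3, 4
initial_sizes = [5.0e6, 2.0e6, 8.0e6]   # b_1: reserved file size of each stream, in bits

def initial_state():
    """t = 1, b_1 = vector of initial file sizes, l_1 random -> s_1 = (l_1, b_1)."""
    l1 = random.randrange(L)             # l_1 drawn at random from the location set
    b1 = list(initial_sizes)             # b_1: reserved file sizes of all streams at t = 1
    return (l1, b1)                      # s_1 = (l_1, b_1)

s1 = initial_state()
```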
Step S6 specifically comprises the following steps:
S61: store the experience (s_t, a_t, r_t, s_{t+1}) in the memory D;
S62: randomly sample a transition (s_j, a_j, r_j, s_{j+1}) from the memory D and judge whether the episode terminates at step j + 1; if so, set z_j = r_j, otherwise set z_j = r_j + γ min_{a′} Q̂(s_{j+1}, a′; θ⁻) and execute S63; where z_j is the target action value function and r_j is the sum of the monetary cost and the energy consumption cost of data stream j;
S63: perform gradient descent on (z_j − Q_t(s_t, a_t; θ))², set t := t + 1, and reset Q̂ = Q.
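Steps S1 to S6 together form one DQN training loop. Below is a minimal sketch of that loop in Python with PyTorch. The network shape, the 8-feature encoding of s_t = (l_t, b_t), the discrete action set, and the env_step() transition function are all assumptions made for illustration; action selection uses argmin because the Q value here estimates an expected cost to be minimized.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Assumed problem sizes: an 8-feature encoding of s_t = (l_t, b_t) and 5 discrete actions.
STATE_DIM, N_ACTIONS = 8, 5
GAMMA, EPS, ALPHA = 0.9, 0.1, 1e-3    # discount factor, exploration probability, learning rate

def make_qnet():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q = make_qnet()                        # S2: action value function Q with random parameter theta
q_hat = make_qnet()                    # S3/Sa: target action value function Q_hat
q_hat.load_state_dict(q.state_dict())  # theta_minus := theta
opt = torch.optim.SGD(q.parameters(), lr=ALPHA)
D = deque(maxlen=10_000)               # S1: replay memory D with capacity N

def train_step(s, env_step):
    # S32: epsilon-greedy selection; argmin because the Q value estimates an expected cost
    if random.random() < EPS:
        a = random.randrange(N_ACTIONS)
    else:
        a = int(q(torch.as_tensor(s)).argmin())
    s_next, r, done = env_step(s, a)   # S33/S34: next state and cost r_t = c_t + eps_t (assumed env)
    D.append((s, a, r, s_next, done))  # S61: store the experience in D

    ss, aa, rr, sn, dd = map(list, zip(*random.sample(D, min(32, len(D)))))  # S62: sample from D
    ss, sn = torch.as_tensor(ss), torch.as_tensor(sn)
    not_done = 1.0 - torch.as_tensor(dd, dtype=torch.float32)
    # z_j = r_j if the episode terminated, else r_j + gamma * min_a' Q_hat(s_{j+1}, a'; theta_minus)
    z = torch.as_tensor(rr) + GAMMA * q_hat(sn).min(dim=1).values * not_done
    pred = q(ss)[torch.arange(len(aa)), torch.as_tensor(aa)]
    loss = ((z.detach() - pred) ** 2).mean()   # S63: (z_j - Q(s_j, a_j; theta))^2
    opt.zero_grad()
    loss.backward()
    opt.step()                                 # gradient descent on theta
    q_hat.load_state_dict(q.state_dict())      # S63: reset Q_hat = Q (often done only every C steps)
    return s_next, done
```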
Due to the wide coverage of cellular networks, it is assumed that mobile users can always access the cellular network but cannot always access a wireless local area network. Access points of wireless local area networks are usually deployed at home, at stations, in supermarkets, and the like; we therefore assume that wireless local area network access depends on geographic location. We are mainly concerned with applications of large data size and high delay tolerance. The mobile user has M files to download from a remote server, each file constituting a stream, and the collection of data streams is denoted 𝓜 = {1, 2, …, M}. Each stream j ∈ 𝓜 has a time limit T_j, and T = (T_1, T_2, …, T_M) is the vector of deadlines of the mobile user's M streams. Without loss of generality, it is assumed that T_1 ≤ T_2 ≤ … ≤ T_M. We track time in discrete slots t ∈ {1, 2, …, T_M}. To simplify the formulation, we replace the infinitely many continuous positions with finitely many discrete positions: suppose the user can reach L possible locations, expressed as the set 𝓛 = {1, 2, …, L}. While the cellular network is reachable at all locations, the availability of the wireless local area network depends on the location l ∈ 𝓛.
By weighing the total price, energy consumption, and data transmission time, the user decides, at time t and position l, which network to select and how to allocate data rates to the M data streams. The user's decision process resembles a finite-horizon Markov decision process. We define the state of the system at time t as s_t = {l_t, b_t}. Here l_t ∈ 𝓛 is the position of the mobile user at time t, which can be obtained from GPS, and 𝓛 is the set of locations. b_t = (b_{t,1}, b_{t,2}, …, b_{t,M}) is the vector of the reserved file sizes of all M data streams at time t: for every stream j ∈ 𝓜, b_{t,j} is the total reserved (not yet downloaded) data size of stream j, and 𝓑 denotes the set of such reserved-data vectors.
The action a_t of the mobile user at decision time t is to select the wireless local area network or the cellular network for transmitting data, or to remain idle, and to decide how to allocate the network data rate to the M data streams. The action vector of the mobile user is thus expressed as a_t = (a_{t,c}, a_{t,w}), where a_{t,c} = (a_{t,c,1}, …, a_{t,c,M}) is the vector of data rates allocated on the cellular network, a_{t,c,j} is the cellular data rate assigned to stream j ∈ 𝓜, and a_{t,w} = (a_{t,w,1}, …, a_{t,w,M}) is the vector of data rates allocated on the wireless local area network. The symbols c and w denote the cellular network and the wireless local area network, respectively. If the user is not within range of a wireless local area network access point, all components of a_{t,w} are 0.
We assume that the user transmits over at most one network at a time, for two reasons: (1) by using only one network, the user equipment can run longer on its remaining battery; (2) today's smartphones can use only one type of network at a time. Under this assumption, the algorithm can be applied to existing devices without changing hardware or operating systems. When the mobile user selects the wireless local area network, a data rate a_{t,w,j} ≥ 0 is assigned to each stream j and the cellular network is unused, a_{t,c} = 0; when the mobile user selects the cellular network, a data rate a_{t,c,j} ≥ 0 is assigned to each stream j and the wireless local area network is unused, a_{t,w} = 0. The allocated rate a_{t,n,j}, n ∈ {c, w}, should not be larger than the reserved file size b_{t,j}.
The total data rates over all flows on the wireless local area network and the cellular network are Σ_{j∈𝓜} a_{t,w,j} and Σ_{j∈𝓜} a_{t,c,j}, respectively. a_{t,w} and a_{t,c} must satisfy the conditions Σ_{j∈𝓜} a_{t,w,j} ≤ r_w(l_t) and Σ_{j∈𝓜} a_{t,c,j} ≤ r_c(l_t), where r_w(l) and r_c(l) denote the maximum data rates of the wireless local area network and the cellular network, respectively, at location l.
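As an illustration of these constraints, the following Python sketch checks whether an action vector is feasible; the per-location rate caps r_c(l) and r_w(l) are assumed lookup tables, not values from the invention.

```python
# Feasibility of a_t = (a_c, a_w) under the constraints above (a sketch with assumed caps).
R_MAX = {"c": [10e6, 8e6, 12e6, 6e6],   # r_c(l): max cellular rate per location, bits/s
         "w": [0.0, 50e6, 0.0, 40e6]}   # r_w(l): max WLAN rate per location (0 = no AP in range)

def feasible(a_c, a_w, b, l):
    one_net = sum(a_c) == 0 or sum(a_w) == 0                          # at most one network at a time
    within_b = all(ac + aw <= bj for ac, aw, bj in zip(a_c, a_w, b))  # a_{t,n,j} <= b_{t,j}
    cap_c = sum(a_c) <= R_MAX["c"][l]                                 # sum_j a_{t,c,j} <= r_c(l)
    cap_w = sum(a_w) <= R_MAX["w"][l]                                 # sum_j a_{t,w,j} <= r_w(l)
    return one_net and within_b and cap_c and cap_w
```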
At each time t, three factors influence the action decision of the mobile user: monetary cost, energy consumption, and penalty.
The monetary cost is the fee paid by the user to the network operator. We assume the network operator employs usage-based pricing, which is widely used in many countries; the unit price of the cellular network operator is p_c, and the wireless local area network is free. The monetary cost c_t(s_t, a_t) is expressed as c_t(s_t, a_t) = p_c · Σ_{j∈𝓜} a_{t,c,j}.
the energy consumption penalty is the energy required to transmit data over a wireless local area network or a cellular network,
Figure BDA0001824501870000091
Figure BDA0001824501870000092
is the energy cost rate in joules or bits when using the cellular network at location l,
Figure BDA0001824501870000093
is the energy cost rate in joules or bits when using the wireless local area network at location i.
Figure BDA0001824501870000094
And
Figure BDA0001824501870000095
both as throughput decreases, lower data transfer speeds consume more energy when transmitting an equal amount of data. The energy consumption of the data transmission of uploading and downloading is different. Thus, energy consumption parameters
Figure BDA0001824501870000096
And
Figure BDA0001824501870000097
should be different for upload and download. In this study we only consider the case of downloading, so we ignore the case of uploading and downloading being different. Although, the algorithm we propose can also be applied to energy consumption in the upload scenario. Theta t Is the energy consumption preference of the mobile user at time t. Theta t Is an energy consumption weight defined by the user. Small theta t Meaning that the user is less interested in energy consumption. For example, if the user can charge the smart machine immediately, he can set θ t At a very small value, or if the user is unable to immediately charge the smart machine in an emergency, the user will give θ t A larger value is assigned. Theta.theta. t =0 means that the user does not consider the energy loss when shunting data. If the data transmission cannot be at the deadline T j The completion of the above-mentioned operation is completed,
Figure BDA0001824501870000098
the penalty is defined by the following equation:
Figure BDA0001824501870000099
g (-) is a non-negative and non-decreasing function. T is j +1 means that the penalty is at the cutoff time T j And (4) post-calculating.
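Combining the monetary cost, the energy consumption cost, and the deadline penalty gives the per-step loss r_t. A minimal Python sketch follows; the unit price p_c, the energy rates e_c(l) and e_w(l), and the linear choice of g(·) are assumptions made for illustration.

```python
P_C = 1e-8                             # p_c: assumed price per bit on the cellular network
E_C = [4e-7, 5e-7, 4e-7, 6e-7]         # e_c(l): assumed joules per bit, cellular, per location
E_W = [1e-7, 1e-7, 2e-7, 1e-7]         # e_w(l): assumed joules per bit, WLAN, per location

def step_cost(a_c, a_w, l, theta_t):
    c_t = P_C * sum(a_c)                                       # monetary cost: p_c * sum_j a_{t,c,j}
    eps_t = theta_t * (E_C[l] * sum(a_c) + E_W[l] * sum(a_w))  # energy cost weighted by theta_t
    return c_t + eps_t                                         # r_t = c_t + eps_t

def penalty(b_after_deadline, k=1e-6):
    # g(.): assumed linear example, non-negative and non-decreasing in the leftover data
    return k * max(0.0, b_after_deadline)
```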
The policy of the mobile user from time t = 0 to t = T_M is defined as π = (φ_0, φ_1, …, φ_{T_M}), where φ_t(l_t, b_t) is the function mapping the state s_t = (l_t, b_t) to the decided action, and Π denotes the set of policies π. If decisions are taken according to π, the resulting states are denoted s_t^π. The aim of the mobile user is to find the optimal policy π* that minimizes the expected total loss from t = 0 to t = T_M, plus the penalties counted at t = T_M + 1: π* = argmin_{π∈Π} E[Σ_{t=0}^{T_M} r_t(s_t^π, a_t) + Σ_{j∈𝓜} g(b_{T_j+1, j})], where r_t(s_t, a_t) = c_t(s_t, a_t) + ε_t(s_t, a_t) is the sum of the total monetary and energy consumption costs. In this problem, the action that is optimal at a single instant does not lead to the optimal overall solution: at each time t, not only the loss at the present time but also the expected later loss must be considered.
The ideal action value function is as follows: Q*(s_t, a_t) = E[r_t(s_t, a_t) + γ min_{a′} Q*(s_{t+1}, a′)], where γ ∈ (0, 1) is a discount factor. The value of the action value function is called the Q value. The optimal strategy can easily be derived from the optimal Q value: at state s_t, the action is computed from the formula a_t = argmin_a Q*(s_t, a).
In the above process, discretizing the state introduces error: the reserved-data state is continuous but is made discrete in the formulation. One way to reduce the error is to discretize the reserved data at a finer granularity, increasing the number of reserved-data states. An excessive state space then requires a two-dimensional table, indexed by state and action, to store the Q values; as the states and data grow, this approach becomes infeasible and the convergence rate becomes too low. If the user experiences many states, the agent cannot generalize its experience to unknown states, and the algorithm is slow to converge.
Therefore, the DQN algorithm is used in the present invention to solve the above problems. A deep neural network is used to generalize the mobile user's experience and predict the Q value for unknown states. Furthermore, the continuous reserved data can be fed into the deep neural network directly, without discretization error.
In DQN, the action value function is approximated by a function Q_t(s, a; θ) with parameter θ. The optimal strategy of the mobile user is obtained from the formula a_t = argmin_a Q_t(s_t, a; θ). A neural network function approximator with weights θ is called a Q-network. The Q-network can be trained by varying the parameter θ_i over iterations i to reduce the mean square error in the Bellman equation, where the ideal target value z = r + γ min_{a′} Q*(s′, a′) is replaced by the approximate target value z_i = r + γ min_{a′} Q(s′, a′; θ⁻), in which θ⁻ are the parameters of a past iteration. The mean square error is expressed by the formula L_i(θ_i) = E[(z_i − Q(s, a; θ_i))²]. By definition, the gradient of the loss function is obtained as ∇_{θ_i} L_i(θ_i) = E[(Q(s, a; θ_i) − z_i) ∇_{θ_i} Q(s, a; θ_i)], which gives the direction of steepest descent of the loss function. The parameters are updated by the rule θ_{i+1} = θ_i − α ∇_{θ_i} L_i(θ_i), where α is the learning rate.
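To make the update rule concrete, here is a minimal NumPy sketch of one gradient step for a linear Q approximator Q(s, a; θ) = θ[a]·s, an assumed simplification of the Q-network used only for illustration.

```python
import numpy as np

def q_update(theta, s, a, z, alpha=1e-3):
    """One step of theta_{i+1} = theta_i - alpha * grad L_i, with Q(s, a; theta) = theta[a] @ s."""
    pred = theta[a] @ s                  # Q(s, a; theta_i)
    grad = (pred - z) * s                # gradient of 0.5 * (z - Q)^2 with respect to theta[a]
    theta[a] -= alpha * grad             # gradient descent with learning rate alpha
    return theta

theta = np.zeros((5, 8))                 # assumed 5 actions, 8 state features
s = np.random.rand(8)
theta = q_update(theta, s, a=2, z=0.7)   # z: target value from the (frozen) past-iteration network
```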
The present invention also provides an electronic device having a memory and a processor, the memory storing a computer program executable by the processor, the computer program, when executed by the processor, implementing the following steps:
S1: initializing a replay memory D with capacity N;
S2: setting a random parameter θ, and initializing an action value function Q with the random parameter θ;
S3: initializing a target action value function Q̂ with a random parameter θ⁻;
S4: acquiring the vector set of all data streams and the position corresponding to each data stream at any time;
S5: calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at any time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t;
S6: calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user; and updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating a target action value function.
Various other modifications and changes may be made by those skilled in the art based on the above-described technical solutions and concepts, and all such modifications and changes should fall within the scope of the claims of the present invention.

Claims (4)

1. A data distribution method based on action value function learning, applied to a network system, characterized by comprising the following steps:
S0: initializing a replay memory D with capacity N;
S1: setting a random parameter, and initializing an action value function with the random parameter; the random parameter being defined as θ and the action value function as Q;
Sa: initializing a target action value function Q̂ with a random parameter θ⁻;
S2: acquiring the vector set of all data streams and the position corresponding to each data stream at any time;
S3: calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at any time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t; the state of the network system at time t being defined as s_t = {l_t, b_t}, where l_t ∈ 𝓛; setting t = 1, with b_1 the vector of initial file sizes and l_1 random, gives s_1 = (l_1, b_1), where 𝓜 is the vector set of all data streams, b_t is the vector of the reserved file sizes of all data streams at time t, and l_1 is the position of the corresponding data stream at time t = 1; calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t;
S4: calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user; and updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating a target action value function.
2. The data distribution method of claim 1, wherein S3 specifically comprises the following steps:
S31: setting t = 1, with b_1 the vector of initial file sizes and l_1 random, so that s_1 = (l_1, b_1);
S32: while t ≤ T and b_t > 0, randomly selecting a random number rnd in [0, 1] and judging whether rnd < ε; if so, randomly selecting an action from the action vector of the user, otherwise obtaining the action of the user from the formula a_t = argmin_a Q(s_t, a; θ), where Q(s_t, a; θ) is the action value function used in this step in place of the ideal action value function, and a_t is the action vector of the user;
S33: defining s_{t+1} = (l_t, [b_t − a_{t,c} − a_{t,w}]^+), where l_t is the position of the corresponding data stream at time t, a_{t,c} is the vector of data rates allocated to the cellular network, and a_{t,w} is the vector of data rates allocated to the wireless local area network;
S34: calculating the sum of the monetary cost and the energy consumption cost at time t by the formula r_t(s_t, a_t) = c_t(s_t, a_t) + ε_t(s_t, a_t), where r_t(s_t, a_t) is the sum of the monetary and energy consumption costs, c_t(s_t, a_t) is the monetary cost, and ε_t(s_t, a_t) is the energy consumption cost.
3. The data distribution method of claim 2, wherein step S4 specifically comprises:
S41: storing the experience (s_t, a_t, r_t, s_{t+1}) in the memory D;
S42: randomly sampling a transition (s_j, a_j, r_j, s_{j+1}) from the memory D and judging whether the episode terminates at step j + 1; if so, setting z_j = r_j, otherwise setting z_j = r_j + γ min_{a′} Q̂(s_{j+1}, a′; θ⁻) and executing S43; where z_j is the target action value function and r_j is the sum of the monetary cost and the energy consumption cost of data stream j;
S43: performing gradient descent on (z_j − Q_t(s_t, a_t; θ))², setting t := t + 1, and resetting Q̂ = Q.
4. An electronic device having a memory and a processor, the memory storing a computer program executable by the processor, characterized in that the computer program, when executed by the processor, implements the following steps:
S0: initializing a replay memory D with capacity N;
S1: setting a random parameter, and initializing an action value function with the random parameter; the random parameter being defined as θ and the action value function as Q;
Sa: initializing a target action value function Q̂ with a random parameter θ⁻;
S2: acquiring the vector set of all data streams and the position corresponding to each data stream at any time;
S3: calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at any time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t; calculating an ideal action value function according to the monetary cost, the energy consumption of the user at time t, and the action vector of the user; the state of the network system at time t being defined as s_t = {l_t, b_t}, where l_t ∈ 𝓛; setting t = 1, with b_1 the vector of initial file sizes and l_1 random, gives s_1 = (l_1, b_1), where 𝓜 is the vector set of all data streams, b_t is the vector of the reserved file sizes of all data streams at time t, and l_1 is the position of the corresponding data stream at time t = 1; calculating the monetary cost and the energy consumption cost at time t according to the state of the network system at time t, the action vector of the user, the vector of the reserved file sizes of all data streams at time t, any data stream in the vector set of all data streams, and the energy consumption of the user at time t;
S4: updating the state of the network system, recalculating the monetary cost and the energy consumption, storing the currently updated state of the network system, and calculating the target action value function.
CN201811178951.2A 2018-06-22 2018-10-10 Data distribution method based on action value function learning and electronic equipment Active CN109412971B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2018106513150 2018-06-22
CN201810651315 2018-06-22

Publications (2)

Publication Number Publication Date
CN109412971A CN109412971A (en) 2019-03-01
CN109412971B (en) 2023-01-20

Family

ID=65466935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811178951.2A Active CN109412971B (en) 2018-06-22 2018-10-10 Data distribution method based on action value function learning and electronic equipment

Country Status (1)

Country Link
CN (1) CN109412971B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114567356B (en) * 2022-03-08 2023-03-24 中电科思仪科技股份有限公司 MU-MIMO space-time data stream distribution method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108093425A (en) * 2017-12-19 2018-05-29 中山米来机器人科技有限公司 A kind of mobile data shunt method based on markov decision process

Also Published As

Publication number Publication date
CN109412971A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
CN111586696B (en) Resource allocation and unloading decision method based on multi-agent architecture reinforcement learning
CN108920279B (en) Mobile edge computing task unloading method under multi-user scene
Fadlullah et al. HCP: Heterogeneous computing platform for federated learning based collaborative content caching towards 6G networks
CN111414252B (en) Task unloading method based on deep reinforcement learning
Nassar et al. Reinforcement learning for adaptive resource allocation in fog RAN for IoT with heterogeneous latency requirements
CN109947545B (en) Task unloading and migration decision method based on user mobility
CN111182570B (en) User association and edge computing unloading method for improving utility of operator
CN110098969B (en) Fog computing task unloading method for Internet of things
EP2962427B1 (en) Method and system to represent the impact of load variation on service outage over multiple links
CN109802998B (en) Game-based fog network cooperative scheduling excitation method and system
US10819760B2 (en) Method and apparatus for streaming video applications in cellular networks
CN109951869A (en) A kind of car networking resource allocation methods calculated based on cloud and mist mixing
CN112422644B (en) Method and system for unloading computing tasks, electronic device and storage medium
CN107949025A (en) A kind of network selecting method based on non-cooperative game
CN110519849B (en) Communication and computing resource joint allocation method for mobile edge computing
CN104919830A (en) Service preferences for multiple-carrier-enabled devices
JP6724641B2 (en) Management device, communication system, and allocation method
Li et al. DQN-enabled content caching and quantum ant colony-based computation offloading in MEC
CN109412971B (en) Data distribution method based on action value function learning and electronic equipment
Ma et al. Socially aware distributed caching in device-to-device communication networks
WO2011087935A2 (en) Method of controlling resource usage in communication systems
CN110224861A (en) The implementation method of adaptive dynamic heterogeneous network selection policies based on study
CN101448322A (en) Communication apparatus, communication system, and method and program for judging reservation acceptance
Nassar et al. Reinforcement learning-based resource allocation in fog RAN for IoT with heterogeneous latency requirements
US11290917B2 (en) Apparatuses and methods for estimating throughput in accordance with quality of service prioritization and carrier aggregation to facilitate network resource dimensioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190416

Address after: 528400 No. 1 College Road, Shiqi District, Zhongshan City, Guangdong Province

Applicant after: University OF ELECTRONIC SCIENCE AND TECHNOLOGY OF CHINA, ZHONGSHAN INSTITUTE

Address before: 510100 No. 6 Number Trade Building, Xiangxing Road, Zhongshan Torch Development Zone, Guangdong Province, 504 and 505 cards on the 5th floor of South Hebei Province

Applicant before: ZHONGSHAN MILAI ROBOT TECHNOLOGY CO.,LTD.

TA01 Transfer of patent application right

Effective date of registration: 20221212

Address after: 510000 floor 26, No. 268, 270, 272, 274, Sanyuanli Avenue, Baiyun District, Guangzhou, Guangdong

Applicant after: GUANGZHOU CITY ZHILAN E-COMMERCE Co.,Ltd.

Address before: 528400, Xueyuan Road, 1, Shiqi District, Guangdong, Zhongshan

Applicant before: University OF ELECTRONIC SCIENCE AND TECHNOLOGY OF CHINA, ZHONGSHAN INSTITUTE

GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190301

Assignee: Guangzhou Aoxibin Trade Co.,Ltd.

Assignor: GUANGZHOU CITY ZHILAN E-COMMERCE Co.,Ltd.

Contract record no.: X2022980029987

Denomination of invention: Data diversion method and electronic equipment based on action value function learning

License type: Common License

Record date: 20230109

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190301

Assignee: Guangzhou jichaole Intelligent Logistics Co.,Ltd.

Assignor: GUANGZHOU CITY ZHILAN E-COMMERCE Co.,Ltd.

Contract record no.: X2023980031688

Denomination of invention: Data diversion method and electronic equipment based on action value function learning

Granted publication date: 20230120

License type: Common License

Record date: 20230207
