CN112367683A - Network selection method based on improved deep Q learning - Google Patents

Network selection method based on improved deep Q learning Download PDF

Info

Publication number
CN112367683A
Authority
CN
China
Prior art keywords
network
learning
training
deep
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011286673.XA
Other languages
Chinese (zh)
Other versions
CN112367683B (en
Inventor
马彬
陈海波
张超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011286673.XA priority Critical patent/CN112367683B/en
Publication of CN112367683A publication Critical patent/CN112367683A/en
Application granted granted Critical
Publication of CN112367683B publication Critical patent/CN112367683B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W36/00Hand-off or reselection arrangements
    • H04W36/14Reselecting a network or an air interface
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W28/00Network traffic management; Network resource management
    • H04W28/02Traffic management, e.g. flow control or congestion control
    • H04W28/06Optimizing the usage of the radio link, e.g. header compression, information sizing, discarding information

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention claims a network selection method based on improved deep Q learning. In an ultra-dense heterogeneous wireless network with a dormancy mechanism, a network selection algorithm based on improved deep Q learning is proposed to address the degradation of handover performance caused by the increased dynamics of the network. Firstly, a deep Q learning network selection model is constructed according to a dynamic analysis of the network; secondly, the training samples and weights of the offline training module in the deep Q learning network selection model are transferred to the online decision module through transfer learning; finally, the transferred training samples and weights are used to accelerate the training of the neural network and obtain the optimal network selection strategy. Experimental results show that the proposed method significantly alleviates the handover performance degradation of the highly dynamic network caused by the dormancy mechanism, while reducing the time complexity of the traditional deep Q learning algorithm in the online network selection process.

Description

Network selection method based on improved deep Q learning
Technical Field
The invention relates to network selection in ultra-dense heterogeneous wireless networks and belongs to the field of mobile communication, and in particular to a method for network selection using an improved deep Q learning algorithm.
Background
With the development of wireless mobile communication, ultra-dense heterogeneous wireless networks formed by multiple access technologies, such as 5G heterogeneous cellular networks and wireless local area networks, can provide a terminal with multiple access modes and support its seamless movement. Ultra-dense networking, however, brings higher energy consumption; introducing a dormancy mechanism can reduce the energy consumption to a certain extent, but it also further increases the dynamics of the network, which degrades the service quality of the terminal and the throughput performance of the network. How to guarantee the throughput obtained by the terminal in a highly dynamic ultra-dense heterogeneous wireless network and improve the overall handover performance of the network system has therefore become an important subject of current research. In network selection, because artificial intelligence algorithms have strong learning ability and can adapt to the environment, many researchers have in recent years applied them to network selection methods.
The document [Bin MA, Shanru LI, Xiaonzhong XIE. An Adaptive Vertical Handover Network in Heterogeneous Networks [J]. Journal of Electronics and Information Technology, 2019, 41(5):1210-1216] trains classification parameters for different service types based on neural networks and thereby performs network selection. The document [MA B, ZHANG W J, and XIE X Z. Individualization Service Oriented Fuzzy Vertical Handover Algorithm [J]. Journal of Electronics & Information Technology, 2017, 39(6):1284-1290] adopts a fuzzy logic algorithm, designs different membership functions according to the requirements of the terminal application on QoS parameters, and then selects the network reasonably according to the current service type of the terminal. The algorithm is efficient and can select the network quickly, but a corresponding fuzzy inference rule base needs to be established in advance, and when the number of input parameters increases, the size of the fuzzy inference rule base grows rapidly, so the inference time complexity becomes excessive. A fuzzy neural network algorithm is proposed in the document [Nurjahan, Rahman S, Sharma T, et al. PSO-NF based vertical handoff determination for ubiquitous heterogeneous wireless network (UHWN) [C]//2016 International Workshop on Computational Intelligence (IWCI). IEEE, 2016]: the output value of the fuzzy logic is obtained through neural network training, and a network is selected for access according to this output value. The algorithm combines the accuracy of the fuzzy logic algorithm with the adaptive capability of the neural network algorithm, thereby improving robustness. A network selection scheme based on Quality of Experience (QoE) awareness is provided in the document [Jianmei C, Yao W, Yufeng L, et al. QoE-aware Vertical Handoff Scheme over Heterogeneous Access Networks [J]. IEEE Access, 2018:1-1]: QoS network parameters are mapped into QoE parameters, a return function is then constructed from the QoE parameters, and finally a Q learning algorithm is adopted for network selection. The algorithm can reinforce the existing benefits through continuous learning, so that a high-benefit network is selected; however, if the network environment is too complex, the learning effect of the network control module declines and the optimal network cannot be selected. In addition, the above methods all address network selection in conventional heterogeneous wireless networks and do not consider such a highly dynamic network environment; as a result, after the terminal is handed over to the target network by an existing network selection algorithm, the throughput it obtains may drop sharply because the target network suddenly goes dormant, a continuous and stable throughput cannot be provided for the terminal, and the handover performance of the system is seriously degraded. Therefore, network selection methods designed for the traditional heterogeneous wireless network environment cannot effectively guarantee the service quality of the terminal after it accesses the network.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. A network selection method based on improved deep Q learning is provided. The technical scheme of the invention is as follows:
a method for network selection based on improved deep Q learning, comprising the steps of:
101. initializing a deep Q learning network selection model by periodically sampling the values of the ultra-dense heterogeneous wireless network parameters, wherein the network parameter values comprise the sampled received signal strength, throughput and dormancy probability, and constructing the action space, state space and return function of deep Q learning from the network parameter values; the deep Q learning network selection model is composed of an offline training module and an online decision module, the offline training module is used to generate the training samples and weights of the neural network, the online decision module is used to obtain the optimal network selection strategy, and both modules are constructed with a deep Q network;
102. according to the deep Q learning network selection model obtained in step 101, the offline training module and the online decision module interact cooperatively through transfer learning; the neural network training process of the online decision module is accelerated according to the transfer learning algorithm, the training samples of the offline training module are migrated to the online decision module, and the training errors generated by the two modules after migration are corrected through the migrated training samples and weights of the offline training module until the errors approach 0, at which point the whole transfer learning process ends; the optimal strategy is then obtained through the deep Q learning network selection model, thereby completing network selection.
Further, the step 101 of initializing a deep Q learning network selection model and constructing the action space, state space and return function of deep Q learning from the network parameter values specifically includes the steps:
401. The candidate networks that a terminal can access in the ultra-dense heterogeneous wireless network environment, namely the base stations and access points, are represented by the set N = {n_1, n_2, ..., n_i}, where n_i denotes the i-th candidate network; the action of the terminal accessing candidate network n_i at time t is denoted by a_t(n_i), and the action space can then be defined as A_t = {a_t | a_t ∈ {a_t(n_1), a_t(n_2), ..., a_t(n_i)}};
The state space is defined as S_t = (rss_t, c_t, p_t), where rss_t denotes the set of received signal strengths of the candidate networks at time t, c_t denotes the set of throughputs of the candidate networks at time t, and p_t denotes the set of sleep probabilities of the candidate networks at time t;
to maximize the throughput obtained by the terminal, the reward function is defined by considering the throughput and the sleep probability of the network as:
R_t = C_t(n_i) · (1 − P_t(n_i))    (1)
where C_t(n_i) denotes the throughput obtained when the terminal accesses candidate network n_i at time t, and P_t(n_i) denotes the sleep probability of candidate network n_i at time t;
402. The Q function represents the expected cumulative reward obtained by performing action a in state S and then following the subsequent actions, and is defined as:
Q(S, a) = E[ Σ_{t=0}^{+∞} γ^t · R_t | S_0 = S, a_0 = a ]    (2)
where t denotes the time step during operation, γ^t ∈ [0, 1] is the discount factor used to adjust the importance attached to future returns: a value of 0 means that only the short-term return is considered, otherwise the long-term return matters more, and γ^t gradually decreases as t increases; E(·) is the expectation function;
the deep Q learning algorithm utilizes a neural network to construct Q (S, a; theta), wherein theta is a weight value, so that Q (S, a; theta) is approximately equal to max (Q (S, a)) to carry out approximate solution, meanwhile, a target Q value of a target network is utilized to prevent an estimated Q value generated by an estimation network from being out of control, and errors between the two are adjusted through a loss function to relieve the problem of iteration instability in the training process.
Further, the generation and migration of the training samples and weights in step 102 include the following steps:
the training sample of the neural network is composed of the current state, the action, the return value and the future state at different time in the historical information database, namely (S)t,at,Rt,St+1) And in the deep Q network, in order to train the neural network, an experience playback pool is set for storing training samples at multiple moments, the correlation degree between the training samples is reduced by randomly extracting partial samples, the training samples of the offline training module are migrated into the online decision module, and the migrated offline training samples and online learning samples are utilized to construct the experience playback pool of the online decision module, which is expressed as:
D_sum = D_on + ξ · D_off    (3)
where D_sum is the total number of samples stored in the experience replay pool, D_on is the total number of online learning samples (with an initial value of 0), D_off is the total number of offline training samples, and ξ ∈ [0, 1] is the sample migration rate, which gradually decreases as the number of training iterations increases;
After the experience replay pool of the online decision module is constructed, the neural-network weight θ_off obtained by offline training is migrated to the online decision module as the initial weight for neural network training, namely θ_on = θ_off;
Further, after the neural-network weight θ_off obtained by offline training is migrated to the online decision module, the neural network starts iterative training; in the process in which the offline training module and the online decision module cooperate through transfer learning, the training error generated between them is defined as the strategy loss, a strategy simulation mechanism is adopted, and the estimated Q value Q_off(S_t, a_t; θ_off) of the offline training module is used to convert the estimation network of the offline training module into an offline strategy network π_off(S_t, a_t; θ_off);
Similarly, the estimated Q value Q_on(S_t, a_t; θ_on) of the online decision module is used to convert the estimation network of the online decision module into an online strategy network π_on(S_t, a_t; θ_on); the strategy loss between the offline training module and the online decision module is measured by cross entropy.
further, the under-line policy network pioff(St,at;θoff) Expressed as:
π_off(S_t, a_t; θ_off) = exp(Q_off(S_t, a_t; θ_off) / T) / Σ_{a∈A_off} exp(Q_off(S_t, a; θ_off) / T)    (4)
where T denotes the temperature parameter of the Boltzmann distribution: the larger its value, the less the selection of action a_t is affected by the Q value, i.e. all actions are selected with nearly the same probability; A_off is the action space of deep Q learning during offline training;
The online strategy network π_on(S_t, a_t; θ_on) is expressed as:
π_on(S_t, a_t; θ_on) = exp(Q_on(S_t, a_t; θ_on) / T) / Σ_{a∈A_t} exp(Q_on(S_t, a; θ_on) / T)    (5)
the strategy loss between the offline training and the online decision module is measured by cross entropy, and then the strategy simulation loss function is expressed as:
L_policy(θ_on) = − Σ_{a_t∈A_t} π_off(S_t, a_t; θ_off) · log π_on(S_t, a_t; θ_on)    (6)
In the presence of the strategy loss, the gradient update of the estimated Q value Q_on(S_t, a_t; θ_on) of the online decision module is expressed as:
θ_on ← θ_on + α · [Q_π(S_t, a_t; θ_on) − Q_on(S_t, a_t; θ_on)] · ∇θ_on Q_on(S_t, a_t; θ_on)    (7)
where α is the learning rate and Q_π(S_t, a_t; θ_on) denotes the unbiased estimate of the estimated Q value under strategy π;
when Q_π(S_t, a_t; θ_on) ≈ Q_on(S_t, a_t; θ_on), i.e. the strategy loss between the offline training module and the online decision module approaches 0, the transfer learning process ends.
Further, while the terminal is moving, a network selection decision moment occurs when it is about to enter or leave the coverage of a base station, at which point the terminal needs to perform network selection; in order to obtain the network selection decision moment the terminal will face, a prediction is made from the received signal strength of the network and the moving speed of the terminal.
Further, the step of predicting from the received signal strength of the network and the moving speed of the terminal specifically includes: assume that the mobility model of the terminal within the coverage of the base station moves from point A to point C, that point B denotes the position of the terminal after it has moved Δl from point A, and that, according to the current motion trend of the terminal, the network selection decision moment t_C is predicted to occur at point C; the relationship between ΔOAM and ΔOBM is then expressed as:
r² − (Δl + l_BM)² = l_OB² − l_BM²    (8)
where r represents the radius of the network coverage, Δl represents the distance the terminal has moved, and l_BM denotes the current distance of the terminal from the midpoint M of the chord AC, so that
l_BM = (r² − l_OB² − Δl²) / (2 · Δl)
By detecting the received-signal-strength value at point B, the distance l_OB from the base station to point B can be obtained, and the average moving speed of the terminal within the coverage of the base station can be expressed as
V = Δl / Δt, where Δt is the time taken by the terminal to move from point A to point B.
The network selection decision moment t_C is then expressed as:
t_C = t_B + (Δl + 2 · l_BM) / V    (9), where t_B is the moment at which the terminal is located at point B
Suppose that at network selection decision moment t, the candidate network with the maximum Q value is n_m; then the optimal network selection action of the terminal at decision moment t is a_t(n_m). By analogy, the set of optimal network selection actions formed by the terminal at the different network selection decision moments is defined as the optimal strategy π*; the optimal strategy π* means that, in the ultra-dense heterogeneous wireless network environment with the dormancy mechanism introduced, the terminal and the candidate networks achieve the best match at each network selection decision moment.
Further, the deep Q network specifically includes:
First, an estimation network is constructed using a fully-connected neural network. The estimation network Q(S, a_i; θ) is defined as follows:
Q(S, a_i; θ) = f_DNN(S, a_i; θ),  a_i ∈ A    (10)
where f_DNN(·) represents the nonlinear mapping function of the fully-connected neural network, θ represents the weight, and Q(S, a_i; θ) represents the Q value of selecting action a_i when the state space S is input, given the weight θ.
In the process of updating the estimation network Q(S, a_i; θ) by gradient descent, in order to prevent the values produced by the estimation network Q(S, a_i; θ) from running out of control, a target network Q̂(S, a_i; θ⁻) is defined to make the training more stable; the structure of the target network Q̂(S, a_i; θ⁻) is kept consistent with that of the estimation network Q(S, a_i; θ), and the weight θ of the estimation network Q(S, a_i; θ) is assigned to the target-network weight θ⁻, thereby updating Q̂(S, a_i; θ⁻). The difference between the two is gradually reduced by setting a loss function in the updating process; before the loss function is constructed, an experience replay pool D needs to be constructed, which is defined as follows:
D = {(S_1, a_1, R_1, S_2), …, (S_i, a_i, R_i, S_{i+1}), …, (S_m, a_m, R_m, S_{m+1})}    (11)
where m is the maximum capacity of the experience replay pool and (S_i, a_i, R_i, S_{i+1}) denotes the data at the i-th time instant.
The loss function L(θ) is defined from the return value R and the experience replay pool D:
L(θ) = E[ (R + γ · max_{a′} Q̂(S′, a′; θ⁻) − Q(S, a; θ))² ]    (12)
where γ is the discount factor for the long-term return and E[·] is the expectation function.
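Purely as an illustration of how the estimation network, target network, experience replay pool and loss function of equations (10)-(12) fit together, the following sketch assumes a PyTorch implementation; the layer sizes, learning rate, discount factor and pool capacity are illustrative assumptions rather than values specified by the invention.

import random
from collections import deque

import torch
import torch.nn as nn

class EstimationNetwork(nn.Module):
    # f_DNN(S, a_i; theta): maps a state vector to one Q value per candidate network.
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions))

    def forward(self, state):
        return self.net(state)          # Q(S, a_i; theta) for every a_i in A

state_dim, num_actions = 9, 3           # e.g. (rss, c, p) of three candidate networks
q_net = EstimationNetwork(state_dim, num_actions)        # estimation network, weight theta
target_net = EstimationNetwork(state_dim, num_actions)   # target network, weight theta^-
target_net.load_state_dict(q_net.state_dict())           # keep the two networks consistent

replay_pool = deque(maxlen=10000)       # experience replay pool D of equation (11)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.9                             # discount factor

def train_step(batch_size=32):
    # One gradient step on the loss L(theta) of equation (12).
    if len(replay_pool) < batch_size:
        return
    batch = random.sample(list(replay_pool), batch_size)  # random draws lower sample correlation
    S, a, R, S_next = zip(*batch)
    S, S_next = torch.stack(S), torch.stack(S_next)
    a = torch.tensor(a)
    R = torch.tensor(R, dtype=torch.float32)
    q_sa = q_net(S).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(S, a; theta)
    with torch.no_grad():                                 # target Q value is not back-propagated
        target = R + gamma * target_net(S_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)           # L(theta), equation (12)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    # Periodically resync the target network: target_net.load_state_dict(q_net.state_dict())

Transitions would be appended as replay_pool.append((S_t, a_t, R_t, S_next)), with the states stored as float tensors.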
The invention has the following advantages and beneficial effects:
1. the method comprises the steps of carrying out dynamic analysis on a super-dense heterogeneous wireless network environment formed by heterogeneous wireless local area networks and super-dense cellular networks introducing dormancy mechanisms, initializing a deep Q learning network selection model according to the step 101, and obtaining a return function considering network dormancy conditions according to the step 401, so that the possibility of selecting a high-dynamic network by a terminal is greatly reduced, and the problem of reduced system switching performance is effectively solved.
2. The deep Q learning algorithm is improved by adopting transfer learning, a network selection algorithm based on improved deep Q learning is provided, and the training process of the neural network in the online decision module is accelerated by transferring the training samples and the weights in the step 102, so that the time complexity of the traditional deep Q learning algorithm in the online network selection process is reduced.
Drawings
FIG. 1 is a diagram of a simulation scenario for a very dense heterogeneous wireless network according to a preferred embodiment of the present invention;
FIG. 2 is a flow chart of an improved deep Q learning method;
FIG. 3 is a comparison of time complexity for different methods;
FIG. 4 is a comparison of throughput for different methods;
fig. 5 is a comparison of access blocking rates for different methods;
fig. 6 is a comparison of packet loss ratios of different methods;
FIG. 7 is a comparison of call drop rates for different methods;
fig. 8 is a ping-pong effect comparison of different methods.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the method comprehensively considers the situation that the network dynamics is enhanced and the time-varying property of a network topological structure is improved in the ultra-dense heterogeneous wireless network with the sleep mechanism, can remarkably improve the problem of the reduction of the switching performance of the high-dynamics network caused by the sleep mechanism, and simultaneously reduces the time complexity of the traditional deep Q learning algorithm in the online network selection process.
The network selection method provided by the invention comprises the following steps:
step one, initializing a deep Q learning network selection model by periodically sampling values of network parameters, and setting a set N ═ N for candidate networks (base stations and access points) which can be accessed by a terminal in a super-dense heterogeneous wireless network environment1,n2,...,niRepresents; wherein n isiIndicating the ith candidate network, the terminal accesses the candidate network n at the time tiIs denoted by at(ni) Then, the motion space of the present invention can be defined as At={at,at∈{at(n1),at(n2),...,at(ni)}}。
The present invention defines the state space as S_t = (rss_t, c_t, p_t), where rss_t denotes the set of received signal strengths of the candidate networks at time t, c_t denotes the set of throughputs of the candidate networks at time t, and p_t denotes the set of sleep probabilities of the candidate networks at time t.
In order to maximize the throughput obtained by the terminal, the invention defines the return function as follows by considering the throughput and the dormancy probability of the network:
R_t = C_t(n_i) · (1 − P_t(n_i))    (1)
where C_t(n_i) denotes the throughput obtained when the terminal accesses candidate network n_i at time t, and P_t(n_i) denotes the sleep probability of candidate network n_i at time t.
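As a purely illustrative sketch of how the state space S_t = (rss_t, c_t, p_t), the action space A_t and the return of equation (1) can be assembled from the sampled parameters, the following Python fragment uses invented sample values and assumes the reward form C_t(n_i) · (1 − P_t(n_i)) reconstructed above.

candidate_networks = ["n1", "n2", "n3"]        # e.g. 5G macro cell, 5G micro cell, WLAN

rss_t = [-75.0, -62.0, -55.0]    # received signal strength of each candidate network (dBm)
c_t   = [40.0, 25.0, 60.0]       # throughput of each candidate network (Mbps)
p_t   = [0.05, 0.30, 0.60]       # sleep probability of each candidate network

state_t = rss_t + c_t + p_t                          # S_t = (rss_t, c_t, p_t) as one flat vector
action_space = list(range(len(candidate_networks)))  # a_t(n_i): access candidate network n_i

def reward(i):
    # R_t for accessing network n_i, assuming R_t = C_t(n_i) * (1 - P_t(n_i)).
    return c_t[i] * (1.0 - p_t[i])

best = max(action_space, key=reward)
print(candidate_networks[best], reward(best))        # network with the highest immediate return

With these invented numbers the network with the highest raw throughput (n3) is not chosen, because its high sleep probability drags its return below that of n1; this is exactly the behaviour the sleep-aware return is meant to produce.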
The Q function represents the expected cumulative reward obtained by performing action a in state S and then following the subsequent actions, and is defined as:
Q(S, a) = E[ Σ_{t=0}^{+∞} γ^t · R_t | S_0 = S, a_0 = a ]    (2)
where t denotes the time step during operation, γ^t ∈ [0, 1] is the discount factor used to adjust the importance attached to future returns: a value of 0 means that only the short-term return is considered, otherwise the long-term return matters more, and γ^t gradually decreases as t increases; E(·) is the expectation function.
After the Q function is iterated for many times, when all Q values do not change greatly any more, the Q function is converged, and the deep Q learning process is finished. However, Q (S, a) can converge to an optimum Q value only when t → + ∞, and thus it is difficult to realize in an actual network selection process. Thus, the deep Q learning algorithm utilizes a neural network to construct Q (S, a; θ), where θ is a weight such that Q (S, a; θ) is approximately equal to max (Q (S, a)) for the approximate solution. Meanwhile, the target Q value of the target network is utilized to prevent the situation that the estimated Q value generated by the estimation network is out of control, and the error between the estimated Q value and the target Q value is adjusted through a loss function, so that the problem of unstable iteration in the training process is solved.
Step two: the deep Q learning network selection model is divided into an offline training module and an online decision module, both of which are constructed with a deep Q network. The neural network training process of the online decision module is accelerated according to the transfer learning algorithm: the training samples and weights of the offline training module are migrated to the online decision module, and the training errors generated between the two modules after migration are corrected through the migrated offline training samples and weights until the errors approach 0, at which point the whole transfer learning process ends. The training samples and weights are generated and migrated as follows:
the training sample of the neural network is composed of the current state, the action, the return value and the future state at different time in the historical information database, namely (S)t,at,Rt,St+1) Wherein t ∈ (0, + ∞). In the deep Q network, in order to train the neural network efficiently, an experience playback pool is set and used for storing training samples at multiple moments, and the correlation degree between the training samples is reduced by randomly extracting part of samples, so that the problem of unstable iteration occurring in the training process is solved. Therefore, the invention migrates the training samples of the offline training module to the online decision module, and constructs the experience playback pool of the online decision module by using the migrated offline training samples and the online learning samples, which is expressed as:
D_sum = D_on + ξ · D_off    (3)
where D_sum is the total number of samples stored in the experience replay pool, D_on is the total number of online learning samples (with an initial value of 0), D_off is the total number of offline training samples, and ξ ∈ [0, 1] is the sample migration rate, which gradually decreases as the number of training iterations increases.
After the experience replay pool of the online decision module is constructed, the neural-network weight θ_off obtained by offline training is migrated to the online decision module as the initial weight for neural network training, namely θ_on = θ_off.
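A minimal sketch of the sample and weight migration of equation (3) and of θ_on = θ_off is given below; it assumes the replay pools are plain lists of (S_t, a_t, R_t, S_{t+1}) tuples and that the weights are stored in a dictionary of arrays, and the decay schedule for the migration rate ξ is an assumed example, since the invention only states that ξ decreases as the iterations increase.

import random

def build_online_pool(offline_pool, online_pool, xi):
    # Online experience replay pool: D_sum = D_on + xi * D_off (equation (3)).
    n_migrated = int(xi * len(offline_pool))          # xi * D_off samples taken from offline training
    migrated = random.sample(offline_pool, n_migrated)
    return list(online_pool) + migrated               # D_on online samples plus the migrated ones

def migrate_weights(theta_off):
    # Initial online weights: theta_on = theta_off (copied so online training does not alter them).
    return {name: value.copy() for name, value in theta_off.items()}

def xi_schedule(iteration, xi0=1.0, decay=0.995):
    # Assumed decay: the migration rate shrinks as the number of training iterations grows.
    return xi0 * (decay ** iteration)

On each online training round the pool would be rebuilt with xi_schedule(iteration), so the share of offline samples fades out as the online module accumulates its own experience.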
At this point the neural network starts iterative training. However, because the training samples and weights differ between the offline training module and the online decision module, the neural network of the online decision module may train poorly after the training samples and weights are migrated, so its convergence rate cannot reach the expected level. It is therefore necessary to reduce the training error between the offline training module and the online decision module during the migration of the samples and weights, so as to guarantee the training effect of the neural network in the online decision module. To solve this problem, the invention makes the offline training module and the online decision module cooperate through transfer learning: the training error generated between them is defined as the strategy loss, and, in order to minimize the strategy loss, a strategy simulation mechanism is adopted in which the estimated Q value Q_off(S_t, a_t; θ_off) of the offline training module is used to convert the estimation network of the offline training module into an offline strategy network π_off(S_t, a_t; θ_off), expressed as:
π_off(S_t, a_t; θ_off) = exp(Q_off(S_t, a_t; θ_off) / T) / Σ_{a∈A_off} exp(Q_off(S_t, a; θ_off) / T)    (4)
where T denotes the temperature parameter of the Boltzmann distribution: the larger its value, the less the selection of action a_t is affected by the Q value, i.e. all actions are selected with nearly the same probability; A_off is the action space of deep Q learning during offline training.
Similarly, the estimated Q value Q_on(S_t, a_t; θ_on) of the online decision module is used to convert the estimation network of the online decision module into an online strategy network π_on(S_t, a_t; θ_on), expressed as:
π_on(S_t, a_t; θ_on) = exp(Q_on(S_t, a_t; θ_on) / T) / Σ_{a∈A_t} exp(Q_on(S_t, a; θ_on) / T)    (5)
the strategy loss between the offline training and the online decision module is measured by cross entropy, and then the strategy simulation loss function is expressed as:
L_policy(θ_on) = − Σ_{a_t∈A_t} π_off(S_t, a_t; θ_off) · log π_on(S_t, a_t; θ_on)    (6)
under the condition that the strategy loss exists, the on-line decision module predicts the Q value Qon(St,at;θon) The gradient update of (a) is expressed as:
θ_on ← θ_on + α · [Q_π(S_t, a_t; θ_on) − Q_on(S_t, a_t; θ_on)] · ∇θ_on Q_on(S_t, a_t; θ_on)    (7)
where α is the learning rate and Q_π(S_t, a_t; θ_on) represents the unbiased estimate of the estimated Q value under strategy π.
When Q_π(S_t, a_t; θ_on) ≈ Q_on(S_t, a_t; θ_on), i.e. the strategy loss between the offline training module and the online decision module approaches 0, the transfer learning process ends.
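The strategy simulation step of equations (4)-(6) can be sketched as follows: the Q values of both modules are turned into Boltzmann (temperature-softmax) strategies and compared by cross entropy; the temperature and the Q values below are illustrative.

import numpy as np

def boltzmann_strategy(q_values, T=1.0):
    # pi(S_t, a_t; theta) = exp(Q/T) / sum_a exp(Q/T) -- equations (4) and (5).
    z = np.exp((q_values - np.max(q_values)) / T)    # subtract the max for numerical stability
    return z / z.sum()

def strategy_loss(q_off, q_on, T=1.0):
    # Cross entropy between the offline and online strategies -- equation (6).
    pi_off = boltzmann_strategy(q_off, T)
    pi_on = boltzmann_strategy(q_on, T)
    return -np.sum(pi_off * np.log(pi_on + 1e-12))

q_off = np.array([2.0, 1.0, 0.5])    # estimated Q values of the offline training module
q_on  = np.array([1.5, 1.2, 0.4])    # estimated Q values of the online decision module
print(strategy_loss(q_off, q_on))    # decreases as the online strategy approaches the offline one

The loss is minimized when the two strategies coincide, which mirrors the stopping condition above: once the online module reproduces the offline strategy, the migration has served its purpose and online learning continues on its own samples.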
Step three: while the terminal is moving, a network selection decision moment occurs when it is about to enter or leave the coverage of a base station, at which point the terminal needs to perform network selection. In order to obtain the network selection decision moment the terminal will face, a prediction is made from the received signal strength of the network and the moving speed of the terminal. Assume that the mobility model of the terminal within the coverage of the base station moves from point A to point C, that point B denotes the position of the terminal after it has moved Δl from point A, and that, according to the current motion trend of the terminal, the network selection decision moment t_C is predicted to occur at point C; the relationship between ΔOAM and ΔOBM is then expressed as:
r² − (Δl + l_BM)² = l_OB² − l_BM²    (8)
where r represents the radius of the network coverage, Δl represents the distance the terminal has moved, and l_BM denotes the current distance of the terminal from the midpoint M of the chord AC, so that
l_BM = (r² − l_OB² − Δl²) / (2 · Δl)
by detecting the received signal strength value of the B point, the distance l from the base station to the B point can be obtainedOBThe average moving speed of the terminal in the coverage area of the base station can be expressed as
V = Δl / Δt, where Δt is the time taken by the terminal to move from point A to point B.
The network selection decision moment t_C is then expressed as:
t_C = t_B + (Δl + 2 · l_BM) / V    (9), where t_B is the moment at which the terminal is located at point B
suppose that at the time t of the network selection decision, the network corresponding to the maximum Q value in the candidate networks is nmIf the terminal selects the network action best at the decision time t as at(nm) By analogy, the optimal network selection action set formed by the terminal at different network selection decision moments is defined as an optimal strategy pi*The strategy shows that in an ultra-dense heterogeneous wireless network environment with a dormancy mechanism, the terminal and the candidate network realize the best matching at different network selection decision moments.
Based on the above analysis, the present invention designs the algorithm flow chart shown in fig. 2.
In order to verify the invention, a simulation experiment is carried out on the MATLAB platform with the following simulation scenario: a network formed by the two access technologies 5G and WLAN is used as the ultra-dense heterogeneous network model, and the simulation scenario is built on the MATLAB platform for simulation analysis. Assume that 2 5G macro base stations, 4 5G micro base stations and 3 WLAN access points are distributed in the scene; the radii of the 5G macro base stations are all 800 m, the radii of the 5G micro base stations are all 300 m, and the radii of the WLANs are all 80 m. The coverage of the 5G and WLAN networks within the simulation scenario is shown in fig. 1.
In the simulation process, the users in the scene are assumed to be randomly distributed in the simulation area, and their direction of motion changes randomly at intervals. In order to further highlight the superiority of the invention, the proposed method is compared with a Q-learning-based method (Q-Learning) from the literature [Jianmei C, Yao W, Yufeng L, et al. QoE-aware Vertical Handoff Scheme over Heterogeneous Access Networks [J]. IEEE Access, 2018:1-1], a deep-Q-network-based method (Deep Q-Network, DQN) from the literature [Deep Learning Based Handoff Management for Dense WLANs: A Deep Learning Approach. IEEE Access, 2019:1-1], and a long short-term memory network based method (Long Short-Term Memory, LSTM) from the literature [A two-terminal mapping-assisted Learning approach, 2019:1-1].
Time complexity is an important index of a network selection algorithm. The time overhead of the proposed algorithm and of the other three algorithms is compared in FIG. 3, where the four curves respectively represent the time consumption of the proposed algorithm and of the Q-learning, DQN and LSTM algorithms; the time consumed by all four algorithms increases with the number of iterations. However, the time of the algorithm adopted by the invention not only grows markedly more slowly than that of the DQN and LSTM algorithms, it is also lower than that of the Q-learning algorithm, and as the number of iterations increases the four curves spread out in a horn shape, which shows that the gap between the time consumed by the four algorithms widens with the number of iterations; this proves that the time complexity advantage of the algorithm of the invention is very pronounced. The algorithm of the invention improves the traditional deep Q learning algorithm with transfer learning: migrating the training samples of the offline training module improves the learning efficiency of the online decision module, and migrating the neural-network weights of the offline training module reduces the neural network training time of the online decision module, thereby reducing the time consumption of the whole algorithm. For the Q-learning algorithm, when the state and action spaces grow rapidly, its computing capability keeps decreasing, its time consumption gradually increases, and the time gap with the proposed algorithm gradually widens. The DQN and LSTM algorithms directly use a deep neural network for iterative operation, and when the number of iterations is large, the difference in time consumption between the DQN algorithm and the LSTM algorithm becomes even more obvious.
Fig. 4 shows the variation of the network average throughput obtained by the user terminal under the four algorithms as the simulation times increase. By comparing the four curves in the graph, it can be clearly seen that the average throughput of the network obtained by adopting the algorithm of the invention is far higher than that of the other three algorithms. The invention adopts the deep Q learning algorithm to successfully predict the state change condition of the base station caused by the dormancy mechanism in the future, so that the user terminal can reasonably select the network according to the future dynamic change of the network environment, and the loss of the network throughput caused by the dormancy of the base station in the future is reduced to the maximum extent; meanwhile, the return function of the deep Q learning algorithm is defined according to the throughput of the user accessing the candidate network, so that the actual requirements of the user are met better, and more throughput can be brought to the user in a high-dynamic network environment. For DQN and Q-learning algorithms, throughput is not as high as the algorithm of the present invention because both do not fully consider the status of the base station in the future network environment, nor design a suitable reward function for the user to increase network throughput. In the LSTM algorithm, because the algorithm does not specifically design and consider the network throughput obtained by the user in the process of network selection, the network throughput of the algorithm is the lowest among the current four algorithms.
Fig. 5 compares the access blocking rate performance of the four algorithms as the number of users increases. As can be seen from the figure, when the number of users accessing the base station is less than 40, no blocking occurs under any algorithm. When the number of users reaches 40, the LSTM algorithm begins to experience blocking, and when the number of users reaches 50, the DQN algorithm also experiences blocking; the algorithm of the present invention and the Q-learning algorithm do not experience blocking until the number of users reaches 60. As the number of users increases, the blocking rates of the four algorithms all rise; however, the algorithm of the present invention has the lowest blocking rate for the same number of users. The algorithm of the invention considers the dormancy condition of the base station, uses the dormancy probability to judge the state of the base station at future moments, avoids the waste of network resources caused by sudden dormancy of the base station, and increases the effective utilization rate of each network, so that users select networks more reasonably and the access blocking rate is reduced. The LSTM and DQN algorithms cannot accurately predict the future dynamic changes of a base station, and the delay caused by the algorithms themselves is high, so blocking occurs even when the number of users is small. As for the Q-learning algorithm, although its blocking rate is not high when the number of users is small, it does not consider base station dormancy and cannot react in time to the dynamic changes of the base station, so its blocking rate rises rapidly as the number of users gradually increases.
Fig. 6 is a relationship between the average packet loss rate and the number of users in the network under four algorithms. It can be seen from the graph that the average packet loss rate of the algorithm of the present invention is always stable below 10%, and the average packet loss rates of the other three algorithms are all above 15%. Therefore, the packet loss rate generated by the algorithm is far lower than that of the other three algorithms. When network selection is carried out, the algorithm of the invention makes a reasonable return function from the perspective of a user according to the throughput obtained by a user terminal; meanwhile, the network dynamics caused by the dormancy of the base station is considered and successfully predicted, so that a proper network can be selected for a user, the loss of data in the transmission process is reduced, and the data can be continuously transmitted. For Q-learning and DQN algorithms, since they select networks only according to the service requirements of the user terminal, it is not possible to accurately predict future dynamic changes of the network; therefore, when the dynamics of the network continuously increases, the optimal network cannot be selected for the user in time, so that the packet loss rate is high. The LSTM algorithm fails to consider the service requirements of the user terminal and does not accurately predict the dynamic conditions of the future network, and the packet loss rate is the highest among the current four algorithms.
Fig. 7 is a comparison between the drop call rate and the number of users for the four algorithms. It can be seen from the figure that although the call drop rates of the four algorithms are slowly increased, after the number of users is increased to 40, the call drop rate of the Q-learning algorithm is rapidly increased and gradually positioned at the highest position, while the call drop rate increase of the algorithm of the present invention is the smallest, and the increase of the LSTM algorithm and the DQN algorithm is between the two. In the process that the number of users is increased from 10 to 100, the call drop rate of the algorithm is always at the lowest point compared with the other three algorithms. Compared with other three algorithms, the algorithm of the invention can predict the change situation of the future network under the condition that the network dynamics is continuously increased, and then provides the network with higher quality for the user to select, thereby effectively reducing the probability of the switching failure. For the Q-learning algorithm, since the network state cannot be accurately predicted, the dropped call rate is sharply increased when the number of users is increased. Similarly, for the DQN and LSTM algorithms, a result of higher network selection delay may be caused in the process of training the deep neural network; therefore, as the number of users increases, the call drop rate also increases significantly.
Fig. 8 shows the total number of handovers performed by the user using the four algorithms. As can be seen from the figure, when the number of users is 100, the total number of times of network handover of the users under the LSTM algorithm is about 380 times, about 370 times under the Q-learning algorithm, and about 310 times under the DQN algorithm; by adopting the algorithm provided by the invention, the total switching times are only about 230 times. This phenomenon shows that the total switching times of the algorithm of the invention are far lower than those of the other three algorithms; meanwhile, the algorithm of the invention can greatly reduce unnecessary switching and well relieve the ping-pong effect. This is because the present invention considers the situation that the algorithm handover failure rate is increased due to the dynamic enhancement of the network environment, so that frequent handover occurs. By combining the dormancy probability of the base station into the algorithm of the invention, the network state change condition of the user after network selection is successfully predicted, thereby greatly reducing the times of switching. The other three algorithms do not properly solve the problem that the switching frequency of the network and the ping-pong effect are aggravated due to the high dynamic influence of the network caused by the dormancy mechanism of the base station; therefore, compared with the existing algorithm, the algorithm can effectively reduce the unnecessary network switching.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (8)

1. A network selection method based on improved deep Q learning is characterized by comprising the following steps:
101. initializing a deep Q learning network selection model by periodically sampling the values of the ultra-dense heterogeneous wireless network parameters, wherein the network parameter values comprise the sampled received signal strength, throughput and dormancy probability, and constructing the action space, state space and return function of deep Q learning from the network parameter values; the deep Q learning network selection model is composed of an offline training module and an online decision module, the offline training module is used to generate the training samples and weights of the neural network, the online decision module is used to obtain the optimal network selection strategy, and both modules are constructed with a deep Q network;
102. according to the deep Q learning network selection model obtained in step 101, the offline training module and the online decision module interact cooperatively through transfer learning; the neural network training process of the online decision module is accelerated according to the transfer learning algorithm, the training samples of the offline training module are migrated to the online decision module, and the training errors generated by the two modules after migration are corrected through the migrated training samples and weights of the offline training module until the errors approach 0, at which point the whole transfer learning process ends; the optimal strategy is then obtained through the deep Q learning network selection model, thereby completing network selection.
2. The method according to claim 1, wherein the step 101 initializes a deep Q learning network selection model, and constructs an action space, a state space and a reward function of deep Q learning by using network parameter values, and specifically includes the steps of:
401. The candidate networks that a terminal can access in the ultra-dense heterogeneous wireless network environment, namely the base stations and access points, are represented by the set N = {n_1, n_2, ..., n_i}, where n_i denotes the i-th candidate network; the action of the terminal accessing candidate network n_i at time t is denoted by a_t(n_i), and the action space can then be defined as A_t = {a_t | a_t ∈ {a_t(n_1), a_t(n_2), ..., a_t(n_i)}};
The state space is defined as S_t = (rss_t, c_t, p_t), where rss_t denotes the set of received signal strengths of the candidate networks at time t, c_t denotes the set of throughputs of the candidate networks at time t, and p_t denotes the set of sleep probabilities of the candidate networks at time t;
to maximize the throughput obtained by the terminal, the reward function is defined by considering the throughput and the sleep probability of the network as:
R_t = C_t(n_i) · (1 − P_t(n_i))    (1)
where C_t(n_i) denotes the throughput obtained when the terminal accesses candidate network n_i at time t, and P_t(n_i) denotes the sleep probability of candidate network n_i at time t;
402. The Q function represents the expected cumulative reward obtained by performing action a in state S and then following the subsequent actions, and is defined as:
Q(S, a) = E[ Σ_{t=0}^{+∞} γ^t · R_t | S_0 = S, a_0 = a ]    (2)
where t denotes the time step during operation, γ^t ∈ [0, 1] is the discount factor used to adjust the importance attached to future returns: a value of 0 means that only the short-term return is considered, otherwise the long-term return matters more, and γ^t gradually decreases as t increases; E(·) is the expectation function;
the deep Q learning algorithm utilizes a neural network to construct Q (S, a; theta), wherein theta is a weight value, so that Q (S, a; theta) is approximately equal to max (Q (S, a)) to carry out approximate solution, meanwhile, a target Q value of a target network is utilized to prevent an estimated Q value generated by an estimation network from being out of control, and errors between the two are adjusted through a loss function to relieve the problem of iteration instability in the training process.
3. The method for network selection based on improved deep Q learning of claim 1, wherein the training samples and weights in step 102 are generated and migrated as follows:
the training sample of the neural network is composed of the current state, the action, the return value and the future state at different time in the historical information database, namely (S)t,at,Rt,St+1) And in the deep Q network, in order to train the neural network, an experience playback pool is set for storing training samples at multiple moments, the correlation degree between the training samples is reduced by randomly extracting partial samples, the training samples of the offline training module are migrated into the online decision module, and the migrated offline training samples and online learning samples are utilized to construct the experience playback pool of the online decision module, which is expressed as:
D_sum = D_on + ξ · D_off    (3)
where D_sum is the total number of samples stored in the experience replay pool, D_on is the total number of online learning samples (with an initial value of 0), D_off is the total number of offline training samples, and ξ ∈ [0, 1] is the sample migration rate, which gradually decreases as the number of training iterations increases;
After the experience replay pool of the online decision module is constructed, the neural-network weight θ_off obtained by offline training is migrated to the online decision module as the initial weight for neural network training, namely θ_on = θ_off;
4. The method of claim 3, wherein, after the neural-network weight θ_off obtained by offline training is migrated to the online decision module, the neural network starts iterative training; in the process in which the offline training module and the online decision module cooperate through transfer learning, the training error generated between them is defined as the strategy loss, a strategy simulation mechanism is adopted, and the estimated Q value Q_off(S_t, a_t; θ_off) of the offline training module is used to convert the estimation network of the offline training module into an offline strategy network π_off(S_t, a_t; θ_off);
Similarly, the estimated Q value Q_on(S_t, a_t; θ_on) of the online decision module is used to convert the estimation network of the online decision module into an online strategy network π_on(S_t, a_t; θ_on); the strategy loss between the offline training module and the online decision module is measured by cross entropy.
5. The method as claimed in claim 4, wherein the offline strategy network π_off(S_t, a_t; θ_off) is expressed as:
π_off(S_t, a_t; θ_off) = exp(Q_off(S_t, a_t; θ_off) / T) / Σ_{a∈A_off} exp(Q_off(S_t, a; θ_off) / T)    (4)
where T denotes the temperature parameter of the Boltzmann distribution: the larger its value, the less the selection of action a_t is affected by the Q value, i.e. all actions are selected with nearly the same probability; A_off is the action space of deep Q learning during offline training;
The online strategy network π_on(S_t, a_t; θ_on) is expressed as:
π_on(S_t, a_t; θ_on) = exp(Q_on(S_t, a_t; θ_on) / T) / Σ_{a∈A_t} exp(Q_on(S_t, a; θ_on) / T)    (5)
the strategy loss between the offline training and the online decision module is measured by cross entropy, and then the strategy simulation loss function is expressed as:
L_policy(θ_on) = − Σ_{a_t∈A_t} π_off(S_t, a_t; θ_off) · log π_on(S_t, a_t; θ_on)    (6)
In the presence of the strategy loss, the gradient update of the estimated Q value Q_on(S_t, a_t; θ_on) of the online decision module is expressed as:
θ_on ← θ_on + α · [Q_π(S_t, a_t; θ_on) − Q_on(S_t, a_t; θ_on)] · ∇θ_on Q_on(S_t, a_t; θ_on)    (7)
where α is the learning rate and Q_π(S_t, a_t; θ_on) denotes the unbiased estimate of the estimated Q value under strategy π;
when Q_π(S_t, a_t; θ_on) ≈ Q_on(S_t, a_t; θ_on), i.e. the strategy loss between the offline training module and the online decision module approaches 0, the transfer learning process ends.
6. The method as claimed in claim 4, wherein, while the terminal is moving, a network selection decision moment occurs when it is about to enter or leave the coverage of a base station, at which point the terminal needs to perform network selection; in order to obtain the network selection decision moment the terminal will face, a prediction is made according to the received signal strength of the network and the moving speed of the terminal.
7. The method of claim 6, wherein the step of predicting according to the received signal strength of the network and the moving speed of the terminal specifically comprises: assume that the mobility model of the terminal within the coverage of the base station moves from point A to point C, that point B denotes the position of the terminal after it has moved Δl from point A, and that, according to the current motion trend of the terminal, the network selection decision moment t_C is predicted to occur at point C; the relationship between ΔOAM and ΔOBM is then expressed as:
r² − (Δl + l_BM)² = l_OB² − l_BM²    (8)
where r represents the radius of the network coverage, Δl represents the distance the terminal has moved, and l_BM denotes the current distance of the terminal from the midpoint M of the chord AC, so that
l_BM = (r² − l_OB² − Δl²) / (2 · Δl)
By detecting the received-signal-strength value at point B, the distance l_OB from the base station to point B can be obtained; finally, the average moving speed of the terminal within the coverage of the base station can be represented as V, and the network selection decision moment t_C is expressed as:
t_C = t_B + (Δl + 2 · l_BM) / V    (9), where t_B is the moment at which the terminal is located at point B
Suppose that at network selection decision moment t, the candidate network with the maximum Q value is n_m; then the optimal network selection action of the terminal at decision moment t is a_t(n_m). By analogy, the set of optimal network selection actions formed by the terminal at the different network selection decision moments is defined as the optimal strategy π*; the optimal strategy π* means that, in the ultra-dense heterogeneous wireless network environment with the dormancy mechanism introduced, the terminal and the candidate networks achieve the best match at each network selection decision moment.
8. The method according to claim 7, wherein the deep Q network specifically comprises:
first, an evaluation network is constructed using a fully-connected neural network; the evaluation network Q(S, a_i; θ) is defined as follows:
Q(S, a_i; θ) = f_DNN(S, a_i; θ),  a_i ∈ A    (10)
wherein f_DNN(·) represents the nonlinear mapping function of the fully-connected neural network, θ represents the weights, and Q(S, a_i; θ) represents the Q value of selecting action a_i when the state space S is input under the weights θ.
In the process of updating the evaluation network Q(S_{i+1}, a_i; θ) by gradient descent, in order to prevent the values generated by the evaluation network Q(S, a_i; θ) from running away, a target network Q̂(S, a_i; θ⁻) is defined to make the training more stable; the structure of the target network Q̂(S, a_i; θ⁻) is consistent with that of the evaluation network Q(S, a_i; θ), and at the same time the weights θ of the evaluation network Q(S, a_i; θ) are assigned to the weights θ⁻ of the target network, thereby updating Q̂(S, a_i; θ⁻). The difference between the two is gradually reduced by setting a loss function in the updating process; before the loss function is constructed, an experience replay pool D needs to be constructed, which is defined as follows:
D = {(S_1, a_1, R_1, S_2), …, (S_i, a_i, R_i, S_{i+1}), …, (S_m, a_m, R_m, S_{m+1})}    (11)
wherein m is the maximum capacity of the experience replay pool, and (S_i, a_i, R_i, S_{i+1}) represents the data at the i-th time instant.
The loss function L(θ) is defined by the reward value R and the experience replay pool D:
L(θ) = E_{(S_i, a_i, R_i, S_{i+1}) ∼ D} [ (R_i + γ · max_{a′} Q̂(S_{i+1}, a′; θ⁻) − Q(S_i, a_i; θ))² ]
wherein γ is the discount factor for the long-term return value, and E[·] is the expectation function.
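For claim 8, the following Python (PyTorch) sketch is a minimal, non-authoritative rendering of the deep Q network structure described above: a fully-connected evaluation network, a target network with an identical structure whose weights θ⁻ are copied from θ, an experience replay pool D, and the loss L(θ). The layer sizes, learning rate, pool capacity, and the randomly generated transitions are assumptions made only for the example.

import random
from collections import deque

import torch
import torch.nn as nn

class EvaluationNetwork(nn.Module):
    # Fully-connected network f_DNN mapping a state to the Q values of all actions.
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

state_dim, n_actions = 8, 4            # assumed sizes of the network state and action space
gamma = 0.9                            # discount factor for the long-term return value
q_eval = EvaluationNetwork(state_dim, n_actions)
q_target = EvaluationNetwork(state_dim, n_actions)
q_target.load_state_dict(q_eval.state_dict())    # target network starts consistent with q_eval
optimizer = torch.optim.Adam(q_eval.parameters(), lr=1e-3)
replay_pool = deque(maxlen=10000)      # experience replay pool D with maximum capacity m

def train_step(batch_size=32):
    # One gradient-descent update of the evaluation network on a sampled minibatch.
    if len(replay_pool) < batch_size:
        return
    batch = random.sample(list(replay_pool), batch_size)
    s, a, r, s_next = zip(*batch)
    s = torch.stack(s)
    a = torch.tensor(a)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.stack(s_next)
    q_sa = q_eval(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(S_i, a_i; theta)
    with torch.no_grad():
        target = r + gamma * q_target(s_next).max(dim=1).values    # R_i + gamma * max_a' Q_hat
    loss = nn.functional.mse_loss(q_sa, target)                    # L(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # Copy the evaluation weights theta into theta^- of the target network.
    q_target.load_state_dict(q_eval.state_dict())

# Fill the pool with random transitions (stand-ins for real (S_i, a_i, R_i, S_{i+1}) data).
for _ in range(64):
    replay_pool.append((torch.randn(state_dim), random.randrange(n_actions),
                        random.random(), torch.randn(state_dim)))
train_step()
sync_target()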
CN202011286673.XA 2020-11-17 2020-11-17 Network selection method based on improved deep Q learning Active CN112367683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011286673.XA CN112367683B (en) 2020-11-17 2020-11-17 Network selection method based on improved deep Q learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011286673.XA CN112367683B (en) 2020-11-17 2020-11-17 Network selection method based on improved deep Q learning

Publications (2)

Publication Number Publication Date
CN112367683A true CN112367683A (en) 2021-02-12
CN112367683B CN112367683B (en) 2022-07-01

Family

ID=74515167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011286673.XA Active CN112367683B (en) 2020-11-17 2020-11-17 Network selection method based on improved deep Q learning

Country Status (1)

Country Link
CN (1) CN112367683B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102647773A (en) * 2012-05-02 2012-08-22 哈尔滨工业大学 Method for controlling, optimizing and selecting of heterogeneous network access based on Q-learning
CN103327556A (en) * 2013-07-04 2013-09-25 中国人民解放军理工大学通信工程学院 Dynamic network selection method for optimizing quality of experience (QoE) of user in heterogeneous wireless network
US20180376390A1 (en) * 2017-06-22 2018-12-27 At&T Intellectual Property I, L.P. Mobility management for wireless communication networks
CN108848561A (en) * 2018-04-11 2018-11-20 湖北工业大学 A kind of isomery cellular network combined optimization method based on deeply study
WO2019231289A1 (en) * 2018-06-01 2019-12-05 Samsung Electronics Co., Ltd. Method and apparatus for machine learning based wide beam optimization in cellular network
CN109068350A (en) * 2018-08-15 2018-12-21 西安电子科技大学 A kind of autonomous network selection system and method for the terminal of Wireless Heterogeneous Networks
CN110809306A (en) * 2019-11-04 2020-02-18 电子科技大学 Terminal access selection method based on deep reinforcement learning
CN111083767A (en) * 2019-12-23 2020-04-28 哈尔滨工业大学 Heterogeneous network selection method based on deep reinforcement learning
CN111586809A (en) * 2020-04-08 2020-08-25 西安邮电大学 Heterogeneous wireless network access selection method and system based on SDN

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
JIANI SUN et al.: "ES-DQN-Based Vertical Handoff Algorithm for Heterogeneous Wireless Networks", IEEE WIRELESS COMMUNICATIONS LETTERS, 28 April 2020 (2020-04-28) *
YIDING YU et al.: "Deep-Reinforcement Learning Multiple Access for Heterogeneous Wireless Networks", IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, 12 March 2019 (2019-03-12) *
FENG CHENWEI et al.: "A Network Access Control Algorithm Based on Q-Learning", Computer Engineering, 14 December 2015 (2015-12-14) *
TAN JUNJIE et al.: "Deep Reinforcement Learning Methods for Intelligent Communication", Journal of University of Electronic Science and Technology of China, no. 02, 30 March 2020 (2020-03-30) *
CHEN QIANBIN et al.: "Adaptive Radio Resource Allocation Algorithm for Heterogeneous Cloud Radio Access Networks Based on Deep Reinforcement Learning", Journal of Electronics & Information Technology, no. 06, 15 June 2020 (2020-06-15) *
MA BIN et al.: "A Fuzzy Vertical Handover Algorithm for Terminal-Oriented Personalized Services", Journal of Electronics & Information Technology, no. 06 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966968B (en) * 2021-03-26 2022-08-30 平安科技(深圳)有限公司 List distribution method based on artificial intelligence and related equipment
CN112966968A (en) * 2021-03-26 2021-06-15 平安科技(深圳)有限公司 List distribution method based on artificial intelligence and related equipment
CN113382412A (en) * 2021-05-12 2021-09-10 重庆邮电大学 Network selection method considering terminal security in super-dense heterogeneous network
CN113382412B (en) * 2021-05-12 2022-12-27 重庆邮电大学 Network selection method considering terminal security in super-dense heterogeneous network
CN113242584B (en) * 2021-06-22 2022-03-22 重庆邮电大学 Network selection method based on neural network in ultra-dense heterogeneous wireless network
CN113242584A (en) * 2021-06-22 2021-08-10 重庆邮电大学 Network selection method based on neural network in ultra-dense heterogeneous wireless network
CN113472484B (en) * 2021-06-29 2022-08-05 哈尔滨工业大学 Internet of things equipment user feature code identification method based on cross entropy iterative learning
CN113472484A (en) * 2021-06-29 2021-10-01 哈尔滨工业大学 Internet of things terminal equipment user feature code identification method based on cross entropy iterative learning
CN113613301B (en) * 2021-08-04 2022-05-13 北京航空航天大学 Air-ground integrated network intelligent switching method based on DQN
CN113613301A (en) * 2021-08-04 2021-11-05 北京航空航天大学 Air-space-ground integrated network intelligent switching method based on DQN
CN114021987A (en) * 2021-11-08 2022-02-08 深圳供电局有限公司 Microgrid energy scheduling strategy determination method, device, equipment and storage medium
CN114125962A (en) * 2021-11-10 2022-03-01 国网江苏省电力有限公司电力科学研究院 Self-adaptive network switching method, system and storage medium
CN114125962B (en) * 2021-11-10 2024-06-11 国网江苏省电力有限公司电力科学研究院 Self-adaptive network switching method, system and storage medium
CN117749625A (en) * 2023-12-27 2024-03-22 融鼎岳(北京)科技有限公司 Network performance optimization system and method based on deep Q network

Also Published As

Publication number Publication date
CN112367683B (en) 2022-07-01

Similar Documents

Publication Publication Date Title
CN112367683B (en) Network selection method based on improved deep Q learning
Fadlullah et al. HCP: Heterogeneous computing platform for federated learning based collaborative content caching towards 6G networks
Wei et al. Joint optimization of caching, computing, and radio resources for fog-enabled IoT using natural actor–critic deep reinforcement learning
CN113543176B (en) Unloading decision method of mobile edge computing system based on intelligent reflecting surface assistance
CN111405569A (en) Calculation unloading and resource allocation method and device based on deep reinforcement learning
Han et al. Artificial intelligence-based handoff management for dense WLANs: A deep reinforcement learning approach
CN114390057B (en) Multi-interface self-adaptive data unloading method based on reinforcement learning under MEC environment
Huang et al. An overview of intelligent wireless communications using deep reinforcement learning
CN112672402B (en) Access selection method based on network recommendation in ultra-dense heterogeneous wireless network
CN113242584B (en) Network selection method based on neural network in ultra-dense heterogeneous wireless network
CN115065678A (en) Multi-intelligent-device task unloading decision method based on deep reinforcement learning
Chua et al. Resource allocation for mobile metaverse with the Internet of Vehicles over 6G wireless communications: A deep reinforcement learning approach
Lei et al. Learning-based resource allocation: Efficient content delivery enabled by convolutional neural network
Kaur et al. An efficient handover mechanism for 5G networks using hybridization of LSTM and SVM
Abubakar et al. A lightweight cell switching and traffic offloading scheme for energy optimization in ultra-dense heterogeneous networks
Jo et al. Deep reinforcement learning‐based joint optimization of computation offloading and resource allocation in F‐RAN
CN114615730A (en) Content coverage oriented power distribution method for backhaul limited dense wireless network
Zhao et al. Reinforced-lstm trajectory prediction-driven dynamic service migration: A case study
Kiran 5G heterogeneous network (HetNets): a self-optimization technique for vertical handover management
Iqbal et al. Convolutional neural network-based deep Q-network (CNN-DQN) resource management in cloud radio access network
Cicioğlu et al. Handover management in software‐defined 5G small cell networks via long short‐term memory
Ye et al. Performance analysis of mobility prediction based proactive wireless caching
Zhao et al. C-LSTM: CNN and LSTM Based Offloading Prediction Model in Mobile Edge Computing (MEC)
CN112492645B (en) Collaborative vertical switching method based on heterogeneous edge cloud in UHWNs
Nithya et al. Artificial Intelligence on Mobile Multimedia Networks for Call Admission Control Systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant