CN113055229B - Wireless network self-selection protocol method based on DDQN


Info

Publication number
CN113055229B
Authority
CN
China
Prior art keywords
network
state
value
service
mainnet
Prior art date
Legal status
Active
Application number
CN202110249773.3A
Other languages
Chinese (zh)
Other versions
CN113055229A (en)
Inventor
严海蓉
王重阳
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110249773.3A priority Critical patent/CN113055229B/en
Publication of CN113055229A publication Critical patent/CN113055229A/en
Application granted granted Critical
Publication of CN113055229B publication Critical patent/CN113055229B/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/02 Arrangements for optimising operational condition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W36/00 Hand-off or reselection arrangements
    • H04W36/0005 Control or signalling for completing the hand-off
    • H04W36/0055 Transmission or use of information for re-establishing the radio link
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W36/00 Hand-off or reselection arrangements
    • H04W36/0005 Control or signalling for completing the hand-off
    • H04W36/0083 Determination of parameters used for hand-off, e.g. generation or modification of neighbour cell lists
    • H04W36/00837 Determination of triggering parameters for hand-off
    • H04W36/008375 Determination of triggering parameters for hand-off based on historical data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W36/00 Hand-off or reselection arrangements
    • H04W36/24 Reselection being triggered by specific parameters
    • H04W36/30 Reselection being triggered by specific parameters by measured or perceived connection quality data


Abstract

The invention relates to a DDQN-based wireless network self-selection protocol method aimed at current wireless network environments, which are complex and integrate multiple protocols. The method comprises the following steps: 1) an environment agent module acquires the current network environment quality parameters and determines the node service types in real time; 2) on the basis of step 1), the data are denoised and normalized, the node service type is determined through the analytic hierarchy process, and features are extracted; 3) on the basis of step 2), the data are input into a DDQN decision network for real-time training, and the execution result is applied so that the network state tends toward stability. The method performs feature extraction directly on the data without preprocessing, uses the accumulated historical data as training data, and exploits the strength of deep learning, thereby effectively improving the learning speed and decision performance of the reinforcement learning algorithm.

Description

Wireless network self-selection protocol method based on DDQN
Technical Field
The invention relates to a network protocol self-selection method for heterogeneous wireless networks, aimed at current wireless network environments that are complex and integrate multiple protocols.
Background
With the continuous development of network technology, the networking technologies in wide use today overlap extensively. In the current network environment, WLAN and cellular networks are the most common heterogeneous combination and play an important role in modern information communication; to relieve the load on the cellular network, operators deploy their own WLAN hotspots in user-dense areas such as shopping malls, schools and office buildings.
The next-generation heterogeneous network integrates multiple protocols in a complex environment and must provide reliable network service to users anytime and anywhere. Before this can be achieved, however, the network environment needs to mature: wireless network coverage, network self-configuration, automatic management of network devices and similar functions remain to be solved. In the existing network environment it is difficult to realize such a configuration with a single network protocol, but comprehensive scheduling of the resources of the current heterogeneous network can be realized by suitable algorithms, and efficient switching that exploits heterogeneous network resources will gradually become a research hotspot. With the further development of wireless communication, certain requirements will also be placed on the scalability and flexibility of heterogeneous networks.
Reinforcement learning is a tool that can make decisions meeting application requirements in a non-deterministic environment and can adapt in a targeted manner to the dynamic changes of the network, so that a heterogeneous wireless network can automatically adapt to changes in the user's scenario and the network environment is optimized. Reinforcement learning is a branch of machine learning in which an Agent continually adjusts its behaviour in an Environment so as to maximize a particular index (the reward). In a wireless network, node mobility and mutual interference between nodes make the network environment complex; compared with traditional machine learning algorithms, deep reinforcement learning has greater potential and higher accuracy: features are extracted directly from the data without separate preprocessing, the accumulated historical data serve as training data, and the strength of deep learning effectively improves the learning speed and decision performance of the reinforcement learning algorithm.
Disclosure of Invention
Aiming at the characteristics of the existing network, the invention provides a wireless network self-selection protocol method based on DDQN (Double Deep Q-Network, i.e. deep reinforcement learning with double Q-learning). It comprises: a processing scheme for network quality data; a feature-extraction scheme based on deep learning; and a network protocol selection scheme based on DDQN. The aim of the invention is achieved by the following technical solution.
A method of wireless network self-selection protocol based on DDQN, the method comprising the steps of:
1) Acquiring current network environment quality parameters and determining node service types in real time through an environment agent module;
2) Noise reduction and normalization are carried out on the data on the basis of the step 1), the node service type is determined through an analytic hierarchy process, and feature extraction is carried out;
3) Based on the step 2), data is input into a DDQN decision network for real-time training, and an execution result is applied to enable the network state to tend to be stable.
1. A method of wireless network self-selection protocol based on DDQN, comprising the steps of:
the first step: the environment agent module acquires the current network environment quality parameters and the node service type in real time, and the state, action and reward values are determined;
state space definition: the state space S of a terminal at time t is defined such that s_{mn} ∈ S represents the state in which terminal m has accessed the n-th network and is exchanging information in that network; the state space is as follows:
S = {s_1, s_2, …, s_{mn}}    (1)
state definition: using the average throughput T, the delay D, the signal strength P, the node distance W to describe the network state, the network quality Φ is expressed as:
Φ=T×D×P×W (2)
defining an action space: an action space is required to be set for the intelligent agent to select, and the definition of the action space is as follows:
A = {a_1, a_2, …, a_n}    (3)
where a_n indicates that a given node uses the n-th network protocol;
the access service network parameters consist of QoS parameters, a decision matrix is established for the network QoS, and the parameter weights are solved:
in the decision matrix, each element m_{ij} represents the importance of QoS parameter i relative to parameter j, with the importance scale defined in Table 1, and the decision matrix should satisfy m_{ij} > 0, m_{ji} = 1/m_{ij} and m_{ii} = 1;
the even values 2, 4, 6 and 8, not shown in the table, represent intermediate grades between adjacent judgments; since the service types are divided into 4 classes when defining the reward value, and the three attributes throughput, delay and signal strength are considered, each decision matrix is defined as a 3×3 matrix, i.e. M_i ∈ R^{3×3}, where i = 1, 2, 3, 4 denote the class-1, class-2, class-3 and class-4 service types respectively; decision matrices are then established for the four services according to the QoS parameter requirements of the different services;
according to RFC 2474, the current standard for dividing network service types, the attribute values within a service class are determined through the DSCP; the DSCP occupies 6 bits of the Type of Service (TOS) byte of each packet's IP header, with the remaining 2 bits unused, and its coded value determines the IP priority; the IP priority field can be used for traffic classification, the larger the value the higher the priority; the values range from 0 to 63, so 64 grades can be matched, the grades are grouped into classes by grade size, and the relation between the service attributes and the parameters can be determined from the DSCP field carried in the IP packets;
for the four types of services, i takes the values 1, 2, 3 and 4 in turn; the eigenvector corresponding to the maximum eigenvalue is normalized, i.e. scaled so that its components sum to 1, and each value in the normalized eigenvector is the weight of the corresponding network QoS parameter; in the above four cases the network parameter requirements of the different service types differ, and these differences later affect the division of the reward value weights; considering the entire network as a whole, the final goal is to optimize the overall network quality through the protocols selected by the nodes, the reward value being a function strongly correlated with the network;
V_t = {v_1, v_2, …, v_n}    (5)
where V_t represents the state information of the network at time t and is a subset of the network state space Φ; therefore, for a particular service B, the reward function R over the network space state V_t is expressed as follows and is solved in the next step:
R = f_B(V_t)    (6)
the access of the nodes influences the variation of the network parameters, and after an action is executed the network state needs to be measured and the corresponding reward fed back; when the executed action leads to an increase in network throughput, a decrease in delay and an increase in signal strength, it is an effective action; conversely, when the executed action causes network throughput to decrease, delay to increase and signal strength to decrease, the action is ineffective; the average throughput α_avg, the average delay β_avg and the signal strength γ are therefore taken into account when calculating the reward;
and a second step of: carrying out normalization processing on the data on the basis of the step 1), determining the node service type and determining the rewarding function;
using min–max normalization to eliminate the influence of the differing units of the data:
x' = (x − x_min)/(x_max − x_min)    (7)
normalizing with this equation gives the normalized network average throughput f_t(α)_avg, average delay f_t(β)_avg and signal strength f_t(γ);
combining the above gives the reward function:
R = ω_1 f_t(α)_avg + ω_2 f_t(β)_avg + ω_3 f_t(γ)    (8)
where ω_1, ω_2 and ω_3 are the weights of the network average throughput, delay and signal strength, given by the eigenvector obtained from the normalized decision matrix;
and a third step of: inputting data into a DDQN decision network for real-time training on the basis of the step 2), and applying an execution result to enable the network state to tend to be stable;
firstly, the state S and the action space A are initialized, the Q matrix is initialized to a zero matrix, and the Q-MainNet and Q-target networks are initialized with a random parameter θ, where θ denotes the network parameters: the Q-MainNet θ is set randomly at initialization and the Q-target parameter is set to θ⁻ = 0; t denotes the current time state, and the agent module reads the current network state information S_t and inputs it into the Q-MainNet network, which outputs the Q values of the different actions in state S_t; according to the ε-greedy strategy, the Q-MainNet network either selects a random action a_t ∈ A with probability ε, or selects the action a_t = argmax_a Q(S_t, a; θ) with probability 1 − ε; the terminal executes the corresponding action in the heterogeneous wireless network, and through network data acquisition and data processing the result is converted into the format required by the algorithm and handed to the control layer for processing, thereby obtaining the throughput α, the delay β and the signal strength γ; these are then normalized separately; according to the service type, f_t(α)_avg, f_t(β)_avg and f_t(γ) are obtained by the analytic hierarchy process, and their weighted sum gives the reward value R; Q-MainNet obtains the system state and reward value and performs the target-value calculation by equation (9)
TargetQ = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ); θ⁻)    (9)
where R_{t+1} is the reward calculated in state S_{t+1} and γ is the attenuation coefficient; the reward of the agent in the current state is in fact all possible future rewards converted back to the present moment; after the action is completed, the system enters the next state S_{t+1};
the Q-MainNet network stores the memory group (s_t, a_t, r_t, s_{t+1}), i.e. the current state s_t, the action a_t, the current reward value r_t and the network state at time t+1, in the experience pool, from which the Q-target network randomly samples at each step; together with the output of the Q-MainNet network, the loss between the two networks with respect to the parameter θ, i.e. (TargetQ − Q(S_{t+1}, a; θ_t))², is calculated and a gradient descent algorithm is performed on it; after each iteration, the parameters of the Q-MainNet network are copied to the Q-target network; and training is carried out continuously and circularly.
Drawings
Fig. 1 is a general flow chart of a method of DDQN-based wireless network self-selection protocol;
FIG. 2 is a diagram of DDQN algorithm operation;
Detailed Description
The following specific steps of a method for implementing a DDQN-based wireless network self-selection protocol according to the present invention will be described with reference to fig. 1:
the first step: acquiring current network environment quality parameters and node service type determining states, actions and rewarding values in real time through an environment agent module;
to use reinforcement learning algorithms, it is necessary to define state, action and prize values, and network quality parameters are entered as state values.
State space definition: the state space S of a terminal at time t is defined such that s_{mn} ∈ S represents the state in which terminal m has accessed the n-th network and is exchanging information in that network. The state space is as follows:
S = {s_1, s_2, …, s_{mn}}    (1)
State definition: in heterogeneous networks the network traffic state is typically described using throughput, delay, packet loss rate, network load and the like, while user characteristics are described using network signal strength, node distance, node power consumption, cost and signal-to-noise ratio. Here the network state is described using the average throughput T, the delay D, the signal strength P and the node distance W; the network quality Φ can then be expressed as:
Φ = T × D × P × W    (2)
defining an action space: an action space is required to be set for the intelligent agent to select, and the definition of the action space is as follows:
A = {a_1, a_2, …, a_n}    (3)
where a_n indicates that a given node uses the n-th network protocol.
Reward value definition: each node carries the characteristics of its specific service when it is created and has its own service type; even in the same network environment, the corresponding nodes receive different rewards. According to actual requirements, the node service types are divided into the following categories:
1. High real-time requirement: the delay should be as low as possible and the transmission rate must be high, since excessive delay affects the realization of the service; a certain throughput is also required to ensure data reliability.
2. Extremely high throughput requirement: compared with service 1 the real-time requirement is not strong, but larger data traffic is required.
3. Higher delay requirement: network traffic under emergency conditions must be handled, the delay reduced as much as possible, and the user experience improved.
4. Only sufficient throughput needs to be ensured.
The access service network parameters consist of QoS parameters, a decision matrix is established for the network QoS, and the parameter weights are solved:
the decision matrix is shown in a formula, wherein each element represents the importance degree of the QoS parameter, and is specifically defined in the table, and the decision matrix meets m ij >0;m ji =1/m ij ;m ij =1。
TABLE 1 relationship of attributes to parameters
2, 4, 6, 8, which are not shown in Table 1, are used to represent adjacency judgmentIs a median value of (c). Since the service types are classified into 4 types in the process of defining the reward value, and three attributes of throughput, time delay and signal strength are considered, the decision matrix should be defined as a matrix of 3*3, namely M i ∈R 3×3 Wherein i=1, 2, 3, 4 respectively represent four service types of class 1, class 2, class 3, and class 4, and then respectively establish decision matrices for the four services according to the requirements of different service QoS parameters.
According to RFC 2474, the current standard for dividing network service types, the attribute values within a service class are determined through the DSCP (Differentiated Services Code Point). The DSCP occupies 6 bits of the Type of Service (TOS) byte of each packet's IP header, with the remaining 2 bits unused, and its coded value determines the IP priority. The IP priority field can be used for traffic classification: the larger the value, the higher the priority. The values range from 0 to 63, so 64 grades can be matched; the grades are grouped into classes by grade size, and the relation between the service attributes and the parameters can be determined from the DSCP field carried in the IP packets.
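As an illustration of how the DSCP field can drive this grouping, the short Python sketch below reads the 6-bit DSCP from a raw IPv4 header and maps it onto the four service classes; the equal-width partition of the 64 code points and the direction of the mapping are assumptions made purely for illustration, since the class boundaries are not fixed here.

```python
def dscp_from_ipv4_header(header: bytes) -> int:
    """Extract the 6-bit DSCP value from the TOS/DS byte (second byte) of an IPv4 header."""
    tos = header[1]              # Type of Service / Differentiated Services byte
    return tos >> 2              # the upper 6 bits are the DSCP; the lower 2 bits are not used here


def service_class(dscp: int) -> int:
    """Map a DSCP value (0-63) onto the four service classes (equal-width bins assumed)."""
    return min(dscp // 16, 3) + 1    # 0-15 -> class 1, 16-31 -> 2, 32-47 -> 3, 48-63 -> 4


# Example: DSCP 46 (Expedited Forwarding) falls into class 3 under this assumed partition.
print(service_class(46))
```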
For these four types of traffic, i takes the values 1, 2, 3 and 4 in turn. The eigenvector corresponding to the maximum eigenvalue is normalized, i.e. scaled so that its components sum to 1, and each value in the normalized eigenvector is the weight of the corresponding network QoS parameter. In the above four cases the network parameter requirements of the different service types differ, and these differences later affect the division of the reward value weights. Considering the entire network as a whole, the final goal is to optimize the overall network quality through the protocols selected by the nodes, the reward value being a function strongly correlated with the network.
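The weight calculation itself can be sketched in a few lines of Python; the 3×3 pairwise-comparison matrix below is a hypothetical example rather than one of the four matrices established by the invention, and numpy is used only for the eigendecomposition.

```python
import numpy as np


def ahp_weights(decision_matrix: np.ndarray) -> np.ndarray:
    """Return QoS weights from a positive reciprocal pairwise-comparison matrix."""
    eigenvalues, eigenvectors = np.linalg.eig(decision_matrix)
    principal = eigenvectors[:, np.argmax(eigenvalues.real)].real   # eigenvector of the largest eigenvalue
    return principal / principal.sum()                              # normalize so the components sum to 1


# Hypothetical class-1 matrix over (throughput, delay, signal strength):
# delay judged most important, then throughput, then signal strength.
M1 = np.array([[1.0, 1 / 3, 3.0],
               [3.0, 1.0, 5.0],
               [1 / 3, 1 / 5, 1.0]])
w1, w2, w3 = ahp_weights(M1)    # the weights omega_1, omega_2, omega_3 used in the reward of equation (8)
print(w1, w2, w3)
```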
V_t = {v_1, v_2, …, v_n}    (5)
where V_t represents the state information of the network at time t and is a subset of the network state space Φ; therefore, for a particular service B, the reward function R over the network space state V_t is expressed as follows and is solved in the next step:
R = f_B(V_t)    (6)
Access by the nodes affects the variation of the network parameters, and after an action is performed the network state needs to be measured and the corresponding reward fed back. When the performed action leads to an increase in network throughput, a decrease in delay and an increase in signal strength, it is an effective action; conversely, when the performed action causes network throughput to decrease, delay to increase and signal strength to decrease, the action is ineffective. The average throughput α_avg, the average delay β_avg and the signal strength γ are therefore taken into account when calculating the reward.
And a second step of: carrying out normalization processing on the data on the basis of the step 1), determining the node service type and determining the rewarding function;
the units and the numerical values of different network parameters are usually greatly different, normalization processing is needed, linear transformation is carried out on all numerical values, and the numerical values are mapped between [0,1 ].
Min–max normalization is used to eliminate the influence of the differing units of the data:
x' = (x − x_min)/(x_max − x_min)    (7)
where x' is the normalized value of x. Normalizing with this equation gives the normalized network average throughput f_t(α)_avg, average delay f_t(β)_avg and signal strength f_t(γ).
Combining the above gives the reward function:
R = ω_1 f_t(α)_avg + ω_2 f_t(β)_avg + ω_3 f_t(γ)    (8)
where ω_1, ω_2 and ω_3 are the weights of the network average throughput, delay and signal strength, given by the eigenvector obtained from the normalized decision matrix.
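For concreteness, a minimal sketch of equations (7) and (8) follows; the measurement ranges, sample values and weights are illustrative assumptions, and in practice the weights would come from the decision matrix of the relevant service class. Equation (8) is applied literally, i.e. the normalized delay enters with a positive weight as written.

```python
def min_max(x: float, x_min: float, x_max: float) -> float:
    """Equation (7): map x linearly onto [0, 1]; the caller supplies the observed range."""
    return (x - x_min) / (x_max - x_min) if x_max > x_min else 0.0


def reward(throughput: float, delay: float, signal: float, ranges: dict, weights: tuple) -> float:
    """Equation (8): weighted sum of the normalized average throughput, delay and signal strength."""
    w1, w2, w3 = weights                                  # AHP weights for the current service class
    f_alpha = min_max(throughput, *ranges["throughput"])  # f_t(alpha)_avg
    f_beta = min_max(delay, *ranges["delay"])             # f_t(beta)_avg
    f_gamma = min_max(signal, *ranges["signal"])          # f_t(gamma)
    return w1 * f_alpha + w2 * f_beta + w3 * f_gamma


# Example with assumed measurement ranges (Mbit/s, ms, dBm) and assumed weights.
ranges = {"throughput": (0.0, 100.0), "delay": (0.0, 200.0), "signal": (-100.0, -30.0)}
print(reward(42.0, 35.0, -60.0, ranges, (0.4, 0.4, 0.2)))
```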
And a third step of: based on the step 2), data is input into a DDQN decision network for real-time training, and an execution result is applied to enable the network state to tend to be stable.
One of the biggest disadvantages of DQN is that, although the argmax() operation lets the Q value approach the target quickly, it is likely to cause overestimation, i.e. a large deviation in the resulting algorithm model. To solve this problem, the error can be reduced by separating the selection of the maximizing action from the calculation of the target Q value. The network information is in a discrete state, and DDQN handles discrete-state data well.
Referring to FIG. 2, DQN is implemented with two neural networks, Q-MainNet and Q-target. DDQN likewise uses two networks, but calculates the target Q value in a different manner.
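For reference (general background on the two algorithms rather than language of the claims), the targets differ only in how the maximizing action is chosen: DQN selects and evaluates the action with the same target network, while DDQN selects it with Q-MainNet and evaluates it with Q-target:
DQN:  TargetQ = R_{t+1} + γ max_a Q(S_{t+1}, a; θ⁻)
DDQN: TargetQ = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ); θ⁻)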
Firstly, the state S and the action space A are initialized, the Q matrix is initialized to a zero matrix, and the Q-MainNet and Q-target networks are initialized with a random parameter θ, where θ denotes the network parameters: the Q-MainNet θ is set randomly at initialization and the Q-target parameter is set to θ⁻ = 0. With t denoting the current time state, the agent module reads the current network state information S_t and inputs it into the Q-MainNet network, which outputs the Q values of the different actions in state S_t. According to the ε-greedy strategy, the Q-MainNet network either selects a random action a_t ∈ A with probability ε, or selects the action a_t = argmax_a Q(S_t, a; θ) with probability 1 − ε. The terminal executes the corresponding action in the heterogeneous wireless network; through network data acquisition and data processing the result is converted into the format required by the algorithm and handed to the control layer for processing, which yields the throughput α, the delay β and the signal strength γ. These are then normalized separately. According to the service type, f_t(α)_avg, f_t(β)_avg and f_t(γ) are obtained by the analytic hierarchy process, and their weighted sum gives the reward value R. Q-MainNet obtains the system state and reward value and calculates the target value by equation (9):
TargetQ = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ); θ⁻)    (9)
where R_{t+1} is the reward calculated in state S_{t+1} and γ is the attenuation coefficient; the reward of the agent in the current state is in fact all possible future rewards converted back to the present moment. After the action is completed, the system enters the next state S_{t+1}.
The Q-MainNet network stores the memory group (s_t, a_t, r_t, s_{t+1}), i.e. the current state s_t, the action a_t, the current reward value r_t and the network state at time t+1, in the experience pool, from which the Q-target network randomly samples at each step; together with the output of the Q-MainNet network, the loss between the two networks with respect to the parameter θ, i.e. (TargetQ − Q(S_{t+1}, a; θ_t))², is calculated and a gradient descent step is performed on it. Every G steps, the parameters of the Q-MainNet network are copied to the Q-target network, and training continues in this loop.
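To make the training loop concrete, the following PyTorch sketch reimplements the third step under illustrative assumptions (a 4-dimensional state built from T, D, P and W, three candidate protocols as actions, and arbitrary hyperparameters); it is not the patent's implementation, but it follows the same structure: ε-greedy action selection by Q-MainNet, an experience pool, the double-Q target of equation (9), gradient descent on the squared error, and a periodic copy to Q-target every G steps.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 3            # state built from (T, D, P, W); one action per candidate protocol
GAMMA, EPSILON, BATCH, G = 0.9, 0.1, 32, 100


def make_net() -> nn.Module:
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))


q_main, q_target = make_net(), make_net()
q_target.load_state_dict(q_main.state_dict())
optimizer = torch.optim.Adam(q_main.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)          # experience pool of (s_t, a_t, r_t, s_{t+1}) tuples of tensors
# e.g. replay.append((state, torch.tensor(action), torch.tensor(r, dtype=torch.float32), next_state))


def select_action(state: torch.Tensor) -> int:
    """Epsilon-greedy action selection by Q-MainNet."""
    if random.random() < EPSILON:                        # explore with probability epsilon
        return random.randrange(N_ACTIONS)
    with torch.no_grad():                                # otherwise act greedily w.r.t. Q-MainNet
        return int(q_main(state).argmax().item())


def train_step(step: int) -> None:
    """One gradient step on the double-DQN loss over a random mini-batch from the experience pool."""
    if len(replay) < BATCH:
        return
    s, a, r, s2 = map(torch.stack, zip(*random.sample(replay, BATCH)))
    a = a.long().unsqueeze(1)
    with torch.no_grad():
        best = q_main(s2).argmax(dim=1, keepdim=True)                   # action selected by Q-MainNet
        target = r + GAMMA * q_target(s2).gather(1, best).squeeze(1)    # ... evaluated by Q-target (eq. 9)
    q_sa = q_main(s).gather(1, a).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)          # squared TD error, minimized by gradient descent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % G == 0:                                    # periodic copy of Q-MainNet parameters to Q-target
        q_target.load_state_dict(q_main.state_dict())
```

Here the greedy action inside the target is chosen by Q-MainNet while its value is read from Q-target, which is exactly the decoupling that suppresses the overestimation discussed above.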

Claims (1)

1. A method of wireless network self-selection protocol based on DDQN, comprising the steps of:
the first step: after the environment agent module acquires, in real time, the continuously changing network environment quality parameters and the service types of the nodes, the state, the action and the reward value are determined;
state space definition: the state space S of a terminal at time t is defined such that s_{mn} ∈ S represents the state in which terminal m has accessed the n-th network and is exchanging information in that network; the state space is as follows:
S = {s_{m1}, s_{m2}, …, s_{mn}}    (1)
state definition: using the average throughput T, the delay D, the signal strength P, the node distance W to describe the network state, the network quality Φ is expressed as:
Φ = T × D × P × W    (2)
defining an action space: an action space is required to be set for the intelligent agent to select, and the definition of the action space is as follows:
A = {a_1, a_2, …, a_n}    (3)
where a_n indicates that a given node uses the n-th network protocol;
the access service network parameters consist of QoS parameters, a judgment matrix M is established for the network QoS, and the parameter weights are solved:
in the decision matrix, each element m_{ij} represents the importance of QoS parameter i relative to parameter j, specifically defined as follows, and the decision matrix should satisfy m_{ij} > 0 and m_{ji} = 1/m_{ij}:
when i and j are equally important, m_{ij} is 1;
when i is slightly more important than j, m_{ij} is 3;
when i is obviously more important than j, m_{ij} is 5;
when i is strongly more important than j, m_{ij} is 7;
when i is extremely more important than j, m_{ij} is 9;
the values 2, 4, 6 and 8, which do not appear above, are used to represent intermediate grades between adjacent judgments; since the service types are divided into 4 classes when defining the reward value, and the three attributes throughput, delay and signal strength are considered, each decision matrix is defined as a 3×3 matrix, i.e. M_b ∈ R^{3×3}, where b = 1, 2, 3, 4 denote the class-1, class-2, class-3 and class-4 service types respectively; 4 decision matrices are then respectively established for the four services according to the QoS parameter requirements of the different services;
determining the attribute values within a service class through the DSCP according to RFC 2474, the current standard for dividing network service types; the DSCP occupies 6 bits of the Type of Service (TOS) byte of each packet's IP header, with the remaining 2 bits unused, and its coded value determines the IP priority; the IP priority field can be used for traffic classification, the larger the value the higher the priority; the values range from 0 to 63, so 64 grades can be matched, the grades are grouped into classes by grade size, and the relation between the service attributes and the parameters can be determined from the DSCP field carried in the IP packets;
for the four types of services, b takes the values 1, 2, 3 and 4 in turn; the eigenvector corresponding to the maximum eigenvalue is normalized, i.e. scaled so that its components sum to 1, and each value in the normalized eigenvector is the weight of the corresponding network QoS parameter; in the above four cases the network parameter requirements of the different service types differ, and these differences later affect the division of the reward value weights; considering the entire network as a whole, the final goal is to optimize the overall network quality through the protocols selected by the nodes, the reward value being a function strongly correlated with the network;
V_t = {v_1, v_2, …, v_n}    (5)
where V_t represents the state information of the network at time t and is a subset of the network state space Φ; therefore, for a particular service B, the reward function R over the network space state V_t is expressed as follows and is solved in the next step:
R = f_B(V_t)    (6)
the access of the nodes influences the variation of the network parameters, and after an action is executed the network state needs to be measured and the corresponding reward fed back; when the executed action leads to an increase in network throughput, a decrease in delay and an increase in signal strength, it is an effective action; conversely, when the executed action causes network throughput to decrease, delay to increase and signal strength to decrease, the action is ineffective; the average throughput α_avg, the average delay β_avg and the signal strength γ are therefore taken into account when calculating the reward;
and a second step of: carrying out normalization processing on the data on the basis of the first step, determining the node service type and determining a reward function;
using min–max normalization to eliminate the influence of the differing units of the data:
x' = (x − x_min)/(x_max − x_min)    (7)
where x' is the normalized value of x after conversion, and α, β and γ are normalized in turn;
normalizing with this equation gives the normalized network average throughput f_t(α)_avg, average delay f_t(β)_avg and signal strength f_t(γ) at time t;
combining the above gives the reward function:
R = ω_1 f_t(α)_avg + ω_2 f_t(β)_avg + ω_3 f_t(γ)    (8)
where ω_1, ω_2 and ω_3 are the weights of the network average throughput, delay and signal strength, given by the eigenvector obtained from the normalized decision matrix;
and a third step of: on the basis of the second step, inputting data into a DDQN decision network for real-time training, and applying an execution result to enable the network state to tend to be stable;
firstly, a state space S and an action space A are initialized, the Q matrix is initialized to a zero matrix, and the Q-MainNet and Q-target networks are initialized with a random parameter θ, where θ denotes the network parameters: the Q-MainNet θ is set randomly at initialization and the Q-target parameter is set to θ⁻ = 0; t denotes the current time state, and the agent module reads the current network state information S_t and inputs it into the Q-MainNet network, which outputs the Q values of the different actions in state S_t; according to the ε-greedy strategy, the Q-MainNet network either selects a random action a_t ∈ A with probability ε, or selects the action a_t = argmax_a Q(S_t, a; θ) with probability 1 − ε; the terminal executes the corresponding action in the heterogeneous wireless network, and through network data acquisition and data processing the result is converted into the format required by the algorithm and handed to the control layer for processing, thereby obtaining the throughput α, the delay β and the signal strength γ; these are then normalized separately; according to the service type, f_t(α)_avg, f_t(β)_avg and f_t(γ) are obtained by the analytic hierarchy process, and their weighted sum gives the reward value R; Q-MainNet obtains the system state and reward value and performs the target-value calculation by equation (9)
TargetQ = R_{t+1} + γ Q(S_{t+1}, argmax_a Q(S_{t+1}, a; θ); θ⁻)    (9)
where R_{t+1} is the reward calculated in state S_{t+1} and γ is the attenuation coefficient; the reward of the agent in the current state is in fact all possible future rewards converted back to the present moment; after the action is completed, the system enters the next state S_{t+1};
the Q-MainNet network stores the memory group (s_t, a_t, r_t, s_{t+1}), i.e. the current state s_t, the action a_t, the current reward value r_t and the network state at time t+1, in the experience pool, from which the Q-target network randomly samples at each step; together with the output of the Q-MainNet network, the loss between the two networks with respect to the parameter θ, i.e. (TargetQ − Q(S_{t+1}, a, θ_t))², is calculated and a gradient descent algorithm is performed on it; after each iteration, the parameters of the Q-MainNet network are copied to the Q-target network; and training is carried out continuously and circularly.
CN202110249773.3A 2021-03-05 2021-03-05 Wireless network self-selection protocol method based on DDQN Active CN113055229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110249773.3A CN113055229B (en) 2021-03-05 2021-03-05 Wireless network self-selection protocol method based on DDQN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110249773.3A CN113055229B (en) 2021-03-05 2021-03-05 Wireless network self-selection protocol method based on DDQN

Publications (2)

Publication Number Publication Date
CN113055229A CN113055229A (en) 2021-06-29
CN113055229B (en) 2023-10-27

Family

ID=76510598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110249773.3A Active CN113055229B (en) 2021-03-05 2021-03-05 Wireless network self-selection protocol method based on DDQN

Country Status (1)

Country Link
CN (1) CN113055229B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118368259B (en) * 2024-06-18 2024-08-30 井芯微电子技术(天津)有限公司 Network resource allocation method, device, electronic equipment and storage medium
CN118397519B (en) * 2024-06-27 2024-08-23 湖南协成电子技术有限公司 Campus student safety monitoring system and method based on artificial intelligence

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103327556A (en) * 2013-07-04 2013-09-25 中国人民解放军理工大学通信工程学院 Dynamic network selection method for optimizing quality of experience (QoE) of user in heterogeneous wireless network
CN105208624A (en) * 2015-08-27 2015-12-30 重庆邮电大学 Service-based multi-access network selection system and method in heterogeneous wireless network
CN107889195A (en) * 2017-11-16 2018-04-06 电子科技大学 A kind of self study heterogeneous wireless network access selection method of differentiated service
CN110809306A (en) * 2019-11-04 2020-02-18 电子科技大学 Terminal access selection method based on deep reinforcement learning
WO2021013368A1 (en) * 2019-07-25 2021-01-28 Telefonaktiebolaget Lm Ericsson (Publ) Machine learning based adaption of qoe control policy

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103327556A (en) * 2013-07-04 2013-09-25 中国人民解放军理工大学通信工程学院 Dynamic network selection method for optimizing quality of experience (QoE) of user in heterogeneous wireless network
CN105208624A (en) * 2015-08-27 2015-12-30 重庆邮电大学 Service-based multi-access network selection system and method in heterogeneous wireless network
CN107889195A (en) * 2017-11-16 2018-04-06 电子科技大学 A kind of self study heterogeneous wireless network access selection method of differentiated service
WO2021013368A1 (en) * 2019-07-25 2021-01-28 Telefonaktiebolaget Lm Ericsson (Publ) Machine learning based adaption of qoe control policy
CN110809306A (en) * 2019-11-04 2020-02-18 电子科技大学 Terminal access selection method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A new network access selection algorithm oriented to users' multi-service QoS requirements; Zhang Yuanyuan et al.; Computer Science; 2015-03-31; vol. 42, no. 3; full text *
Access network selection algorithm based on Markov model; Ma Li et al.; Computer Engineering; 2019-05-31; vol. 45, no. 5; full text *

Also Published As

Publication number Publication date
CN113055229A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN109947545B (en) Task unloading and migration decision method based on user mobility
CN108770029B (en) Wireless sensor network clustering routing protocol method based on clustering and fuzzy system
CN111629380B (en) Dynamic resource allocation method for high concurrency multi-service industrial 5G network
CN113055229B (en) Wireless network self-selection protocol method based on DDQN
CN111510879B (en) Heterogeneous Internet of vehicles network selection method and system based on multi-constraint utility function
WO2019184836A1 (en) Data analysis device, and multi-model co-decision system and method
CN114142907B (en) Channel screening optimization method and system for communication terminal equipment
CN107708197B (en) high-energy-efficiency heterogeneous network user access and power control method
CN113596785B (en) D2D-NOMA communication system resource allocation method based on deep Q network
Sekaran et al. 5G integrated spectrum selection and spectrum access using AI-based frame work for IoT based sensor networks
CN110233755B (en) Computing resource and frequency spectrum resource allocation method for fog computing in Internet of things
CN110519849B (en) Communication and computing resource joint allocation method for mobile edge computing
CN113038612B (en) Cognitive radio power control method based on deep learning
CN116916386A (en) Large model auxiliary edge task unloading method considering user competition and load
CN113473580A (en) Deep learning-based user association joint power distribution strategy in heterogeneous network
CN113676357B (en) Decision method for edge data processing in power internet of things and application thereof
Wu et al. Link congestion prediction using machine learning for software-defined-network data plane
Kaur et al. Intelligent spectrum management based on reinforcement learning schemes in cooperative cognitive radio networks
CN110139282A (en) A kind of energy acquisition D2D communication resource allocation method neural network based
CN113590211A (en) Calculation unloading method based on PSO-DE algorithm
CN115811788B (en) D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning
WO2013102294A1 (en) Method of distributed cooperative spectrum sensing based on unsupervised clustering in cognitive self-organizing network
CN108848519B (en) Heterogeneous network user access method based on cross entropy learning
Huang et al. A Hierarchical Deep Learning Approach for Optimizing CCA Threshold and Transmit Power in Wi-Fi Networks
CN114615705B (en) Single-user resource allocation strategy method based on 5G network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant