WO2022110980A1 - Intelligent network coding method and device based on deep reinforcement learning (一种基于深度强化学习的智能网络编码方法和设备) - Google Patents

Intelligent network coding method and device based on deep reinforcement learning

Info

Publication number
WO2022110980A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
network
coding
packet
intermediate node
Prior art date
Application number
PCT/CN2021/118099
Other languages
English (en)
French (fr)
Inventor
王琪
刘建敏
徐勇军
王永庆
Original Assignee
中国科学院计算技术研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院计算技术研究所 filed Critical 中国科学院计算技术研究所
Publication of WO2022110980A1 publication Critical patent/WO2022110980A1/zh

Classifications

    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • The invention relates to the field of information technology, and in particular to a network coding method.
  • Linear network coding is a class of network coding in which data are linearly combined using coding coefficients selected from a finite field. Compared with nonlinear network coding based on nonlinear combining functions, linear network coding has lower complexity and a simpler model, and has therefore been studied in depth and widely applied.
  • The basic idea of linear network coding is that nodes in the network linearly encode the original data with coding coefficients selected from a finite field to form new encoded data and forward it, and the receiving node can recover the original data through the corresponding decoding operations.
  • Linear network coding methods mainly include deterministic network coding algorithms and random linear network coding algorithms. A deterministic network coding algorithm can guarantee that the destination nodes decode successfully, but it requires global information such as the network topology and link capacities. Many different topologies exist in practice, and it is impractical to design a specific coding scheme for every type of network. Furthermore, deterministic coding is not suitable for dynamic networks, because collecting global information from distributed nodes in real time is very complex and cannot be applied at scale.
  • In random linear network coding, nodes linearly combine the data to be sent using coding coefficients chosen independently and at random from a given finite field. Related research has proved that, as long as the finite field is large enough, random linear network coding ensures that every receiving node can complete decoding with high probability, i.e., the global coding-coefficient matrix corresponding to the receiving node is full rank. Since the main feature of random linear network coding is the random selection of the coefficients of the linear combinations, it is suitable for networks whose topology is unknown or changing, as it can easily be implemented in a distributed manner.
  • For example, a node with coding capability that has three data packets X, Y, Z to send can randomly select coding coefficients a_1, a_2, a_3, b_1, b_2, b_3, c_1, c_2, c_3, use them to combine the packets into a_1X + a_2Y + a_3Z, b_1X + b_2Y + b_3Z and c_1X + c_2Y + c_3Z, and then send these combinations out.
  • After the receiving node receives the three coded combinations, the original packets X, Y, Z can be recovered through linear operations whenever the 3x3 coefficient matrix with rows (a_1, a_2, a_3), (b_1, b_2, b_3), (c_1, c_2, c_3) is full rank, as illustrated in the sketch below.
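  • As an illustration only (not part of the patent text), the following Python sketch performs this random linear combination over GF(2) (i.e., q = 2) on toy 8-bit payloads and checks the full-rank condition that allows the receiver to recover X, Y, Z; the payload size and random seed are arbitrary assumptions:

```python
import numpy as np

def gf2_rank(rows):
    """Rank of a binary matrix over GF(2), computed by Gaussian elimination."""
    m = [row.copy() for row in rows]
    rank, n_cols = 0, len(m[0])
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col]:
                m[r] = (m[r] + m[rank]) % 2
        rank += 1
    return rank

rng = np.random.default_rng(0)
packets = rng.integers(0, 2, size=(3, 8))   # X, Y, Z as 8-bit payloads
coeffs = rng.integers(0, 2, size=(3, 3))    # rows: (a_1,a_2,a_3), (b_1,b_2,b_3), (c_1,c_2,c_3)
coded = coeffs @ packets % 2                # the three coded combinations sent out
decodable = gf2_rank(list(coeffs)) == 3     # full rank <=> receiver can solve for X, Y, Z
print("coefficient matrix full rank:", decodable)
```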
  • In view of the above problems, according to a first aspect of the present invention, a network coding method is proposed. The network includes a source node and intermediate nodes, and the method includes:
  • the source node divides the information to be sent into K slices x_1, x_2, ..., x_K, where K is an integer greater than 1, determines the coding coefficients g(x_1), g(x_2), ..., g(x_K) of the slices according to a source-node coding model, encodes the K slices to generate an encoded packet P_S, and sends the encoded packet P_S to the next-hop node; the source-node coding model is obtained by training a DQN, using the environment state of each step, ss_k = {x_k, B̂_R}, as the training input, where ss_k is the environment state at step k, x_k is the k-th slice of the information packet, and B̂_R = {P_S(1), P_S(2), ..., P_S(M)} (the notation used below for this buffered set) is the set of the M most recently received encoded packets stored in the buffer of the source node's next-hop intermediate node, M being an integer greater than 1;
  • an intermediate node receives the encoded packet sent by the previous node, re-encodes the received encoded packet P_j M times, determines the coding coefficients g(P_j(1)), g(P_j(2)), ..., g(P_j(M)) of each encoding step according to an intermediate-node coding model, generates an encoded packet P_new, and sends P_new to the next-hop node;
  • the intermediate-node coding model is obtained by training a DQN, using the environment state of each step, s_k = {P_new, P_j(k), B̂_z}, as the training input, where s_k is the environment state at step k, P_new is the current encoded packet, P_j(k) is the k-th encoded packet in the buffer of the intermediate node, and B̂_z = {P_i(1), ..., P_i(M)} is the set of the M most recently received encoded packets stored in the buffer of the intermediate node's next-hop node z.
  • In an embodiment, the source-node coding model includes a target network N_s and an execution network N_snode, and the training of the source-node coding model includes the steps:
  • Step 110: randomly sample experience from the experience replay memory M_s to train N_s;
  • Step 120: send the DQN parameters of N_s after training to the source node to update N_snode; and/or
  • Step 130: on the source node, take the environment state ss_k as the input of the DQN model N_snode, output the Q value corresponding to each action, and select an action with greedy-policy probability ε to determine the coding coefficients of the K slices of the original information; after execution, collect the experience of the source node interacting with the environment and store it in the experience replay memory M_s.
  • the intermediate node encoding model includes a target network NR and an execution network NRnode
  • the training of the intermediate node encoding model includes:
  • Step 210 randomly sample experience in the experience replay memory MR to train NR ;
  • Step 220 Send the DQN parameters after NR training to each intermediate node to update the NRnode ;
  • Step 230 On each intermediate node, take the environmental state sk as the input of the DQN model of the N Rnode , output the Q value corresponding to each behavior, and use the greedy strategy probability ⁇ to select the behavior to determine the value of the M packets in the intermediate node buffer.
  • the coding coefficients after execution, collect the experience of the intermediate nodes interacting with the environment, and store the experience in the experience replay memory MR .
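  • As an informal sketch only, the round described in Steps 210-230 could be organized as follows; train_step, get_parameters and set_parameters are hypothetical helper methods standing in for whatever DQN implementation is used, and replay_memory is assumed to expose a sample(batch_size) method:

```python
def training_round(target_net, node_nets, replay_memory, batch_size=32):
    """One round of Steps 210-230: train N_R centrally on sampled experience, push the
    updated DQN parameters to every N_Rnode, then let the nodes act and feed experience back."""
    batch = replay_memory.sample(batch_size)     # Step 210: random sampling from M_R
    target_net.train_step(batch)                 # minimize the DQN loss on the sampled batch
    params = target_net.get_parameters()
    for node_net in node_nets:                   # Step 220: send the updated parameters out
        node_net.set_parameters(params)
    # Step 230 runs on the nodes themselves: N_Rnode maps s_k to Q values, an epsilon-greedy
    # choice yields a_k, and the resulting (s_k, a_k, r_k, s_{k+1}) tuples go back into M_R
```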
  • In an embodiment, training N_s includes: taking the network-coding environment state ss_k as the input of N_s and training the neural network by minimizing the loss function L(θ_k) = (Q_target - q(ss_k, a_k; θ_k))², with k = 1...K, where Q_target is the target Q value computed by N_s;
  • a_k denotes the action at step k;
  • r_k denotes the reward obtained after taking the action at step k;
  • θ_k denotes the network parameters of the DQN at step k.
  • In an embodiment, training N_R includes: taking the network-coding environment state s_k as the input of N_R and training the neural network by minimizing the loss function L(θ_k) = (Q_target - q(s_k, a_k; θ_k))², with k = 1...M, where
  • Q_target is the target Q value computed by N_R;
  • a_k denotes the action at step k;
  • r_k denotes the reward obtained after taking the action at step k;
  • θ_k denotes the network parameters of the DQN at step k.
  • For N_s, a_k is the coding coefficient of the k-th slice x_k of the information, a_k ∈ A_S, where A_S = {0, 1, ..., (q-1)} and q is the size of the Galois field; when the encoded packet sent by the source node increases the rank of the linear system formed by the encoded packets in the buffer of the source node's next-hop intermediate node, r_k is 1; otherwise, r_k is 0.
  • For N_R, a_k is the coding coefficient of the k-th packet, a_k ∈ A_R, where A_R = {0, 1, ..., (q-1)} and q is the size of the Galois field; when the encoded packet sent by the intermediate node increases the rank of the linear system formed by the encoded packets in the buffer of the intermediate node's next-hop node, r_k is 1; otherwise, r_k is 0.
  • If the source node does not receive an ACK, the buffer component B̂_R of the source node's state ss_k remains unchanged; if an intermediate node does not receive an ACK, the buffer component B̂_z of that intermediate node's state s_k remains unchanged.
  • The source node generates the encoded packet P_S as P_S = G_S · X, where X = [x_1, x_2, ..., x_K] and G_S = [g(x_1), g(x_2), ..., g(x_K)].
  • The k-th of the M encodings performed by the intermediate node is: when k = 1, P_new = P_j ⊕ (g(P_j(1)) · P_j(1)); when k > 1, P_new = P_new ⊕ (g(P_j(k)) · P_j(k)),
  • where P_j(k) is the k-th encoded packet in the buffer of the intermediate node and k = 1...M (both operations are illustrated in the sketch below).
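  • A minimal sketch of these two operations over GF(2), assuming binary slices/packets and coefficients in {0, 1} (the payload values below are arbitrary examples):

```python
import numpy as np

def source_encode(slices, coeffs):
    """P_S = G_S · X over GF(2): XOR together the slices whose coefficient g(x_k) is 1."""
    p_s = np.zeros_like(slices[0])
    for g, x in zip(coeffs, slices):
        p_s ^= (g * x) % 2
    return p_s

def intermediate_reencode(p_j, buffered, coeffs):
    """Starting from P_new = P_j, apply P_new = P_new XOR (g(P_j(k)) * P_j(k)) for k = 1..M."""
    p_new = p_j.copy()
    for g, p_k in zip(coeffs, buffered):
        p_new ^= (g * p_k) % 2
    return p_new

slices = [np.array([1, 0, 1, 1]), np.array([0, 1, 1, 0])]   # x_1, x_2 (K = 2)
print(source_encode(slices, coeffs=[1, 1]))                 # -> [1 1 0 1]
```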
  • A computer-readable storage medium is provided, in which one or more computer programs are stored; when executed, the computer programs implement the network coding method of the present invention.
  • A network-coding computing system is provided, comprising a storage device and one or more processors, wherein the storage device stores one or more computer programs which, when executed by the processors, implement the network coding method of the present invention.
  • Compared with the prior art, the present invention has the following advantages:
  • The present invention innovatively proposes a method for adaptively selecting coding coefficients using deep reinforcement learning. Compared with the prior art, the invention can adaptively adjust the coding coefficients according to dynamic changes of the network (including changes in network link quality and in the number of intermediate nodes), so as to adapt to highly dynamic network environments and improve decoding efficiency.
  • The present invention uses a Markov Decision Process (MDP) to formulate the coding-coefficient optimization problem, in which network changes can be automatically and continuously represented as MDP state transitions.
  • The present invention has good model generalization ability and can generalize to networks with different network scales and different link qualities, so that the invention can adapt to dynamic changes of the network.
  • The present invention realizes a distributed coding-coefficient optimization mechanism: the coding-coefficient optimization model networks based on the Deep Q-learning Network (DQN) are trained centrally by preset optimizers, while the coding-coefficient optimization models of the source node and the intermediate nodes are executed in a distributed manner on the source node and the intermediate nodes respectively, which simplifies the implementation of coding-coefficient optimization and improves the stability of DQN training.
  • FIG. 1 shows a flowchart of source-node network coding according to an embodiment of the present invention;
  • FIG. 2 shows a flowchart of intermediate-node network coding according to an embodiment of the present invention;
  • FIG. 3 shows a functional configuration block diagram of a device for deep reinforcement learning-based intelligent network coding according to an embodiment of the present invention;
  • FIG. 4 shows a multi-hop linear network topology diagram according to an embodiment of the present invention;
  • FIG. 5 shows a multi-hop parallel network topology diagram according to an embodiment of the present invention;
  • FIG. 6 shows simulation results for the multi-hop linear network according to an embodiment of the present invention;
  • FIG. 7 shows simulation results for the multi-hop parallel network according to an embodiment of the present invention;
  • FIG. 8 shows simulation results for the generalization ability over different network scales according to an embodiment of the present invention;
  • FIG. 9 shows simulation results for the generalization ability over different link qualities according to an embodiment of the present invention;
  • FIG. 10 shows a comparison of the simulation results of the embodiment of the present invention, the benchmark coding algorithm and the RL-aided SNC method with the results obtained on the real test platform.
  • A network in the present invention includes a source node, intermediate nodes, and a destination node that receives information.
  • The information is generated and sent by the source node, passes through the intermediate nodes, and is finally received by the destination node.
  • The source node divides the information into multiple slices, determines the coding coefficient of each slice, encodes these slices to generate an encoded packet, and sends the encoded packet to the next-hop node.
  • An intermediate node receives encoded packets, determines a coding coefficient for each received encoded packet, encodes the multiple encoded packets again to generate a new encoded packet, and sends the new encoded packet to the next-hop node.
  • The present invention adopts the deep reinforcement learning method DQN to determine the coding coefficients. The DQN model involves multiple steps and multiple environment states; in each environment state several actions can be taken, and each action corresponds to a different reward.
  • Each step corresponds to determining the coding coefficient for one slice or one packet;
  • the action at that step is the determined coding coefficient;
  • and the environment state is the relevant information slice or the relevant encoded packets.
  • DQN uses the Q value to evaluate each action. Among the actions available in an environment state, the action with the largest Q value is the best action, i.e., the action that should be taken in that state.
  • DQN seeks the best solution as a whole, so the best action is evaluated over the whole sequence of actions: in the current environment state, this action yields the best cumulative reward over all steps. The Q value is computed from the rewards as Q_k = r_k + γ max Q_{k+1}, where k is a positive integer, 0 ≤ γ ≤ 1 is the discount factor, r_k is the reward at step k, and the Q value of the last step equals the reward of the last step.
  • DQN trains a neural network so that the network can compute the Q value corresponding to each action in each environment state.
  • The training method of DQN is to collect inputs and outputs from the real environment, where the input is the environment state and the output is the Q value of an action.
  • The environment state is fed into a convolutional neural network (CNN) to compute the Q values of the actions; a loss function expresses the error between the computed target Q value and the actual Q value, and the neural network parameters are trained so as to reduce this error. In practice, although the action with the largest Q value is the best action, a greedy policy is used to balance learning and exploration: with a small probability ε (ε < 1) an unexplored action is taken, and with probability 1 - ε the action with the largest learned Q value is taken (see the sketch below).
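  • A minimal sketch of this target/loss computation, with toy numbers; the discount factor value below is an assumption, since the text only requires 0 ≤ γ ≤ 1:

```python
import numpy as np

GAMMA = 0.9   # assumed discount factor

def td_target(reward, next_q_values, done):
    """Q_target = r_k + gamma * max_a Q(s_{k+1}, a); the last step's target is just its reward."""
    return reward if done else reward + GAMMA * np.max(next_q_values)

def dqn_loss(q_values, action, target):
    """Squared error between the target Q value and the Q value of the action actually taken."""
    return (target - q_values[action]) ** 2

q_now = np.array([0.2, 0.7])       # Q values for the two GF(2) actions in the current state
q_next = np.array([0.4, 0.9])      # Q values of the next state, from the fixed target model
loss = dqn_loss(q_now, action=1,
                target=td_target(reward=1.0, next_q_values=q_next, done=False))
print(loss)                         # (1.0 + 0.9 * 0.9 - 0.7) ** 2 = 1.2321
```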
  • The existing DQN also includes a sample replay buffer (Replay Buffer), also called experience replay (Experience replay), and a target network (Target Network).
  • To decouple training from execution as much as possible and mitigate the resulting instability, the present invention introduces a new network, still named the Target Network, while the original target network is called the execution network (Behavior Network).
  • At the beginning of training, both network models use exactly the same parameters.
  • During execution, the Behavior Network is responsible for interacting with the environment and obtaining interaction samples.
  • During training, the target Q value obtained by Q-learning is computed by the Target Network; it is then compared with the Q value obtained by the Behavior Network in its interaction with the environment to obtain the error.
  • By reducing this error, the Target Network is trained and its model is continuously updated, and the updated model is then synchronized to the Behavior Network to update the Behavior Network's model.
  • Whenever a certain number of training iterations has been completed, the experience of the Behavior Network model is synchronized to the Target Network so that the next stage of training can be carried out.
  • By using the Target Network, the model that computes the Q value is kept fixed for a period of time, which reduces the volatility of the model.
  • The Target Network of the present invention includes two neural networks N_s and N_R, where N_s is used for the source node and is trained by a preset optimizer O_s, and N_R is used for all intermediate nodes and is trained by a preset optimizer O_R. O_s and O_R each have a memory for storing experience, which includes the environment state, action, and reward of each step.
  • The memory of O_s is M_s, and the memory of O_R is M_R.
  • The Behavior Network of the present invention includes a set of neural networks N_snode deployed on the source node and a set of neural networks N_Rnode deployed on each of the intermediate nodes.
  • N_snode is a replica of N_s, and N_Rnode is a replica of N_R.
  • N_snode and N_Rnode are not trained; instead, on each node the environment state is fed into them to obtain the Q value corresponding to each action.
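  • On a node, selecting a coding coefficient from those Q values with the greedy policy could look like the following sketch; ε = 0.1 is the value given for the embodiment, and the Q values below are placeholders:

```python
import numpy as np

EPSILON = 0.1   # greedy-policy probability used in the embodiment

def select_coefficient(q_values, rng=np.random.default_rng()):
    """Epsilon-greedy choice over the action set A = {0, ..., q-1} of coding coefficients."""
    if rng.random() < EPSILON:
        return int(rng.integers(len(q_values)))   # explore: random coefficient
    return int(np.argmax(q_values))               # exploit: coefficient with the largest Q value

q_values = np.array([0.12, 0.87])                 # e.g. output of N_snode / N_Rnode for GF(2)
print(select_coefficient(q_values))
```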
  • The deep reinforcement learning-based network coding method of the present invention includes two parts: a centralized training process and a distributed execution process.
  • In the centralized training process, the DQN-based coding-coefficient optimization model networks are trained centrally by the preset optimizers.
  • In the distributed execution process, the DQN-based coding-coefficient optimization models of the source node and the intermediate nodes are executed in a distributed manner on the source node and the intermediate nodes respectively, and the experience generated by execution is sent back to the optimizers for training, so that execution and training proceed side by side to speed up DQN training.
  • In the centralized training process, the source-node optimizer O_s randomly samples experience from the experience replay memory M_s to train the source node's DQN N_s; the source-node environment state ss_k (its specific content is described in detail below) is input, the neural network N_s is trained by minimizing a preset loss function, and the output of N_s is the Q value Q_k, i.e., the optimal cumulative reward obtained after selecting action a_k in environment state ss_k.
  • The loss function is L(θ_k) = (Q_target - q(ss_k, a_k; θ_k))², where
  • Q_target is the target Q value computed by N_s,
  • q(ss_k, a_k; θ_k) is the empirically known Q value after selecting action a_k in environment state ss_k,
  • and θ_k denotes the network parameters of the DQN at the current decision step k.
  • Similarly, the intermediate-node optimizer O_R randomly samples experience from the experience replay memory M_R to train the intermediate nodes' DQN N_R; the intermediate-node environment state s_k (its specific content is described in detail below) is input,
  • the neural network N_R is trained by minimizing a preset loss function, and the output of N_R is the Q value Q_k, i.e., the optimal cumulative reward obtained after selecting action a_k in environment state s_k.
  • The loss function is L(θ_k) = (Q_target - q(s_k, a_k; θ_k))², where Q_target is the target Q value computed by N_R, q(s_k, a_k; θ_k) is the empirically known Q value after selecting action a_k in environment state s_k, and θ_k denotes the network parameters of the DQN at the current decision step k.
  • Once the DQN parameters are updated, the centralized optimizers O_s and O_R send the updated DQN parameters to every source node and intermediate node in the network.
  • The source node and the intermediate nodes use the received DQN parameters to update the DQN parameters of the neural networks N_snode and N_Rnode on the node.
  • In the distributed execution process, after an action is executed the source node obtains a reward value r_k, and the optimizer O_s collects the source node's experience of interacting with the environment (ss_k, a_k, r_k, ss_{k+1}) and stores it in the experience replay memory M_s; likewise, after an action is executed the intermediate node obtains a reward value r_k,
  • and the optimizer O_R collects the intermediate node's experience of interacting with the environment (s_k, a_k, r_k, s_{k+1}) and stores this experience in the experience replay memory M_R.
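  • A minimal sketch of such an experience replay memory (M_s or M_R); the capacity value is an arbitrary assumption:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (state, action, reward, next_state) tuples and supports random sampling."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # the optimizer O_s / O_R randomly samples experience to train N_s / N_R
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

memory = ReplayMemory()
memory.store(state=[0, 1, 1], action=1, reward=1, next_state=[1, 1, 1])
batch = memory.sample(batch_size=32)
```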
  • The two modules in the deep reinforcement learning-based source-node coding-coefficient optimization model, namely the deep reinforcement learning agent and the network environment, are designed as follows:
  • the source node is regarded as the deep reinforcement learning agent;
  • the abstracted environment is the network formed by the source node and all of the source node's next-hop intermediate nodes, including the source node, all of its next-hop intermediate nodes, and the links between the source node and those next-hop intermediate nodes;
  • the deep reinforcement learning agent observes the environment state ss_k of the current decision step k, takes an action a_k on the environment according to ss_k, and the environment feeds back a reward r_k to the agent, realizing the interaction between the deep reinforcement learning agent and the environment.
  • At the current decision step k, the environment state ss_k observed by the source node is:
  • the k-th slice x_k of an information packet together with the M (for example, M = 10) most recently received encoded packets stored in the buffer of the source node's next-hop intermediate node, M being an integer greater than 1, i.e., ss_k = {x_k, B̂_R} with B̂_R = {P_S(1), P_S(2), ..., P_S(M)}, where P_S(l) is the l-th encoded packet in the buffer of the source node's next-hop intermediate node.
  • At each decision step k, the source node executes an action a_k ∈ A_S that determines the coding coefficient g(x_k) = a_k of the k-th slice x_k, where A_S = {0, 1, ..., (q-1)} and q is the size of the Galois field.
  • The reward r_k received from the environment is: r_k = 1 when the encoded packet sent by the source node increases the rank of the linear system formed by the encoded packets in the next-hop intermediate node's buffer, and r_k = 0 otherwise. After the K decision steps, the coding coefficients of the K slices have all been determined, and the source node uses them to encode the K slices and sends the encoded packet P_S = G_S · X, where X = [x_1, x_2, ..., x_K] and G_S = [g(x_1), g(x_2), ..., g(x_K)].
  • In one embodiment, the current node keeps the encoded packets sent to the next-hop node, forming the next-hop intermediate node buffer B̂_R maintained on the source node, and confirms through the ACK returned by the next-hop node after sending whether the next-hop node has received the encoded packet. If this node does not receive an ACK, the next-hop node has not received the encoded packet and B̂_R does not change; that is, when the source node sends the next encoded packet, the B̂_R in its state ss_k is unchanged relative to sending the current packet. If this node receives an ACK, the next-hop node has successfully received the encoded packet and B̂_R changes; that is, when the source node sends the next encoded packet, the B̂_R in its state ss_k has changed relative to sending the current packet. Whether the ACK packet is received is thus determined by the link quality, and the link quality in turn affects the encoded packets stored in the buffer B̂_R, so the source node's coding model can adaptively adjust the coding coefficients according to changes in the network link quality.
  • In one embodiment, after all K steps have been executed the packet is sent to the next-hop node and the rewards of steps 1 to K are determined; the rewards of these K steps are the same. Since this node keeps in the buffer B̂_R the encoded packets accepted by the next-hop node, regardless of whether an ACK is received, the node can evaluate the action according to whether the sent encoded packet changes the rank of the linear system formed by the encoded packets in B̂_R (a sketch of this reward rule follows below).
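  • A sketch of this rank-based reward, reusing the gf2_rank helper from the GF(2) example earlier in this section (GF(2) is an assumption; the packets below are arbitrary):

```python
import numpy as np

def reward_for_packet(buffer_packets, new_packet):
    """r_k = 1 if the sent encoded packet increases the rank of the linear system formed by
    the encoded packets in the next-hop node's buffer (over GF(2)), otherwise r_k = 0."""
    rows = [np.array(p) for p in buffer_packets]
    if not rows:
        return 1
    before = gf2_rank(rows)
    after = gf2_rank(rows + [np.array(new_packet)])
    return 1 if after > before else 0

print(reward_for_packet([[1, 0, 1, 0], [0, 1, 1, 0]], [1, 1, 0, 0]))   # -> 0 (linearly dependent)
print(reward_for_packet([[1, 0, 1, 0], [0, 1, 1, 0]], [0, 0, 0, 1]))   # -> 1 (rank increases)
```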
  • Figure 2 shows the encoding process of an intermediate node based on deep reinforcement learning: the process in which the current intermediate node i re-encodes the encoded packet P_j received from its previous-hop node j is regarded as an MDP with M (for example, M = 10) decision steps.
  • In the k-th decision step (k = 1, 2, ..., M), intermediate node i determines the coding coefficient of the k-th encoded packet in its buffer and XORs that packet with the current encoded packet P_new.
  • In the first decision step, i.e., when k = 1, P_new = P_j.
  • The two modules in the deep reinforcement learning-based intermediate-node coding-coefficient optimization model, namely the deep reinforcement learning agent and the network environment, are designed as follows: the intermediate node is regarded as the deep reinforcement learning agent;
  • the abstracted environment is the network formed by the current intermediate node i and the next-hop node of intermediate node i, including intermediate node i, the next-hop node of intermediate node i, and the link between intermediate node i and its next-hop node z;
  • the deep reinforcement learning agent observes the environment state s_k of the current decision step k, takes an action a_k on the environment according to s_k, and the environment feeds back a reward r_k to the agent, realizing the interaction between the deep reinforcement learning agent and the environment.
  • At the current decision step k, the environment state s_k observed by intermediate node i is:
  • the current encoded packet P_new, the k-th encoded packet P_j(k) in the buffer of intermediate node i,
  • and the M (M = 10, for example) most recently received encoded packets stored in the buffer of the next-hop node z of intermediate node i, i.e., s_k = {P_new, P_j(k), B̂_z} with B̂_z = {P_i(1), ..., P_i(M)}, where P_i(l) is the l-th encoded packet in the buffer of the next-hop node z of intermediate node i,
  • and P_j(1), ..., P_j(M) were received earlier than P_j.
  • At each decision step k, intermediate node i executes an action a_k ∈ A_R that determines the coding coefficient g(P_j(k)) = a_k of the k-th packet in its buffer, where A_R = {0, 1, ..., (q-1)} and q is the size of the Galois field.
  • The reward r_k received from the environment is: r_k = 1 when the encoded packet sent by intermediate node i increases the rank of the linear system formed by the encoded packets in the buffer of its next-hop node z, and r_k = 0 otherwise.
  • After the k-th decision step, the current encoded packet P_new is re-encoded as P_new = P_new ⊕ (g(P_j(k)) · P_j(k)); after M decision steps, the encoded packet P_j received by intermediate node i from its previous-hop node j has been re-encoded M times, and intermediate node i finally sends the encoded packet P_new produced by the last decision step M.
  • In one embodiment, the current node keeps the encoded packets sent to the next-hop node, forming the next-hop node buffer B̂_z maintained on the intermediate node, and confirms through the ACK returned by the next-hop node after sending whether the next-hop node has received the encoded packet. If this node does not receive an ACK, the next-hop node has not received the encoded packet and B̂_z does not change; that is, when intermediate node i sends the next encoded packet, the B̂_z in its state s_k is unchanged relative to sending the current packet.
  • If this node receives an ACK, the next-hop node has successfully received the encoded packet and B̂_z changes; that is, when intermediate node i sends the next encoded packet, the B̂_z in its state s_k has changed relative to sending the current packet. Whether the ACK packet is received is thus determined by the link quality, and the link quality in turn affects the encoded packets stored in the buffer B̂_z, so the intermediate node's coding model can adaptively adjust the coding coefficients according to changes in the network link quality.
  • In one embodiment, after all M steps have been executed the packet is sent to the next-hop node and the rewards of steps 1 to M are determined; the rewards of these M steps are the same. Since this node keeps in the buffer B̂_z the encoded packets accepted by the next-hop node, regardless of whether an ACK is received, the node can evaluate the action according to whether the sent encoded packet changes the rank of the linear system formed by the encoded packets in B̂_z (the complete M-step procedure is sketched below).
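  • The whole M-step re-encoding procedure on an intermediate node could be sketched as follows; q_network and select_coefficient are placeholders for the node-side DQN N_Rnode and the greedy selection above, and the state is simply concatenated here for illustration:

```python
import numpy as np

def reencode_with_dqn(p_j, own_buffer, next_hop_buffer, q_network, select_coefficient):
    """M decision steps on intermediate node i: at step k the model reads the state
    s_k = {P_new, P_j(k), B_hat_z}, picks g(P_j(k)), and XORs g(P_j(k)) * P_j(k) into P_new.
    The packet produced by the last decision step M is the one actually sent."""
    p_new = np.array(p_j).copy()
    for p_jk in own_buffer:                                    # k = 1 ... M
        state = np.concatenate([p_new, np.array(p_jk), np.concatenate(next_hop_buffer)])
        g = select_coefficient(q_network(state))               # action a_k in {0, ..., q-1}
        p_new ^= (g * np.array(p_jk)) % 2                      # GF(2) XOR re-encoding step
    return p_new

# toy usage with a dummy Q function and a purely greedy selector
p_out = reencode_with_dqn(p_j=[1, 0, 1, 0],
                          own_buffer=[[0, 1, 1, 0], [1, 1, 0, 0]],
                          next_hop_buffer=[np.array([1, 0, 0, 1])],
                          q_network=lambda s: np.array([0.3, 0.6]),
                          select_coefficient=lambda q: int(np.argmax(q)))
print(p_out)
```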
  • Fig. 3 is a block diagram showing the functional configuration of a device for deep reinforcement learning-based intelligent network coding according to an embodiment of the present invention.
  • The device includes: a source-node coding-coefficient optimization unit, configured to optimize the coding coefficients of the data packets on the source node through the source node's deep reinforcement learning coding-coefficient optimization model; an intermediate-node coding-coefficient optimization unit, configured to optimize the coding coefficients of the data packets on the intermediate nodes through the intermediate nodes' deep reinforcement learning coding-coefficient optimization model; an intelligent network coding unit, configured to encode the information according to the optimized coding coefficients; and a data-packet forwarding unit, configured to forward the encoded data packets.
  • This example uses TensorFlow 1.15 on Python 3.5 to construct the deep reinforcement learning-based intelligent network coding method of the present invention and the architecture of its deep neural network.
  • A multi-hop linear network topology and a multi-hop parallel network topology, each with a single source, multiple intermediate nodes and a single destination, are considered.
  • Figure 4 shows the multi-hop linear network topology,
  • and Figure 5 shows the multi-hop parallel network topology.
  • Decoding rate: the probability that the destination node can successfully decode (recover the original information) after receiving P data packets.
  • Overhead: used to measure the decoding efficiency of the different coding algorithms; it is defined in terms of
  • K, the number of packets into which a message is divided,
  • E, the number of redundant packets when network coding is used,
  • and Nr, the number of packets received at the destination node (the quantity appearing in the numerator of the overhead formula).
  • PER: Packet Error Rate, used in this patent to represent link quality.
  • SINR: Signal to Interference plus Noise Ratio.
  • In the multi-hop linear topology with K = 10, the overhead when the number N of intermediate nodes equals 2, 4, 6 and 8 is 2.5%, 4.2%, 4.5% and 5.2%, respectively.
  • The more intermediate nodes there are, the longer the path a packet must traverse to reach the destination and the larger the overall packet loss; some packets cannot reach the destination, so the source node must send more redundant information, which increases the number of packets Nr finally received by the destination node (the numerator in the overhead formula), and the overhead therefore increases.
  • Figure 7 shows the results in the multi-hop parallel network topology, where the packet error rate of the links between the source node and the intermediate nodes is 0.1, that of the links between the intermediate nodes and the destination node is 0.3, and that of the link between the source node and the destination node is 0.8.
  • In this example of the present invention, the decoding rate depends on the number of data packets sent by the source node and on the number of intermediate nodes. As the number of packets sent by the source node and the number of intermediate nodes increase, the probability that the destination node decodes successfully improves. In addition, for the same number of packets received by the destination node, the larger K is, the lower the probability that the destination node can decode.
  • Fig. 8 shows the generalization ability of the present invention over different numbers of intermediate nodes under the linear topology in which the packet error rate of every link is 0.1: a DQN model trained with N = 1 intermediate node is used to test the decoding rate for other numbers of intermediate nodes, and these results (Test N = i, Train N = 1, i = 2, 4, 6, 8) closely match the results obtained when training and testing with the same number of intermediate nodes (Test N = i, Train N = i), with root mean square errors of 0.0034, 0.0072, 0.011 and 0.015 for N = 2, 4, 6 and 8, which verifies the generalization ability of the method over different network scales.
  • Fig. 9 shows the generalization ability of the present invention over different link qualities in the linear topology with N = 1: a DQN model trained under the link qualities PER_{S-R1} = 0.3 and PER_{R1-D} = 0.3 is tested under other link-quality pairs, and the test results closely match those of models trained and tested under the same link quality, with root mean square errors of 0, 0.002 and 0.003 for the pairs (PER_{S-R1} = 0, PER_{R1-D} = 0), (0.1, 0.3) and (0.1, 0.5), which verifies the generalization ability of the method over different link qualities.
  • Finally, the performance of this example is evaluated on a real test platform: the source-node coding-coefficient optimization unit, the intermediate-node coding-coefficient optimization unit, the intelligent network coding unit and the data-packet forwarding unit of the present invention are deployed on Raspberry Pi 3 B+ devices. The Raspberry Pi 3 B+ features a 1.4 GHz ARM A53 processor, 1 GB of RAM, and integrated wireless and Bluetooth capabilities.
  • TensorFlow Lite is used to deploy the DQN model trained in this example onto the Raspberry Pi 3 B+.
  • This example is compared with a traditional benchmark coding algorithm and an existing reinforcement learning-based coding algorithm (RL-aided SNC: Dynamic Sparse Coded Multi-Hop Transmissions using Reinforcement Learning).
  • In the benchmark coding algorithm, the source node uses a traditional fountain code
  • and the intermediate nodes use a random network coding algorithm.
  • The experimental results of this example show that the deep reinforcement learning-based intelligent network coding method of the present invention achieves a higher decoding rate and lower overhead than the existing coding methods.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present invention provides a network coding method based on deep reinforcement learning. The method includes: a source node divides the information to be sent into K slices, determines the coding coefficient of each slice according to a source-node coding model, generates an encoded packet and sends it to the next-hop node; an intermediate node receives the encoded packet sent by the previous node, re-encodes the received encoded packet, determines the coding coefficients according to an intermediate-node coding model, generates an encoded packet and sends it to the next-hop node; the source-node and intermediate-node coding models are obtained by training DQN networks. The present invention can adaptively adjust the coding coefficients according to dynamic changes of the network, improve decoding efficiency, and has good model generalization ability, generalizing to networks with different network scales and different link qualities. The respective coding-coefficient optimization models are executed in a distributed manner on the source node and the intermediate nodes, which simplifies the implementation of coding-coefficient optimization and improves the stability of DQN training.

Description

一种基于深度强化学习的智能网络编码方法和设备 技术领域
本发明涉及信息技术领域,尤其涉及网络编码方法。
背景技术
线性网络编码是一类网络编码,由选自有限域中的编码系数对数据进行线性组合。与利用非线性组合函数的非线性网络编码相比,线性网络编码具有较低的复杂度和更简单的模型,因此已经得到了深入的研究和广泛的应用。
线性网络编码的基本思想是网络中的节点通过从有限域中选取编码系数对原始数据进行线性编码以形成新的编码数据并进行转发,接收节点通过相应的解码操作可以恢复出原始数据。线性网络编码方法主要包括确定性网络编码算法和随机线性网络编码算法。确定性网络编码算法可以保证目标节点成功解码,但是它需要全局信息,例如网络拓扑和链路容量。现实中存在多种拓扑,为不同类型的网络设计特定的编码方法不切实际。此外,它不适用于动态网络,因为从分布式节点实时收集全局信息非常复杂,无法大规模应用。在随机线性网络编码中,节点使用独立、随机选取在某限域的编码系数,对需要发送的数据进行线性组合。相关研究已经证明,只要有限域足够大,随机线性网络编码可以确保每个接收节点能够以较高的概率完成解码,即接收节点对应的全局编码系数矩阵是满秩的。由于随机线性网络编码的主要特征是随机选择线性组合的系数,因此随机线性网络编码适用于拓扑未知或变化的网络,因为它可以轻松地以分布式方式实现。例如一个具有编码能力的节点有三个数据包X、Y、Z需要发送,该节点可以随机选取编码系数a 1、a 2、a 3、b 1、b 2、b 3、c 1、c 2、c 3,然后使用编码系数将数据包组合为a 1X+a 2Y+a 3Z、b 1X+b 2Y+b 3Z、c 1X+c 2Y+c 3Z,再将这些组合发送出去。接收节点收到3个编码组合后,当矩阵
Figure PCTCN2021118099-appb-000001
满轶时,通过线性运算,可以解出原始信息包X、Y、Z。
各种原因都可能造成解码失败,不仅是由中间节点所提取线性相关系数造成,也有可能是因为网络不稳定导致的丢包使得中间节点未接收到一些用于解码的分组。在随机线性网络编码中,系数是从一个伽罗华域中以相等的概率随机提取的。因此,这种编码方法无法根据网络动态变化(包括网络链路质量和中间节点数量的变化)来调整编码系数造成的解码效率低的问题。
发明内容
本发明针对上述问题,根据本发明的第一方面,提出一种网络编码方法,所述网络包括源节点和中间节点,所述方法包括:
源节点将要发送的信息划分成K个片x 1,x 2,…,x K,K为大于1的整数,根据源节点编码模型确定每个片的编码系数g(x 1),g(x 2),...,g(x K),将K个片编码,生成编码包P S,并向下一跳节点发送编码包P S,其中所述源节点编码模型通过对DQN网络训练得到,其中使用各步环境状态
Figure PCTCN2021118099-appb-000002
作为训练输入,ss k为第k步的环境状态,x k为信息包的第k个片,
Figure PCTCN2021118099-appb-000003
为该源节点的下一跳中间节点的缓冲区里所存储的近期收到的M个编码包,M为大于1的整数;
中间节点接收前一节点发送的编码包,将收到的编码包P j编码M次,根据中间节点编码模型确定每次的编码系数g(P j(1)),g(P j(2)),…g(P j(M)),生成编码包P new,并向下一跳节点发送编码包P new,其中所述中间节点编码模型通过对DQN网络训练得到,其中使用各步环境状态
Figure PCTCN2021118099-appb-000004
作为训练输入,s k为第k步的环境状态,P new为当前编码包,P j(k)为该中间节点缓冲区中的第k个编码包,
Figure PCTCN2021118099-appb-000005
为该中间节点下一跳节点z的缓冲区里所存储的近期收到的M个编码包。
在本发明的一个实施例中,其中所述源节点编码模型包括目标网络N s和执行网络N snode,所述源节点编码模型的训练包括步骤:
步骤110:从经验回放存储器M s中随机采样经验来训练N s
步骤120:将N s训练后的DQN参数发给源节点,以对N snode进行更新;和/或
步骤130:在源节点上,将环境状态ss k作为N snode的DQN模型的输入,输出每个行为对应的Q值,以贪心策略概率ε选择行为来决定原始信息的K个片的编码系数,执行后,收集源节点与环境交互的经验,并将该经验存 储到经验回放存储器M s中。
在本发明的一个实施例中,其中中间节点编码模型包括目标网络N R和执行网络N Rnode,所述中间节点编码模型的训练包括:
步骤210:在经验回放存储器M R中随机采样经验来训练N R
步骤220:将N R训练后的DQN参数发给各中间节点,以对N Rnode进行更新;和/或
步骤230:在各中间节点上,将环境状态s k作为N Rnode的DQN模型的输入,输出每个行为对应的Q值,以贪心策略概率ε选择行为来决定中间节点缓冲区的M个包的编码系数,执行后,收集中间节点与环境交互的经验,并将该经验存储到经验回放存储器M R中。
在本发明的一个实施例中,其中对N s进行训练包括:
将网络编码的环境状态ss k做为N s的输入,通过最小化损失函数
Figure PCTCN2021118099-appb-000006
对神经网络进行训练,k取值为1…K,其中Q t arg et为N s计算的目标Q值;
a k表示第k步的行为;
r k表示第k步采取行为后的奖励;
θ k表示第k步的DQN的网络参数。
在本发明的一个实施例中,其中对N R进行训练包括:
将网络编码的环境状态s k做为N R的输入,通过最小化损失函数
Figure PCTCN2021118099-appb-000007
对神经网络进行训练,k取值为1…M,其中
Q t arg et为N R计算的目标Q值;
a k表示第k步的行为;
r k表示第k步采取行为后的奖励;
θ k表示第k步的DQN的网络参数。
在本发明的一个实施例中,其中对于N s
a k为信息的第k个片x k的编码系数,a k∈A S,其中,A S={0,1,...,(q-1)},q是伽罗华域的域值大小;
当该源节点发送的编码包能够使得由该源节点的下一跳中间节点缓冲区里的编码包所形成的线性系统的秩增加时,r k为1,否则,r k为0。
在本发明的一个实施例中,其中对于N R
a k为第k个包的编码系数,a k∈A R,其中A R={0,1,...,(q-1)},q是伽罗华域 的域值大小;
当该中间节点发送的编码包能够使得由该中间节点的下一跳节点缓冲区里的编码包所形成的线性系统的秩增加时,r k为1,否则,r k为0。
在本发明的一个实施例中,其中,如果源节点没有收到ACK,源节点的ss k
Figure PCTCN2021118099-appb-000008
不变;如果中间节点没有收到ACK,该中间节点的s k
Figure PCTCN2021118099-appb-000009
不变。
在本发明的一个实施例中,其中源节点通过以下方式生成编码包P S
P S=G S·X,其中,X=[x 1,x 2,...,x K],G S=[g(x 1),g(x 2),...,g(x K)]。
在本发明的一个实施例中,其中,中间节点的M次编码中的第k次编码包括:
k=1时,
Figure PCTCN2021118099-appb-000010
k>1时,
Figure PCTCN2021118099-appb-000011
P j(k)为该中间节点的缓冲区中的第k个编码包,k取值为1…M。
根据本发明的第二方面,提供一种计算机可读存储介质,其中存储有一个或者多个计算机程序,所述计算机程序在被执行时用于实现本发明的网络编码方法。
根据本发明的第三方面,提供一种网络编码的计算系统,包括存储装置、以及一个或者多个处理器;其中,所述存储装置用于存储一个或者多个计算机程序,所述计算机程序在被所述处理器执行时用于实现本发明的网络编码方法。
与现有技术相比,本发明的实施例的优点在于:
本发明与现有技术相比,具有以下优点:
1.由于本发明创新性地提出了利用深度强化学习自适应地选择编码系数方法,与现有技术相比,本发明可以根据网络动态变化(包括网络链路质量和中间节点数量的变化)来自适应地调节编码系数,以适应高动态变化的网络环境,改善解码效率。
2.本发明使用马尔科夫决策过程(MDP)来制定编码系数优化问题,其中网络变化可以自动且连续地表示为MDP状态转换。此外,本发明具备良好的模型泛化能力,能泛化于具有不同网络规模和不同链路质量下的网络,使得该发明可以适应网络的动态变化。
3.本发明实现了分布式编码系数优化机制,基于深度Q网络(Deep  Q-learning Network,DQN)的编码系数优化模型网络被预先设置的优化器集中式训练,同时,基于DQN的源节点和中间节点的编码系数优化模型分别在源节点和中间节点上分布式执行,进而简化了编码系数优化实施并且改善了DQN训练的稳定性。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本发明的实施例,并与说明书一起用于解释本发明的原理。显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。在附图中:
图1示出了根据本发明实施例的源节点网络编码流程图;
图2示出了根据本发明实施例的中间节点网络编码流程图;
图3示出了根据本发明实施例的用于深度强化学习智能网络编码的设备的功能配置框图;
图4示出了根据本发明实施例的多跳线性网络拓扑图;
图5示出了根据本发明实施例的多跳并行网络拓扑图;
图6示出了根据本发明实施例的多跳线性网络的仿真实验结果图;
图7示出了根据本发明实施例的多跳并行网络的仿真实验结果图;
图8示出了根据本发明实施例的在不同网络规模上的泛化能力的仿真实验结果图;
图9示出了根据本发明实施例的在不同链路质量上的泛化能力的仿真实验结果图;
图10示出了本发明实施例、基准编码算法与RL-aided SNC三种方法的仿真实验结果与真实试验平台结果的对比图。
具体实施方式
针对背景技术指出的问题,发明人进行了研究,提出了一种基于深度强化学习的网络编码方法,下面结合附图和具体实施例,对本方法进行详细描述。
概括说来,在本发明中,网络包括源节点、中间节点和接收信息的目的节点。信息产生于源节点,由源节点发出,经过中间节点,最终由目的节点 接收。源节点将信息划分成多个片,确定每个片的编码系数,将这些片编码,生成编码包,并向下一跳节点发送编码包。中间节点接收编码包,为收到的编码包确定每个包的编码系数,将多个编码包再次编码,生成新的编码包,并向下一跳节点发送新的编码包。
本发明采用深度强化学习方法DQN确定编码系数,DQN方法的模型中包含多个步骤,多个环境状态,在每个环境状态可采取多种行为,每种行为对应不同的奖励。在本发明中,每个步骤对应于为每一片或每个包确定编码系数,在该步骤的行为为所确定的编码系数,环境状态为相关的信息片或多个编码包。DQN使用Q值评价每个行为,在每个环境状态下的多个行为中,使得Q值最大的行为为最佳行为,也就是在该环境状态下应当采取的行为。DQN要从整体上寻找最佳方案,因此该最佳行为是从一系列行为的整体上评价的,即在当前环境状态下此行为可使所有步骤累积奖励最佳。
Q值的计算基于奖励,采用如下公式:Q k=r k+γmax Q k+1,k为正整数,第k步的Q值Q k依赖于k+1步的Q值,具体地,为k+1步所有行为的Q值中的最大值max Q k+1,γ为折扣因子,0≤γ≤1,r k为第k步奖励,而最后一步的Q值即为最后一步的奖励。
DQN通过训练神经网络,使神经网络可以计算每个环境状态的每个行为对应的Q值。DQN的训练方法为从真实环境中采集输入与输出,其中,输入为环境状态,输出为行为的Q值,将环境状态输入卷积神经网络CNN后,计算出行为的Q值,使用损失函数来表达计算的目标Q值与真实Q值之间的误差,以减小该误差为目的,对神经网络参数进行训练。在实际执行过程中,尽管Q值最大的行为为最佳行为,但是为了平衡学习与探索,会尝试采取新的行为,例如采用贪心策略,即用较小的概率ε(ε<1)选择采取未知的行为,而用1-ε选择采取通过学习已知的Q值最大的行为。
现有的DQN还包括:样本回放缓冲区(Replay Buffer)或者叫做经验回放(Experience replay),以及目标网络(Target Network)。为了减轻相关问题带来的影响,尽可能地将训练与执行两个部分解耦,本发明引入了一个新的网络,仍然命名为目标网络(Target Network),而将原本的目标网络称为执行网络(Behavior Network)。
在训练开始时,两个网络模型使用完全相同的参数。在执行过程中,Behavior Network负责与环境交互,得到交互样本。在训练过程中,由Q-Learning得到的目标Q值由Target Network计算得到;然后用它和Behavior  Network在与环境交互中获得的Q值进行比较,得出误差,通过减小误差,对Target Network进行训练,不断更新Target Network的模型,再将更新后的模型同步到Behavior Network,更新Behavior Network的模型。
每当训练完成一定轮数的迭代,Behavior Network模型的经验就会同步给Target Network,这样就可以进行下一个阶段的训练了。通过使用Target Network,计算Q值的模型在一段时间内将被固定,这样模型可以减轻模型的波动性。
本发明的Target Network包括两个神经网络N s和N R,N s用于源节点,由预先设置的优化器O s训练,N R用于所有中间节点,由预先设置的优化器O R训练,O s和O R各有一个存储器用于存储经验,经验包括各步骤的环境状态、行为、奖励。O s的存储器为M s,O R的存储器为M R。本发明的Behavior Network包括在源节点上部署的一套神经网络N snode以及在所有的中间节点上都各自部署的一套神经网络N Rnode。N snode为对N s的复制,N Rnode为对N R的复制。对N snode和N Rnode不进行训练,而是在各节点上对它们输入环境状态后获取行为对应的Q值。
本发明的基于深度强化学习的网络编码方法包括两个部分:集中式训练过程和分布式执行过程。在集中式训练过程中,基于DQN的编码系数优化模型网络被预先设置的优化器集中式训练。在分布式执行过程,基于DQN的源节点和中间节点的编码系数优化模型分别在源节点和中间节点上分布式执行,并将执行产生的经验送回优化器进行训练,边执行,边训练,以加快DQN的训练的速度。
(1)在集中式训练过程中,源节点优化器O s从经验回放存储器M s中随机采样经验来训练源节点N s的DQN,输入源节点环境状态ss k(源节点环境状态的具体内容将在下文详细描述),通过最小化预先设置的损失函数对神经网络N s进行训练,N s的输出为该环境状态ss k下选择行为a k后获得最优累积奖励值Q值Q k。其中,损失函数为:
Figure PCTCN2021118099-appb-000012
Figure PCTCN2021118099-appb-000013
在损失函数中,Q t arg et为N s计算出的目标Q值,q(ss k,a k;θ k)为根据经验所知在该环境状态ss k下,选择行为a k后的Q值,θ k表示在当前决策步k下的所述DQN的网络参数。
同样的,中间节点优化器O R从经验回放存储器M R中随机采样经验来训练中间节点N R的DQN,输入中间节点环境状态s k(中间节点环境状态的具体内容将在下文详细描述),通过最小化预先设置的损失函数对神经网络N R 进行训练,N R的输出为该环境状态s k下选择行为a k后获得最优累积奖励值Q值Q k。其中,损失函数为:
Figure PCTCN2021118099-appb-000014
在损失函数中,Q t arg et为N R计算出的目标Q值,q(s k,a k;θ k)为根据经验所知在该环境状态s k下,选择行为a k后的Q值,θ k表示在当前决策步k下的所述DQN的网络参数。
一旦DQN的参数被更新,集中优化器O s和O R会将更新后的DQN参数发送给网络中的每个源节点和中间节点。源和中间节点利用所收到DQN参数更新该节点上的神经网络N snode和N Rnode的DQN参数。
(2)在分布式执行过程中,对于源节点,根据其所观察到的当前环境状态ss k,将ss k作为源节点N snode的DQN模型的输入,输出每个行为对应的Q值,以贪心策略概率ε(例如ε=0.1)选择一个行为来决定原始信息的第k个片的编码系数,一个行为a k执行后,该源节点获得一个奖励值r k,优化器O s收集源节点与环境交互的经验(ss k,a k,r k,ss k+1),并将该经验存储到经验回放存储器M s中;对于中间节点i,根据其所观察到的环境状态s k,将s k作为中间节点N Rnode的DQN模型的输入,输出每个行为对应的Q值,以贪心策略概率ε(例如ε=0.1)选择一个行为来决定该中间节点缓冲区的第k个包的编码系数,一个行为a k执行后,该中间节点获得一个奖励值r k,优化器O R收集中间节点与环境交互的经验(s k,a k,r k,s k+1),并将经验存储到经验回放存储器M R中。
以下结合本发明的实施例介绍源节点和中间节点编码的具体方法及其对应的环境状态、行为和奖励。
源节点的编码与其对应的环境状态、行为和奖励
图1示出了基于深度强化学习的源节点的编码过程:一个信息X(X=[x 1,x 2,…,x K])划分成K个片,K为大于1的整数,这K个片的编码系数优化过程视为一个马尔科夫过程(MDP),该MDP包含了K个决策步,在第k(k=1,2,…,K)个决策步中,第k个片x k的编码系数被确定;
具体地,基于深度强化学习的源节点编码系数优化模型中的深度强化学习智能体与网络环境两大模块设计如下:
(1)源节点视为深度强化学习智能体;
(2)抽象环境为由源节点和该源节点的所有下一跳中间节点形成的网络,包括源节点、该源节点的所有下一跳中间节点,以及该源节点与该源节点的所有下一跳中间节点所形成的链路。
(3)深度强化学习智能体观察当前决策步k的环境状态ss k,并根据环境状态ss k采取一个行为a k作用于环境,环境将反馈一个奖励r k给深度强化学习智能体,以实现深度强化学习智能体与环境的交互。
根据本发明的一个实施例,在当前决策步k下,该源节点所观察到的环境状态ss k为:
环境状态ss k包括一个信息包的第k个片x k和该源节点的下一跳中间节点的缓冲区里所存储的近期收到的M(例如M=10)个编码包
Figure PCTCN2021118099-appb-000015
M为大于1的整数,即
Figure PCTCN2021118099-appb-000016
其中,P S(l)是该源节点的下一跳中间节点缓冲区中的第l个编码包。
具体地,在当前环境状态ss k下,该源节点执行行为a k
在每个决策步k下,该源节点选择一个行为a k∈A S来决定信息包的第k个片x k的编码系数g(x k),g(x k)=a k,其中,A S={0,1,...,(q-1)},q是伽罗华域(Galois field)的域值大小,在一个实施例中,q=2,在另一个实施例中,q为正整数。
根据本发明的一个实施例,在当前环境状态ss k下,该源节点执行行为a k后,收到来自环境的奖励r k为:
当该源节点发送的编码包能够使得由该源节点的下一跳中间节点缓冲区里的编码包所形成的线性系统的秩增加,r k=1,否则,r k=0。
经历K个决策步后,一个信息包的K个片的编码系数均被确定,那么源节点利用所确定的编码系数对K个片进行编码并发送编码后的包P S,P S=G S·X,其中,X=[x 1,x 2,...,x K],G S=[g(x 1),g(x 2),...,g(x K)]。
在一个实施例中,本节点保留发给下一跳节点的编码包以形成源节点上的下一跳中间节点缓冲区
Figure PCTCN2021118099-appb-000017
并通过发送后下一跳节点反馈的ACK确认下一跳节点是否收到编码包。如果本节点没有收到ACK,这说明下一跳节点没有收到编码包,则
Figure PCTCN2021118099-appb-000018
不会发生变化,即源节点发送下一个编码包时,其状态ss k中的
Figure PCTCN2021118099-appb-000019
相对于发送当前包并没有发生改变。如果本节点收到ACK,这说明下一跳节点成功收到编码包,则
Figure PCTCN2021118099-appb-000020
发生变化,即该源节点发送下一个编码包时,其状态ss k中的
Figure PCTCN2021118099-appb-000021
相对于发送当前包发生了改变。由此可见ACK包是否接受是由链路质量决定的,进而链路质量会影响缓冲区
Figure PCTCN2021118099-appb-000022
所存储的编码包,所以源节点的编码模型可以根据网络链路质量的变化来自适应地调节编码系数。
在一个实施例中,在所有K个步骤都执行完,发送给下一跳节点,确定 步骤1至K的K个步骤的奖励,这K个步骤的奖励相同。由于本节点在缓冲区
Figure PCTCN2021118099-appb-000023
中保留了下一跳节点所接受的编码包,因此不论本节点是否收到ACK,本节点都可以根据所发送的编码包是否会改变
Figure PCTCN2021118099-appb-000024
里编码包所形成的线性系统的秩来评价行为。
中间节点的编码与其对应的环境状态、行为和奖励
图2示出了基于深度强化学习的中间节点编码过程,当前中间节点i对所收到的来自该中间节点i的上一跳节点j的编码包P j再次编码的过程视为一个马尔科夫过程(MDP),该MDP包含了M(例如M=10)个决策步,在第k(k=1,2,…,M)个决策步中,该中间节点i决定该中间节点i的缓冲区里的第k个编码包的编码系数,并将第k个编码包与当前编码包P new进行异或操作。在第一个决策步中,即k=1时,P new=P j
根据本发明的一个实施例,基于深度强化学习的中间节点编码系数优化模型中的深度强化学习智能体与网络环境两大模块设计如下:
(1)中间节点视为深度强化学习智能体;
(2)抽象环境为由当前中间节点i和该中间节点i的下一跳节点形成的网络,包括该中间节点i、该中间节点i的下一跳节点,以及该中间节点i与该中间节点i的下一跳节点z所形成的链路;
(3)深度强化学习智能体观察当前决策步k的环境状态s k,并根据环境状态s k采取一个行为a k作用于环境,环境将反馈一个奖励r k给深度强化学习智能体,以实现深度强化学习智能体与环境的交互。
根据本发明的一个实施例,在当前决策步k下,该中间节点i所观察到的环境状态s k为:
环境状态s k包括当前编码包P new,该中间节点i缓冲区
Figure PCTCN2021118099-appb-000025
中的第k个编码包P j(k)以及该中间节点i的下一跳节点z的缓冲区里所存储的近期收到的M(M=10)个编码包
Figure PCTCN2021118099-appb-000026
Figure PCTCN2021118099-appb-000027
其中,P i(l)是该中间节点i的下一跳节点z的缓冲区中的第l个编码包,并且P j(1),P j(l),…,P j(M)的接收早于P j的接收。
根据本发明的一个实施例,在当前环境状态s k下,该中间节点i执行行为a k
在每个决策步k下,该中间节点i选择一个行为a k∈A R来决定该中间节 点缓冲区里的第k个包的编码系数g(P j(k)),g(P j(k))=a k,其中,A R={0,1,...,(q-1)},q是伽罗华域(Galois field)的域值大小,在一个实施例中,q=2,在另一个实施例中,q为正整数。
根据本发明的一个实施例,在当前环境状态s k下,该中间节点i执行行为a k后,收到来自环境的奖励r k为:
当该中间节点i发送的编码包能够使得由该中间节点i的下一跳节点z缓冲区里的编码包所形成的线性系统的秩增加,r k=1;否则,r k=0。
第k个决策步后,当前编码包P new被重新编码,即
Figure PCTCN2021118099-appb-000028
Figure PCTCN2021118099-appb-000029
特别地,当k=1,
Figure PCTCN2021118099-appb-000030
经历M个决策步后,该中间节点i收到的来自其上一跳节点j的编码包P j被重新编码M次,最终该中间节点i发送最后一个决策步M编码后的编码包P new
在一个实施例中,本节点保留发给下一跳节点的编码包以形成中间节点上的下一跳中间节点缓冲区
Figure PCTCN2021118099-appb-000031
并通过发送后下一跳节点反馈的ACK确认下一跳节点是否收到编码包。如果本节点没有收到ACK,这说明下一跳节点没有收到编码包,则
Figure PCTCN2021118099-appb-000032
不会发生变化,即该中间节点i发送下一个编码包时,其状态s k中的
Figure PCTCN2021118099-appb-000033
相对于发送当前包并没有发生改变。如果本节点收到ACK,这说明下一跳节点成功收到编码包,则
Figure PCTCN2021118099-appb-000034
发生变化,即该中间节点i发送下一个编码包时,其状态s k中的
Figure PCTCN2021118099-appb-000035
相对于发送当前包发生了改变。由此可见ACK包是否接受是由链路质量决定的,进而链路质量会影响缓冲区
Figure PCTCN2021118099-appb-000036
所存储的编码包,所以中间节点的编码模型可以根据网络链路质量的变化来自适应地调节编码系数。
在一个实施例中,在所有M个步骤都执行完,发送给下一跳节点,确定步骤1至M的M个步骤的奖励,这M个步骤的奖励相同。由于本节点在缓冲区
Figure PCTCN2021118099-appb-000037
中保留了下一跳节点所接受的编码包,因此不论本节点是否收到ACK,本节点都可以根据所发送的编码包是否会改变
Figure PCTCN2021118099-appb-000038
里编码包所形成的线性系统的秩来评价行为。
图3示出了根据本发明实施例的用于深度强化学习的智能网络编码的设备功能配置框图。该设备包括:源节点编码系数优化单元,配置为通过源节点的深度强化学习编码系数优化模型来优化源节点上的数据包的编码系数;中间节点编码系数优化单元,配置为通过中间点的深度强化学习编码系数优化模型来优化中间节点上的数据包的编码系数;智能网络编码单元,配置为 根据优化的编码系数对信息进行编码;数据包转发单元,配置为转发编码后的数据包。
下面对本发明的仿真和平台验证实验对于本发明的效果给予说明。
本实例使用基于Python3.5的框架TensorFlow 1.15来构建本发明所述的一种基于深度强化学习的智能网络编码方法及其深度神经网络的体系结构。在本实例中,考虑了一个具有单源,多中间节点和单目的地的多跳线性网络拓扑和多跳并行网络拓扑,图4示出了多跳线性网络拓扑图,图5示出了多跳并行网络拓扑图。
使用解码率和开销这2个性能指标对本发明所述的一种基于深度强化学习的智能网络编码方法进行评估。在分析实验结果之前,先对本实验所涉及的概念和术语进行简单的说明:
解码率:目的节点收到P个数据包后,可以成功解码(恢复原始信息)的概率;
开销:用于衡量不同编码算法的解码效率,我们可以定义开销
Figure PCTCN2021118099-appb-000039
其中,K是一个信息被划分的包的数量,E是使用网络编码时多余的数据包数量,Nr是在目标节点接收的数据包数量。
链路质量:本专利用包错误率(Packet error rate,PER)表示链路质量。对给定的信号与干扰加噪声比(Signal to Interference plus Noise Ratio,SINR)值γ,数据包错误传输的概率
Figure PCTCN2021118099-appb-000040
其中N b是一个数据包的大小(单位:bit);BER(γ)是对给定的SINR值γ的位错误率,它取决于物理层采用的技术和信道的统计特征。
图6显示了在多跳线性网络拓扑中,每条链路的包的错误率是0.1的情况下,本发明实例的解码率与源节点发送数据包的数量和中间节点数量的关系。可以看出,随着源节点发送数据包的数量的增加和中间节点数量的增加,目标节点成功解码的概率被改善。此外,在目标节点收到相同数据包的情况下,K越大,目标节点解码的概率越低。在K=5的情况下,中间节点的数量(N)等于2,4,6,8时的开销分别是12.2%、15.1%、19.2%和20.1%。在K=10的情况下,中间节点的数量(N)等于2,4,6,8时的开销分别是2.5%,4.2%,4.5%和5.2%。中间节点数越多,数据包经过更长的路径(更多的中间节点)才能传到目的节点,包的总丢失率较大,有些信息包无法传到目的节点,因此源节点需要发送很多冗余的信息,导致目的节点最终收到的数据包数Nr增 加(开销公式中的分子),所以开销会增大。
图7显示了在多跳并行网络拓扑中,源节点与中间节点间的链路的包的错误率是0.1,中间节点与目标节点间的链路的包的错误率是0.3,源节点与目标节点间的链路的包的错误率是0.8情况下,本发明实例的解码率与源节点发送数据包的数量和中间节点的数量关系。可以看出,随着源节点发送数据包的数量的增加和中间节点数量的增加,目标节点成功解码的概率被改善。此外,在目标节点收到相同数据包的情况下,K越大,目标节点解码的概率越低。在K=5的情况下,中间节点的数量(N)等于2,6,10,14时的开销分别是12.2%、15.1%、19.2%和20.1%。在K=10的情况下,中间节点的数量(N)等于2,6,10,14时的开销分别是4.8%,4.1%,3.8%和3.1%。
图8显示了在线性拓扑下,每条链路的包的错误率是0.1的情况下,本发明在不同中间节点数量上的泛化能力。我们首先在中间节点数量N=1的情况下,为本发明实例训练一个DQN模型,定义为Train N=1。然后我们使用训练好的DQN模型来测试在其他中间节点数量下的解码率,我们将这些测试结果定义为(Test N=i,Train N=1,i=2,4,6,8)。最后,我们将这些结果与在相同中间节点数量下的训练和测试结果(定义为Test N=i,Train N=i,i=2,4,6,8)进行比较。可以看出,在本发明事例中,(Test N=i,Train N=1,i=2,4,6,8)结果与(Test N=i,Train N=i,i=2,4,6,8)结果较为吻合,且在N=2,4,6,8下,均方根误差(RMSE)分别是0.0034、0.0072、0.011和0.015,这验证了本发明方法在不同网络规模上的泛化能力。
图9显示了在线性拓扑下,中间节点数量N=1的情况下,本发明在不同链路质量上的泛化能力。在图4的源S和中间节点R1间的链路的包的错误率PER S-R1=0.3,且中间节点R1与目标节点D间的链路的包的错误率PER R1-D=0.3的情况下,为本发明实例训练一个DQN模型,定义为
Figure PCTCN2021118099-appb-000041
然后我们使用训练好的DQN模型来测试在其他链路质量下(PER S-R1=0,PER R1-D=0),(PER S-R1=0.1,PER R1-D=0.3),(PER S-R1=0.1,PER R1-D=0.5)的解码率,我们将这些测试结果定义为
Figure PCTCN2021118099-appb-000042
并用
Figure PCTCN2021118099-appb-000043
代表在链路质量为PER S-R1=w,PER R1-D=y下训练的DQN模型在链路质量PER S-R1=u,PER R1-D=v下进行测试的测试结果,在图中
Figure PCTCN2021118099-appb-000044
的标注为
Figure PCTCN2021118099-appb-000045
最后,将这些结果与在相同链路质量下的训练和测试结果
Figure PCTCN2021118099-appb-000046
进行比较。可以看出,在本发明事例中,
Figure PCTCN2021118099-appb-000047
结果与
Figure PCTCN2021118099-appb-000048
结果较为吻合,且在链路质量(PER S-R1=0,PER R1-D=0),(PER S-R1=0.1,PER R1-D=0.3),(PER S-R1=0.1,PER R1-D=0.5)下,均方根误差(Root Mean Square Error,RMSE)分别是0、0.002和0.003,这验证了本发明方法在不同链路质量上的泛化能力。
最后,在真实的试验平台上评估本发明实例的性能,配置本发明的源节点编码系数优化单元、中间节点编码系数优化单元、智能网络编码单元和数据包转发单元,并使用Raspberry Pi 3 B+型进行实验。Raspberry Pi 3B+具有1.4GHz ARM A53处理器,1GB RAM以及集成的无线和蓝牙功能。我们利用TensorFlow Lite将本发明实例已训练好的DQN模型部署到Raspberry Pi 3B+。本实验中,将本发明实例与传统的基准编码算法和现有的基于强化学习编码算法(RL-aided SNC:Dynamic Sparse Coded Multi-Hop Transmissions using Reinforcement Learning)的编码算法进行比较。在基准编码算法中,源节点使用传统的喷泉码,同时中间节点使用随机网络编码算法。同时我们比较了在仿真环境下得到的解码结果和在真实的试验平台上的解码结果。
图10显示了在多跳线性拓扑下,每条链路的包的错误率等于0.1,K=5的情况下,本发明实例与传统的基准编码算法和现有的基于强化学习编码算法RL-aided SNC的解码率比较。可以看出,在相同的中间节点数量下,本发明的解码率效率高。此外,可以看到仿真结果与真实试验平台得到的结果较为一致。在仿真环境和真实试验平台下,三种编码算法的解码结果的均方根误差分别是0.0042、0.0153、0.0379。
本实例的实验结果说明了本发明所述的基于深度强化学习的智能网络编码方法较现有编码方法有更高的解码率和更低的开销。
需要说明的是,上述实施例中介绍的各个步骤并非都是必须的,本领域技术人员可以根据实际需要进行适当的取舍、替换、修改等。
最后所应说明的是,以上实施例仅用以说明本发明的技术方案而非限制。尽管上文参照实施例对本发明进行了详细说明,本领域的普通技术人员应当理解,对本发明的技术方案进行修改或者等同替换,都不脱离本发明技术方 案的精神和范围,其均应涵盖在本发明的权利要求范围当中。

Claims (12)

  1. 一种网络编码方法,所述网络包括源节点和中间节点,所述方法包括:
    源节点将要发送的信息划分成K个片x 1,x 2,…,x K,K为大于1的整数,根据源节点编码模型确定每个片的编码系数g(x 1),g(x 2),...,g(x K),将K个片编码,生成编码包P S,并向下一跳节点发送编码包P S,其中所述源节点编码模型通过对DQN网络训练得到,其中使用各步环境状态
    Figure PCTCN2021118099-appb-100001
    作为训练输入,ss k为第k步的环境状态,x k为信息包的第k个片,
    Figure PCTCN2021118099-appb-100002
    为该源节点的下一跳中间节点的缓冲区里所存储的近期收到的M个编码包,M为大于1的整数;
    中间节点接收前一节点发送的编码包,将收到的编码包P j编码M次,根据中间节点编码模型确定每次的编码系数g(P j(1)),g(P j(2)),…g(P j(M)),生成编码包P new,并向下一跳节点发送编码包P new,其中所述中间节点编码模型通过对DQN网络训练得到,其中使用各步环境状态
    Figure PCTCN2021118099-appb-100003
    作为训练输入,s k为第k步的环境状态,P new为当前编码包,P j(k)为该中间节点缓冲区中的第k个编码包,
    Figure PCTCN2021118099-appb-100004
    为该中间节点下一跳节点z的缓冲区里所存储的近期收到的M个编码包。
  2. 根据权利要求1所述的方法,其中所述源节点编码模型包括目标网络N s和执行网络N snode,所述源节点编码模型的训练包括步骤:
    步骤110:从经验回放存储器M s中随机采样经验来训练N s
    步骤120:将N s训练后的DQN参数发给源节点,以对N snode进行更新;和/或
    步骤130:在源节点上,将环境状态ss k作为N snode的DQN模型的输入,输出每个行为对应的Q值,以贪心策略概率ε选择行为来决定原始信息的K个片的编码系数,执行后,收集源节点与环境交互的经验,并将该经验存储到经验回放存储器M s中。
  3. 根据权利要求1所述的方法,其中中间节点编码模型包括目标网络N R和执行网络N Rnode,所述中间节点编码模型的训练包括:
    步骤210:在经验回放存储器M R中随机采样经验来训练N R
    步骤220:将N R训练后的DQN参数发给各中间节点,以对N Rnode进行更新;和/或
    步骤230:在各中间节点上,将环境状态s k作为N Rnode的DQN模型的输入,输出每个行为对应的Q值,以贪心策略概率ε选择行为来决定中间节点缓冲区的M个包的编码系数,执行后,收集中间节点与环境交互的经验,并将该经验存储到经验回放存储器M R中。
  4. 根据权利要求2所述的方法,其中对N s进行训练包括:
    将网络编码的环境状态ss k做为N s的输入,通过最小化损失函数
    Figure PCTCN2021118099-appb-100005
    对神经网络进行训练,k取值为1…K,其中Q target为N s计算的目标Q值;
    a k表示第k步的行为;
    r k表示第k步采取行为后的奖励;
    θ R表示第k步的DQN的网络参数。
  5. 根据权利要求3所述的方法,其中对N R进行训练包括:
    将网络编码的环境状态s k做为N R的输入,通过最小化损失函数
    Figure PCTCN2021118099-appb-100006
    对神经网络进行训练,k取值为1…M,其中
    Q target为N R计算的目标Q值;
    a k表示第k步的行为;
    r k表示第k步采取行为后的奖励;
    θ k表示第k步的DQN的网络参数。
  6. 根据权利要求4所述的方法,其中对于N s
    a k为信息的第k个片x k的编码系数,a k∈A S,其中,A S={0,1,...,(q-1)},q是伽罗华域的域值大小;
    当该源节点发送的编码包能够使得由该源节点的下一跳中间节点缓冲区里的编码包所形成的线性系统的秩增加时,r k为1,否则,r k为0。
  7. 根据权利要求5所述的方法,其中对于N R
    a k为第k个包的编码系数,a k∈A R,其中A R={0,1,...,(q-1)},q是伽罗华域 的域值大小;
    当该中间节点发送的编码包能够使得由该中间节点的下一跳节点缓冲区里的编码包所形成的线性系统的秩增加时,r k为1,否则,r k为0。
  8. 根据权利要求1所述的方法,其中,如果源节点没有收到ACK,源节点的ss k
    Figure PCTCN2021118099-appb-100007
    不变;如果中间节点没有收到ACK,该中间节点的s k
    Figure PCTCN2021118099-appb-100008
    不变。
  9. 根据权利要求1所述的方法,其中源节点通过以下方式生成编码包P S
    P S=G S·X,其中,X=[x 1,x 2,...,x K],G S=[g(x 1),g(x 2),...,g(x K)]。
  10. 根据权利要求1所述的方法,其中,中间节点的M次编码中的第k次编码包括:
    k=1时,
    Figure PCTCN2021118099-appb-100009
    k>1时,
    Figure PCTCN2021118099-appb-100010
    P j(k)为该中间节点的缓冲区中的第k个编码包,k取值为1…M。
  11. 一种计算机可读存储介质,其中存储有一个或者多个计算机程序,所述计算机程序在被执行时用于实现如权利要求1-10任意一项所述的方法。
  12. 一种网络编码的计算系统,包括
    存储装置、以及一个或者多个处理器;
    其中,所述存储装置用于存储一个或者多个计算机程序,所述计算机程序在被所述处理器执行时用于实现如权利要求1-10任意一项所述的方法。
PCT/CN2021/118099 2020-11-26 2021-09-14 一种基于深度强化学习的智能网络编码方法和设备 WO2022110980A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011344089.5A CN112564712B (zh) 2020-11-26 2020-11-26 一种基于深度强化学习的智能网络编码方法和设备
CN202011344089.5 2020-11-26

Publications (1)

Publication Number Publication Date
WO2022110980A1 true WO2022110980A1 (zh) 2022-06-02

Family

ID=75045041

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/118099 WO2022110980A1 (zh) 2020-11-26 2021-09-14 一种基于深度强化学习的智能网络编码方法和设备

Country Status (2)

Country Link
CN (1) CN112564712B (zh)
WO (1) WO2022110980A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112564712B (zh) * 2020-11-26 2023-10-10 中国科学院计算技术研究所 一种基于深度强化学习的智能网络编码方法和设备
CN116074891A (zh) * 2021-10-29 2023-05-05 华为技术有限公司 通信方法及相关装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140064296A1 (en) * 2012-02-15 2014-03-06 Massachusetts Institute Of Technology Method And Apparatus For Performing Finite Memory Network Coding In An Arbitrary Network
US20160359770A1 (en) * 2015-06-03 2016-12-08 Steinwurf ApS Composite Extension Finite Fields For Low Overhead Network Coding
CN111770546A (zh) * 2020-06-28 2020-10-13 江西理工大学 一种基于q学习的容迟网络随机网络编码策略
CN112564712A (zh) * 2020-11-26 2021-03-26 中国科学院计算技术研究所 一种基于深度强化学习的智能网络编码方法和设备

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102209079A (zh) * 2011-06-22 2011-10-05 北京大学深圳研究生院 一种基于tcp协议的自适应网络控制传输方法和系统
CN104079483B (zh) * 2013-03-29 2017-12-29 南京邮电大学 一种容迟网络中基于网络编码的多阶段安全路由方法
CN110113131B (zh) * 2019-04-24 2021-06-15 香港中文大学(深圳) 一种基于批编码的网络通信方法及系统
CN110519020B (zh) * 2019-08-13 2020-09-11 中国科学院计算技术研究所 无人系统网络智能跨层数据传输方法及系统

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140064296A1 (en) * 2012-02-15 2014-03-06 Massachusetts Institute Of Technology Method And Apparatus For Performing Finite Memory Network Coding In An Arbitrary Network
US20160359770A1 (en) * 2015-06-03 2016-12-08 Steinwurf ApS Composite Extension Finite Fields For Low Overhead Network Coding
CN111770546A (zh) * 2020-06-28 2020-10-13 江西理工大学 一种基于q学习的容迟网络随机网络编码策略
CN112564712A (zh) * 2020-11-26 2021-03-26 中国科学院计算技术研究所 一种基于深度强化学习的智能网络编码方法和设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI DAN, GUANG XUAN, ZHOU ZHIHENG, LI CONGDUAN, TAN CHEE WEI: "Hierarchical Performance Analysis on Random Linear Network Coding", IEEE TRANSACTIONS ON COMMUNICATIONS, vol. 66, no. 5, 1 May 2018 (2018-05-01), PISCATAWAY, NJ. USA. , pages 2009 - 2021, XP055933863, ISSN: 0090-6778, DOI: 10.1109/TCOMM.2017.2787991 *

Also Published As

Publication number Publication date
CN112564712A (zh) 2021-03-26
CN112564712B (zh) 2023-10-10

Similar Documents

Publication Publication Date Title
WO2022110980A1 (zh) 一种基于深度强化学习的智能网络编码方法和设备
Yao et al. Machine learning aided load balance routing scheme considering queue utilization
KR101751497B1 (ko) 행렬 네트워크 코딩을 사용하는 장치 및 방법
CN103650399B (zh) 纠正数据单元的自适应生成
JP2016201794A (ja) 故障検出装置、方法及びシステム
CN110190926A (zh) 基于网络计算的纠删码修复方法、纠删码更新方法及系统
JP7451689B2 (ja) ネットワーク輻輳処理方法、モデル更新方法、および関連装置
WO2023155481A1 (zh) 无线通信网络知识图谱的智能分析与应用系统及方法
CN112751644A (zh) 数据传输方法、装置及系统、电子设备
Wang et al. INCdeep: Intelligent network coding with deep reinforcement learning
CN115278811A (zh) 一种基于决策树模型的mptcp连接路径选择方法
US9876608B2 (en) Encoding apparatus and encoding method
Valerio et al. A reinforcement learning-based data-link protocol for underwater acoustic communications
Thouin et al. Large scale probabilistic available bandwidth estimation
Kontos et al. A topology inference algorithm for wireless sensor networks
CN109039531A (zh) 一种基于机器学习调整lt码编码长度的方法
CN117581493A (zh) 链路适配
Yu et al. DRL-based fountain codes for concurrent multipath transfer in 6G networks
Wu et al. On-demand Intelligent Routing Algorithms for the Deterministic Networks
CN113507738A (zh) 一种移动自组网路由决策方法
TWI833065B (zh) 網路優化器及其網路優化方法
Mehta et al. Adaptive Cross-Layer Optimization Using Mimo Fuzzy Control System in Ad-hoc Networks.
Visca et al. A model for route learning in opportunistic networks
Du et al. Learning-Based Congestion Control Assisted by Recurrent Neural Networks for Real-Time Communication
Wei et al. G-Routing: Graph Neural Networks-Based Flexible Online Routing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896484

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21896484

Country of ref document: EP

Kind code of ref document: A1