CN112564712B - Intelligent network coding method and equipment based on deep reinforcement learning - Google Patents

Intelligent network coding method and equipment based on deep reinforcement learning

Info

Publication number
CN112564712B
CN112564712B (application CN202011344089.5A)
Authority
CN
China
Prior art keywords
node
coding
network
intermediate node
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011344089.5A
Other languages
Chinese (zh)
Other versions
CN112564712A (en)
Inventor
王琪
刘建敏
徐勇军
王永庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202011344089.5A priority Critical patent/CN112564712B/en
Publication of CN112564712A publication Critical patent/CN112564712A/en
Priority to PCT/CN2021/118099 priority patent/WO2022110980A1/en
Application granted granted Critical
Publication of CN112564712B publication Critical patent/CN112564712B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40 Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a network coding method based on deep reinforcement learning, which comprises the following steps: a source node divides the information to be transmitted into K pieces, determines the coding coefficient of each piece according to a source node coding model, and generates and transmits an encoded packet to the next-hop node; an intermediate node receives the encoded packets sent by the previous node, re-encodes the received encoded packets, determines the coding coefficients according to an intermediate node coding model, and generates and transmits an encoded packet to the next-hop node, wherein the source node coding model and the intermediate node coding model are obtained by training a DQN network. The invention can adaptively adjust the coding coefficients according to the dynamic changes of the network and improves the decoding efficiency; it has good model generalization capability and can generalize to networks of different scales and different link qualities; and by executing the coding coefficient optimization models of the source node and the intermediate nodes in a distributed manner on those nodes, it simplifies the implementation of coding coefficient optimization and improves the stability of DQN training.

Description

Intelligent network coding method and equipment based on deep reinforcement learning
Technical Field
The invention relates to the technical field of information, in particular to a network coding method.
Background
Linear network coding is a type of network coding in which data is linearly combined by coding coefficients selected from a finite field. Linear network coding has lower complexity and simpler model than nonlinear network coding using nonlinear combining functions, and thus has been intensively studied and widely used.
The basic idea of linear network coding is that nodes in the network linearly encode original data by selecting coding coefficients from a finite field to form new encoded data and forward it, and the receiving node can recover the original data by corresponding decoding operations. Linear network coding methods mainly comprise deterministic network coding algorithms and random linear network coding algorithms. Deterministic network coding algorithms can guarantee that the target node decodes successfully, but they need global information such as the network topology and link capacities. In reality there is a wide variety of topologies, and it is impractical to design a specific coding method for each type of network. Furthermore, deterministic coding is not suitable for dynamic networks, because collecting global information from distributed nodes in real time is very complex and cannot be applied at large scale. In random linear network coding, nodes linearly combine the data to be transmitted using coding coefficients selected independently and at random from a certain finite field. Related studies have shown that, as long as the finite field is large enough, random linear network coding ensures that each receiving node can finish decoding with high probability, i.e. the global coding coefficient matrix corresponding to the receiving node is full rank. Since the main feature of random linear network coding is the random selection of the coefficients of the linear combinations, it is suitable for networks of unknown or varying topology and can easily be implemented in a distributed manner. For example, a node with encoding capability that has three packets X, Y, Z to transmit can randomly choose coding coefficients a_1, a_2, a_3, b_1, b_2, b_3, c_1, c_2, c_3, combine the packets into a_1·X + a_2·Y + a_3·Z, b_1·X + b_2·Y + b_3·Z, c_1·X + c_2·Y + c_3·Z using these coefficients, and send the combinations out. After receiving the 3 coded combinations, the receiving node can solve for the original packets X, Y, Z by linear operations whenever the coefficient matrix [[a_1, a_2, a_3], [b_1, b_2, b_3], [c_1, c_2, c_3]] is invertible (full rank).
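As an illustration of this combining-and-decoding idea, the following is a minimal Python sketch assuming coding over GF(2), so that coefficients are 0/1 and linear combination reduces to a bytewise XOR; the packet contents and field size are illustrative and not taken from the patent:

    import random

    def combine(packets, coeffs):
        """Linear combination over GF(2): XOR together the packets whose coefficient is 1."""
        out = bytearray(len(packets[0]))
        for c, p in zip(coeffs, packets):
            if c:
                out = bytearray(a ^ b for a, b in zip(out, p))
        return bytes(out)

    def rank_gf2(rows):
        """Rank of a 0/1 coefficient matrix over GF(2) by Gaussian elimination."""
        m = [list(r) for r in rows]
        rank = 0
        for col in range(len(m[0])):
            pivot = next((r for r in range(rank, len(m)) if m[r][col]), None)
            if pivot is None:
                continue
            m[rank], m[pivot] = m[pivot], m[rank]
            for r in range(len(m)):
                if r != rank and m[r][col]:
                    m[r] = [a ^ b for a, b in zip(m[r], m[rank])]
            rank += 1
        return rank

    X, Y, Z = b"pktX", b"pktY", b"pktZ"
    coeff_matrix = [[random.randint(0, 1) for _ in range(3)] for _ in range(3)]
    coded = [combine([X, Y, Z], row) for row in coeff_matrix]
    # The receiver can recover X, Y, Z by Gaussian elimination iff the matrix is full rank.
    print("decodable:", rank_gf2(coeff_matrix) == 3)

Over a larger Galois field the XOR would be replaced by finite-field multiplication and addition, but the full-rank condition for decodability is the same.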
Decoding failure may be caused by various factors: not only by linearly dependent coefficients drawn at the intermediate nodes, but also by packet loss caused by network instability, so that some packets needed for decoding never arrive. In random linear network coding, the coefficients are drawn from a Galois field with equal probability. This coding method therefore cannot adjust the coding coefficients according to the dynamic changes of the network (including changes of the network link quality and of the number of intermediate nodes), which leads to low decoding efficiency.
Disclosure of Invention
The present invention addresses the above-mentioned problems by providing, according to a first aspect of the present invention, a network coding method, the network including a source node and an intermediate node, the method comprising:
the source node divides the information to be transmitted into K pieces x_1, x_2, …, x_K, K being an integer greater than 1, determines the coding coefficients g(x_1), g(x_2), …, g(x_K) according to a source node coding model, encodes the K pieces to generate an encoded packet P_S, and transmits the encoded packet P_S to the next-hop node, wherein the source node coding model is obtained by training a DQN network, wherein the environmental state of each step, ss_k = (x_k, P_S(1), …, P_S(M)), is used as the training input; ss_k is the environmental state of the k-th step, x_k is the k-th piece of the packet, and P_S(1), …, P_S(M) are the M most recently received encoded packets stored in the buffer of the source node's next-hop intermediate node, M being an integer greater than 1;
the intermediate node receives the coded packet sent by the previous node and sends the received coded packet P j Coding M times, determining coding coefficient g (P j (1)),g(P j (2)),…g(P j (M)) to generate the encoded packet P new And transmits the encoded packet P to the next hop node new Wherein the intermediate node coding model is obtained by training a DQN network, wherein each step of environmental state is usedS as training input k P being the environmental state of the kth step new For the current encoded packet, P j (k) For the kth encoded packet in the intermediate node buffer +.>The recently received M encoded packets stored in the buffer for the intermediate node next hop node z.
In one embodiment of the present invention, the source node coding model includes a target network N_s and an execution network N_snode, and the training of the source node coding model comprises the following steps:
Step 110: training N_s by randomly sampling experience from the experience replay memory M_s;
Step 120: sending the trained DQN parameters of N_s to the source node to update N_snode; and/or
Step 130: at the source node, taking the environmental state ss_k as the input of the DQN model of N_snode, outputting the Q value corresponding to each behavior, selecting a behavior with greedy-strategy probability ε to determine the coding coefficients of the K slices of the original information, collecting the experience of the source node interacting with the environment after execution, and storing the experience in the experience replay memory M_s.
In one embodiment of the invention, the intermediate node coding model comprises a target network N_R and an execution network N_Rnode, and the training of the intermediate node coding model includes:
Step 210: training N_R by randomly sampling experience from the experience replay memory M_R;
Step 220: sending the trained DQN parameters of N_R to each intermediate node to update N_Rnode; and/or
Step 230: at each intermediate node, taking the environmental state s_k as the input of the DQN model of N_Rnode, outputting the Q value corresponding to each behavior, selecting behaviors with greedy-strategy probability ε to determine the coding coefficients of the M packets in the intermediate node buffer, collecting the experience of the intermediate node interacting with the environment after execution, and storing the experience in the experience replay memory M_R.
In one embodiment of the invention, the training of N_s includes:
taking the coding network environmental state ss_k as the input of N_s and training the neural network by minimizing the loss function L(θ_k) = (Q_target − Q(ss_k, a_k; θ_k))², where k takes values 1 … K and Q_target is the target Q value calculated by N_s;
a_k represents the behavior of the k-th step;
r_k represents the reward after taking the action in the k-th step;
θ_k represents the network parameters of the DQN at step k.
In one embodiment of the present invention, the training of N_R includes:
taking the coding network environmental state s_k as the input of N_R and training the neural network by minimizing the loss function L(θ_k) = (Q_target − Q(s_k, a_k; θ_k))², where k takes values 1 … M;
Q_target is the target Q value calculated by N_R;
a_k represents the behavior of the k-th step;
r_k represents the reward after taking the action in the k-th step;
θ_k represents the network parameters of the DQN at step k.
In one embodiment of the present invention, for N_s:
a_k is the coding coefficient of the k-th slice x_k of the information, a_k ∈ A_S, where A_S = {0, 1, …, (q−1)} and q is the field size of the Galois field;
r_k is 1 when the encoded packet sent by the source node can increase the rank of the linear system formed by the encoded packets in the source node's next-hop intermediate node buffer; otherwise, r_k is 0.
In one embodiment of the invention, for N_R:
a_k is the coding coefficient of the k-th packet, a_k ∈ A_R, where A_R = {0, 1, …, (q−1)} and q is the field size of the Galois field;
r_k is 1 when the encoded packet sent by the intermediate node increases the rank of the linear system formed by the encoded packets in the intermediate node's next-hop node buffer; otherwise, r_k is 0.
In one embodiment of the invention, if the source node does not receive an ACK, the buffered-packet component P_S(1), …, P_S(M) of the source node's ss_k remains unchanged; if the intermediate node does not receive an ACK, the buffered-packet component P_i(1), …, P_i(M) of the intermediate node's s_k remains unchanged.
In one embodiment of the present invention, the source node generates the encoded packet P_S by:
P_S = G_S · X, where X = [x_1, x_2, …, x_K] and G_S = [g(x_1), g(x_2), …, g(x_K)].
In one embodiment of the present invention, the k-th of the M encodings of the intermediate node comprises:
when k = 1, P_new = P_j ⊕ (g(P_j(k)) · P_j(k)),
when k > 1, P_new = P_new ⊕ (g(P_j(k)) · P_j(k)),
where P_j(k) is the k-th encoded packet in the buffer of the intermediate node and k takes values 1 … M.
According to a second aspect of the present invention there is provided a computer readable storage medium having stored therein one or more computer programs which when executed are adapted to carry out the network coding method of the present invention.
According to a third aspect of the present invention there is provided a network coded computing system comprising a storage device and one or more processors; wherein the storage device is configured to store one or more computer programs which, when executed by the processor, implement the network coding method of the invention.
Compared with the prior art, the invention has the following advantages:
1. compared with the prior art, the method can adaptively adjust the coding coefficient according to the dynamic change of the network (including the change of the quality of network links and the number of intermediate nodes) so as to adapt to the network environment with high dynamic change and improve the decoding efficiency.
2. The present invention uses a Markov Decision Process (MDP) to formulate a coding coefficient optimization problem, wherein network changes can be automatically and continuously represented as MDP state transitions. In addition, the invention has good model generalization capability and can generalize the network under different network scales and different link qualities, so that the invention can adapt to the dynamic change of the network.
3. The invention realizes a distributed coding coefficient optimization mechanism: the coding coefficient optimization model networks based on a Deep Q-learning Network (DQN) are trained centrally by preset optimizers, while the DQN-based coding coefficient optimization models of the source node and the intermediate nodes are executed in a distributed manner on the source node and the intermediate nodes respectively, thereby simplifying the implementation of coding coefficient optimization and improving the stability of DQN training.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 illustrates a source node network encoding flow diagram according to an embodiment of the invention;
FIG. 2 illustrates an intermediate node network encoding flow diagram according to an embodiment of the invention;
FIG. 3 illustrates a functional configuration block diagram of an apparatus for deep reinforcement learning intelligent network coding in accordance with an embodiment of the present invention;
FIG. 4 illustrates a multi-hop linear network topology according to an embodiment of the invention;
FIG. 5 illustrates a multi-hop parallel network topology according to an embodiment of the invention;
FIG. 6 shows a simulation experiment result diagram of the multi-hop linear network according to an embodiment of the present invention;
FIG. 7 shows a diagram of simulation experiment results of a multi-hop parallel network according to an embodiment of the present invention;
FIG. 8 is a graph showing simulation experiment results of generalization capability on different network scales according to an embodiment of the present invention;
fig. 9 is a diagram showing simulation experiment results of generalization capability on different link qualities according to an embodiment of the present invention;
FIG. 10 shows a comparison of the simulation results of three methods (the present invention, the reference coding algorithm, and RL-aided SNC) with the results from a real experimental platform.
Detailed Description
The present invention provides a network coding method based on deep reinforcement learning, and the method is described in detail below with reference to the accompanying drawings and specific embodiments.
In general terms, in the present invention, a network includes a source node, an intermediate node, and a destination node that receives information. The information is generated in the source node, sent out by the source node, passes through the intermediate node and is finally received by the destination node. The source node divides the information into a plurality of slices, determines the coding coefficient of each slice, codes the slices, generates a coded packet, and transmits the coded packet to the next hop node. The intermediate node receives the encoded packets, determines the encoding coefficient of each packet for the received encoded packets, encodes the plurality of encoded packets again, generates a new encoded packet, and transmits the new encoded packet to the next hop node.
The invention adopts the deep reinforcement learning method DQN to determine the coding coefficients. The DQN model comprises a number of steps and a number of environmental states, and in each environmental state several behaviors can be taken, each behavior corresponding to a different reward. In the present invention, each step corresponds to determining the coding coefficient of one slice or one packet; the behavior is the chosen coding coefficient, and the environmental state is the relevant slice or packets. The DQN evaluates each behavior by its Q value: among the behaviors available in an environmental state, the behavior that maximizes the Q value is the best one, that is, the one that should be taken in that state. DQN seeks the best overall solution, so the best behavior is evaluated over the whole sequence of behaviors, i.e. it is the behavior that maximizes the cumulative reward over all steps in the current environmental situation.
The calculation of the Q value is based on the reward, using the formula Q_k = r_k + γ·max Q_{k+1}, where k is a positive integer. The Q value Q_k of the k-th step depends on the Q values of step k+1, specifically on the maximum, max Q_{k+1}, over the Q values of all behaviors of step k+1; γ is the discount factor, 0 < γ ≤ 1; r_k is the reward of the k-th step; and the Q value of the last step equals its reward.
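A minimal numeric illustration of this recursion (the reward and Q values below are made up for the example):

    GAMMA = 0.9                                  # discount factor, 0 < gamma <= 1

    def q_value(reward_k, next_step_q_values):
        """Q_k = r_k + gamma * max(Q_{k+1}); the Q value of the last step equals its reward."""
        if not next_step_q_values:               # last decision step
            return reward_k
        return reward_k + GAMMA * max(next_step_q_values)

    print(q_value(1, [0.2, 0.7]))                # 1 + 0.9 * 0.7 = 1.63
    print(q_value(0, []))                        # last step: 0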
The DQN trains a neural network so that it can calculate the Q value of each behavior for each environmental state. The training method of DQN is to collect inputs and outputs from the real environment, where the input is an environmental state and the output is the Q value of a behavior; the Q value is calculated after the environmental state is fed into a convolutional neural network (CNN), and the error between the calculated target Q value and the actual Q value is expressed by a loss function, which is then minimized. In actual execution, although the behavior with the largest Q value is the best behavior, new behaviors are also tried in order to balance learning and exploration; for example, an ε-greedy strategy is adopted, i.e. an unexplored behavior is selected with a small probability ε (ε < 1), and the behavior with the largest Q value learned so far is selected with probability 1 − ε.
The existing DQN further comprises a sample replay buffer (Replay Buffer), also known as experience replay (Experience Replay), and a Target Network. In order to alleviate the influence of these problems, the training and execution parts are decoupled as far as possible: the invention introduces a new network, still named the Target Network, while the original target network is renamed the execution network (Behavior Network).
At the beginning of training, both network models use exactly the same parameters. During execution, the Behavior Network is responsible for interacting with the environment to obtain interaction samples. During training, the target Q value of Q-learning is calculated by the Target Network; it is then compared with the Q value obtained by the Behavior Network in its interaction with the environment to obtain an error, the Target Network is trained by reducing this error, its model is continuously updated, and the updated model is synchronized to the Behavior Network to update the Behavior Network's model.
Each time training has run for a certain number of iterations, the experience of the Behavior Network is synchronized to the Target Network so that the next stage of training can be carried out. By using the Target Network, the model that calculates the Q value stays fixed for a period of time, which alleviates fluctuations of the model.
The Target Network of the present invention comprises two neural networks, N_s and N_R. N_s is for the source node and is trained by a preset optimizer O_s; N_R is for all intermediate nodes and is trained by a preset optimizer O_R. O_s and O_R each have a memory for storing experience, including the environmental state, behavior and reward of each step; the memory of O_s is M_s and the memory of O_R is M_R. The Behavior Network of the present invention comprises a neural network N_snode deployed on the source node and a set of neural networks N_Rnode deployed on each of the intermediate nodes. N_snode is a copy of N_s and N_Rnode is a copy of N_R. N_snode and N_Rnode are not trained; on their nodes they only take the environmental state as input and output the Q value corresponding to each behavior.
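As a concrete illustration, the target networks and their node-side copies could be set up as in the following sketch using tf.keras from TensorFlow 1.15 (the layer sizes, the state dimension and the action-space size are assumptions made for the example; the patent only requires a neural network that maps an environmental state to one Q value per behavior):

    import tensorflow as tf

    STATE_DIM = 11      # e.g. one slice/packet plus M = 10 buffered packets (assumed featurization)
    NUM_ACTIONS = 2     # q = 2: candidate coding coefficients {0, 1} (illustrative)

    def build_q_network():
        """A small fully connected network mapping a state vector to a Q value for every behavior."""
        return tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(NUM_ACTIONS),
        ])

    N_s = build_q_network()        # target network for the source node, trained by optimizer O_s
    N_snode = build_q_network()    # execution (Behavior) network deployed on the source node
    N_snode.set_weights(N_s.get_weights())   # N_snode is a copy of N_s and is never trained itself
    # N_R and N_Rnode for the intermediate nodes are built in exactly the same way.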
The network coding method based on deep reinforcement learning comprises two parts: a centralized training process and a distributed execution process. In the centralized training process, the DQN-based coding coefficient optimization model networks are trained in the preset optimizers. In the distributed execution process, the DQN-based coding coefficient optimization models of the source node and the intermediate nodes are executed in a distributed manner on the source node and the intermediate nodes respectively, and the experience generated by execution is sent back to the optimizers for training; executing and training at the same time speeds up DQN training.
(1) In the centralized training process, the source node optimizer O_s trains the source node network N_s by randomly sampling experience from the experience replay memory M_s. The input of N_s is the source node environmental state ss_k (the details of the source node environmental state are described later); N_s is trained by minimizing a preset loss function, and its output is the optimal cumulative reward value Q_k obtained after selecting behavior a_k in environmental state ss_k. The loss function is L(θ_k) = (Q_target − Q(ss_k, a_k; θ_k))², where Q_target is the target Q value calculated by N_s, Q(ss_k, a_k; θ_k) is the Q value, known from experience, obtained after selecting behavior a_k in environmental state ss_k, and θ_k represents the network parameters of the DQN at the current decision step k.
Likewise, the intermediate node optimizer O_R trains the intermediate node network N_R by randomly sampling experience from the experience replay memory M_R. The input of N_R is the intermediate node environmental state s_k (the details of the intermediate node environmental state are described later); N_R is trained by minimizing a preset loss function, and its output is the optimal cumulative reward value Q_k obtained after selecting behavior a_k in environmental state s_k. The loss function is L(θ_k) = (Q_target − Q(s_k, a_k; θ_k))², where Q_target is the target Q value calculated by N_R, Q(s_k, a_k; θ_k) is the Q value, known from experience, obtained after selecting behavior a_k in environmental state s_k, and θ_k represents the network parameters of the DQN at the current decision step k.
Once the parameters of the DQN are updated, the centralized optimizers O_s and O_R send the updated DQN parameters to the source node and to each intermediate node in the network. The source node and the intermediate nodes use the received DQN parameters to update the DQN parameters of the neural networks N_snode and N_Rnode deployed on them.
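A minimal sketch of one centralized update step is given below; it assumes the target network is a compiled tf.keras model with a mean-squared-error loss (as in the architecture sketch above), and the batch size and discount factor are illustrative choices not fixed by the patent:

    import random
    import numpy as np

    GAMMA, BATCH_SIZE = 0.9, 32    # illustrative hyper-parameters

    def train_step(target_net, replay_memory):
        """One DQN update in the optimizer from randomly sampled experience tuples
        (state, action, reward, next_state) stored by the execution networks."""
        batch = random.sample(replay_memory, BATCH_SIZE)
        states = np.array([e[0] for e in batch], dtype=np.float32)
        actions = np.array([e[1] for e in batch], dtype=np.int32)
        rewards = np.array([e[2] for e in batch], dtype=np.float32)
        next_states = np.array([e[3] for e in batch], dtype=np.float32)

        # Q_target = r_k + gamma * max_a Q(next_state, a)
        q_target = rewards + GAMMA * target_net.predict(next_states).max(axis=1)

        # Minimizing the MSE between q_target and Q(state, action) realizes the loss in the text.
        q_values = target_net.predict(states)
        q_values[np.arange(BATCH_SIZE), actions] = q_target
        return target_net.train_on_batch(states, q_values)

    # After the update, the new parameters are pushed to the node-side copies, e.g.
    # N_snode.set_weights(N_s.get_weights()) and N_Rnode.set_weights(N_R.get_weights()).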
(2) In the distributed execution process, for the source node, according to the observed current environmental state ss_k, ss_k is taken as the input of the DQN model of N_snode on the source node, the Q value corresponding to each behavior is output, and one behavior is selected with greedy-strategy probability ε (e.g. ε = 0.1) to determine the coding coefficient of the k-th slice of the original information. After a behavior a_k is executed, the source node obtains a reward value r_k; the optimizer O_s collects the experience (ss_k, a_k, r_k, ss_{k+1}) of the source node interacting with the environment and stores it in the experience replay memory M_s. For intermediate node i, according to its observed environmental state s_k, s_k is taken as the input of the DQN model of N_Rnode on the intermediate node, the Q value corresponding to each behavior is output, and one behavior is selected with greedy-strategy probability ε (e.g. ε = 0.1) to determine the coding coefficient of the k-th packet in the intermediate node buffer. After a behavior a_k is executed, the intermediate node obtains a reward value r_k; the optimizer O_R collects the experience (s_k, a_k, r_k, s_{k+1}) of the intermediate node interacting with the environment and stores it in the experience replay memory M_R.
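The node-side execution can be sketched as follows (a minimal illustration; the env object standing for the abstract node/next-hop-network environment and its step method are hypothetical placeholders, and the Q-network is the node-side copy described above):

    import random
    import numpy as np

    EPSILON = 0.1     # greedy-strategy probability, as in the text

    def select_action(q_network, state, num_actions):
        """Epsilon-greedy selection: explore with probability EPSILON, otherwise pick argmax Q."""
        if random.random() < EPSILON:
            return random.randrange(num_actions)
        q_values = q_network.predict(np.array([state], dtype=np.float32))[0]
        return int(np.argmax(q_values))

    def run_decision_step(q_network, state, num_actions, env):
        """Choose the coding coefficient for the current slice/packet and record the experience."""
        action = select_action(q_network, state, num_actions)
        next_state, reward = env.step(action)            # hypothetical environment interface
        return (state, action, reward, next_state)       # sent back to the optimizer's replay memory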
The following describes specific methods of source node and intermediate node encoding and their corresponding environmental states, behaviors, and rewards in connection with embodiments of the invention.
Encoding of source nodes and their corresponding environmental status, behavior and rewards
FIG. 1 shows the encoding process of a source node based on deep reinforcement learning: the information X (X = [x_1, x_2, …, x_K]) is divided into K slices, K being an integer greater than 1. The coding coefficient optimization process of the K slices is regarded as a Markov decision process (MDP) comprising K decision steps; in the k-th (k = 1, 2, …, K) decision step, the coding coefficient of the k-th slice x_k is determined.
specifically, two major modules of a deep reinforcement learning agent and a network environment in a source node coding coefficient optimization model based on deep reinforcement learning are designed as follows:
(1) The source node is regarded as a deep reinforcement learning agent;
(2) An abstract environment is a network formed by a source node and all next-hop intermediate nodes of the source node, including the source node, all next-hop intermediate nodes of the source node, and links formed by the source node and all next-hop intermediate nodes of the source node.
(3) The deep reinforcement learning agent observes the environmental state ss_k of the current decision step k and, according to the environmental state ss_k, takes an action a_k acting on the environment; the environment feeds back a reward r_k to the deep reinforcement learning agent, thereby realizing the interaction between the deep reinforcement learning agent and the environment.
According to one embodiment of the invention, at the current decision step k, the environmental state ss_k observed by the source node is as follows:
the environmental state ss_k comprises the k-th slice x_k of one packet and the M (e.g. M = 10) most recently received encoded packets stored in the buffer of the source node's next-hop intermediate node, M being an integer greater than 1, i.e. ss_k = (x_k, P_S(1), …, P_S(M)), where P_S(l) is the l-th encoded packet in the source node's next-hop intermediate node buffer.
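How the tuple ss_k is turned into a numeric input vector for the DQN is not specified in the patent; the following sketch shows one plausible featurization, concatenating the byte values of the slice and of the M buffered packets, zero-padded to a fixed length (the packet length is an assumption for the example):

    import numpy as np

    M = 10            # number of buffered packets observed, as in the text
    PKT_LEN = 32      # assumed fixed slice/packet length in bytes (illustrative)

    def to_vector(data, length=PKT_LEN):
        """Represent a slice/packet as a fixed-length float vector of its byte values."""
        padded = bytes(data[:length]).ljust(length, b"\x00")
        return np.frombuffer(padded, dtype=np.uint8).astype(np.float32)

    def build_source_state(x_k, next_hop_buffer):
        """ss_k = (x_k, P_S(1), ..., P_S(M)) flattened into a single vector for the DQN input."""
        buffered = list(next_hop_buffer)[-M:]            # the M most recently received packets
        buffered += [b""] * (M - len(buffered))          # pad if fewer than M are known yet
        parts = [to_vector(x_k)] + [to_vector(p) for p in buffered]
        return np.concatenate(parts)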
Specifically, in the current environmental state ss_k, the source node performs action a_k:
at each decision step k, the source node selects an action a_k ∈ A_S to determine the coding coefficient g(x_k) of the k-th slice x_k of the packet, g(x_k) = a_k, where A_S = {0, 1, …, (q−1)} and q is the field size of the Galois field; in one embodiment q = 2, in another embodiment q is a positive integer.
According to one embodiment of the invention, the reward r_k received from the environment after the source node performs action a_k in the current environmental state ss_k is as follows:
when the encoded packet sent by the source node can increase the rank of the linear system formed by the encoded packets in the source node's next-hop intermediate node buffer, r_k = 1; otherwise, r_k = 0.
After the K decision steps, the coding coefficients of the K slices of one packet have been determined; the source node then encodes the K slices using the determined coding coefficients and transmits the encoded packet P_S, with P_S = G_S · X, where X = [x_1, x_2, …, x_K] and G_S = [g(x_1), g(x_2), …, g(x_K)].
In one embodiment, the source node keeps a copy of the encoded packets addressed to the next-hop node, forming a mirror of the next-hop intermediate node buffer P_S(1), …, P_S(M) on the source node, and confirms whether the next-hop node has received an encoded packet through the ACK fed back by the next-hop node. If the node does not receive an ACK, meaning that the next-hop node did not receive the encoded packet, P_S(1), …, P_S(M) does not change, i.e. the buffered-packet component of the state ss_k when the source node sends the next encoded packet does not change relative to sending the current packet. If the node receives an ACK, meaning that the next-hop node successfully received the encoded packet, P_S(1), …, P_S(M) changes, i.e. the buffered-packet component of the state ss_k when the source node sends the next encoded packet changes relative to sending the current packet. It follows that whether an ACK is received is determined by the link quality, which in turn affects the encoded packets stored in the buffer P_S(1), …, P_S(M); the coding model of the source node can therefore adaptively adjust the coding coefficients according to changes in network link quality.
In one embodiment, after all K steps have been performed and the encoded packet has been sent to the next-hop node, the rewards of steps 1 through K are determined; the rewards of the K steps are the same. Since the node retains in its buffer mirror P_S(1), …, P_S(M) the encoded packets accepted by the next-hop node, the node can evaluate the behavior according to whether the transmitted encoded packet changes the rank of the linear system formed by the encoded packets in P_S(1), …, P_S(M), regardless of whether an ACK is received.
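The reward check amounts to testing whether a newly accepted encoded packet is innovative, i.e. whether its coding vector increases the rank of the linear system already held in the buffer mirror. A minimal sketch over GF(2), maintaining an incrementally reduced basis of coding vectors (the data layout is an assumption for illustration):

    def reward_for_packet(basis, coding_vector):
        """r_k = 1 if the delivered coding vector (a 0/1 list over GF(2)) increases the rank of
        the linear system formed by the packets already in the next-hop buffer, else r_k = 0."""
        v = list(coding_vector)
        for row in basis:
            pivot = next(i for i, bit in enumerate(row) if bit)   # leading 1 of the basis row
            if v[pivot]:
                v = [a ^ b for a, b in zip(v, row)]               # eliminate over GF(2)
        if any(v):
            basis.append(v)      # innovative packet: rank grows, keep the reduced vector
            return 1
        return 0

    # Example with K = 3 slices: a repeated packet is linearly dependent and earns no reward.
    basis = []
    print(reward_for_packet(basis, [1, 0, 1]))   # 1: rank grows to 1
    print(reward_for_packet(basis, [1, 0, 1]))   # 0: dependent, rank unchanged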
Coding of intermediate nodes and their corresponding environmental states, behaviors and rewards
FIG. 2 shows the intermediate node encoding process based on deep reinforcement learning. The process by which the current intermediate node i re-encodes a packet P_j received from its previous-hop node j is regarded as a Markov decision process (MDP) comprising M (e.g. M = 10) decision steps; in the k-th decision step, intermediate node i decides the coding coefficient of the k-th encoded packet in its buffer and XORs that packet with the current encoded packet P_new. In the first decision step, i.e. when k = 1, P_new = P_j.
According to one embodiment of the invention, two large modules of a deep reinforcement learning agent and a network environment in an intermediate node coding coefficient optimization model based on deep reinforcement learning are designed as follows:
(1) The intermediate node is regarded as a deep reinforcement learning agent;
(2) The abstract environment is a network formed by a current intermediate node i and a next-hop node of the intermediate node i, and comprises the intermediate node i, the next-hop node of the intermediate node i and a link formed by the intermediate node i and the next-hop node z of the intermediate node i;
(3) The deep reinforcement learning agent observes the environmental state s_k of the current decision step k and, according to the environmental state s_k, takes an action a_k acting on the environment; the environment feeds back a reward r_k to the deep reinforcement learning agent, thereby realizing the interaction between the deep reinforcement learning agent and the environment.
According to one embodiment of the invention, the environmental state s_k observed by intermediate node i at the current decision step k is as follows:
the environmental state s_k comprises the current encoded packet P_new, the k-th encoded packet P_j(k) in the buffer of intermediate node i, and the M (e.g. M = 10) most recently received encoded packets stored in the buffer of the next-hop node z of intermediate node i, i.e. s_k = (P_new, P_j(k), P_i(1), …, P_i(M)), where P_i(l) is the l-th encoded packet in the buffer of the next-hop node z of intermediate node i, and P_j(1), P_j(2), …, P_j(M) are received earlier than P_j.
According to one embodiment of the invention, in the current environmental state s_k, intermediate node i performs action a_k:
at each decision step k, intermediate node i selects an action a_k ∈ A_R to determine the coding coefficient g(P_j(k)), g(P_j(k)) = a_k, where A_R = {0, 1, …, (q−1)} and q is the field size of the Galois field; in one embodiment q = 2, in another embodiment q is a positive integer.
According to one embodiment of the invention, the reward r_k received from the environment after intermediate node i performs action a_k in the current environmental state s_k is as follows:
when the encoded packet sent by intermediate node i can increase the rank of the linear system formed by the encoded packets in the buffer of the next-hop node z of intermediate node i, r_k = 1; otherwise, r_k = 0.
After the k-th decision step, the current encoded packet P_new has been re-encoded, i.e. P_new = P_new ⊕ (g(P_j(k)) · P_j(k)); in particular, when k = 1, P_new = P_j ⊕ (g(P_j(1)) · P_j(1)). After M decision steps, the encoded packet P_j received by intermediate node i from the previous-hop node j has been re-encoded M times, and finally intermediate node i transmits the encoded packet P_new obtained after the last decision step M.
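A minimal sketch of the M re-encoding steps with q = 2 (the packet contents are illustrative, and the fixed coefficient list stands in for the coefficients that the intermediate node's DQN would choose step by step):

    def xor_bytes(a, b):
        """Bytewise XOR of two equal-length packets."""
        return bytes(x ^ y for x, y in zip(a, b))

    def reencode(P_j, buffer_packets, coefficients):
        """P_new starts from the received packet P_j; at step k it absorbs the buffered packet
        P_j(k) whenever its chosen coefficient g(P_j(k)) is 1: P_new = P_new XOR (g * P_j(k))."""
        P_new = P_j
        for g, pkt in zip(coefficients, buffer_packets):
            if g:                      # over GF(2), multiplying by g keeps or drops the packet
                P_new = xor_bytes(P_new, pkt)
        return P_new

    # Example with M = 3 buffered packets.
    P_new = reencode(b"\x10\x20", [b"\x01\x02", b"\x03\x04", b"\x05\x06"], coefficients=[1, 0, 1])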
In one embodiment, the intermediate node keeps a copy of the encoded packets addressed to the next-hop node, forming a mirror of the next-hop node buffer P_i(1), …, P_i(M) on the intermediate node, and confirms whether the next-hop node has received an encoded packet through the ACK fed back by the next-hop node. If the node does not receive an ACK, meaning that the next-hop node did not receive the encoded packet, P_i(1), …, P_i(M) does not change, i.e. the buffered-packet component of the state s_k when intermediate node i sends the next encoded packet does not change relative to sending the current packet. If the node receives an ACK, meaning that the next-hop node successfully received the encoded packet, P_i(1), …, P_i(M) changes, i.e. the buffered-packet component of the state s_k when intermediate node i sends the next encoded packet changes relative to sending the current packet. It follows that whether an ACK is received is determined by the link quality, which in turn affects the encoded packets stored in the buffer P_i(1), …, P_i(M); the coding model of the intermediate node can therefore adaptively adjust the coding coefficients according to changes in network link quality.
In one embodiment, after all M steps have been performed and the encoded packet has been sent to the next-hop node, the rewards of steps 1 through M are determined; the rewards of the M steps are the same. Since the node retains in its buffer mirror P_i(1), …, P_i(M) the encoded packets accepted by the next-hop node, the node can evaluate the behavior according to whether the transmitted encoded packet changes the rank of the linear system formed by the encoded packets in P_i(1), …, P_i(M), regardless of whether an ACK is received.
FIG. 3 illustrates a device functional configuration block diagram of intelligent network coding for deep reinforcement learning according to an embodiment of the present invention. The apparatus includes: a source node coding coefficient optimization unit configured to optimize coding coefficients of a data packet on a source node by a depth reinforcement learning coding coefficient optimization model of the source node; an intermediate node coding coefficient optimization unit configured to optimize coding coefficients of the data packet on the intermediate node by a depth reinforcement learning coding coefficient optimization model of the intermediate node; an intelligent network coding unit configured to code information according to the optimized coding coefficient; and the data packet forwarding unit is configured to forward the encoded data packet.
The effects of the present invention will be described below in terms of simulation and platform verification experiments of the present invention.
This example uses the TensorFlow 1.15 framework with Python 3.5 to construct the intelligent network coding method based on deep reinforcement learning and the architecture of the deep neural network. The example considers a multi-hop linear network topology with a single source, multiple intermediate nodes and a single destination, and a multi-hop parallel network topology; FIG. 4 shows the multi-hop linear network topology and FIG. 5 shows the multi-hop parallel network topology.
The intelligent network coding method based on deep reinforcement learning is evaluated using two performance indexes: decoding rate and overhead. Before analyzing the experimental results, the concepts and terms involved in the experiments are briefly described:
Decoding rate: the probability of successful decoding (recovering the original information) after the destination node receives P data packets;
Overhead: used to measure the decoding efficiency of different coding algorithms; it is defined in terms of K, the number of packets into which one piece of information is divided, E, the number of redundant packets when network coding is used, and Nr, the number of packets received at the destination node.
Link quality: this patent represents link quality by the packet error rate (PER). The probability of erroneous packet transmission for a given signal-to-interference-plus-noise ratio (SINR) value γ is PER(γ) = 1 − (1 − BER(γ))^{N_b}, where N_b is the size of a data packet (unit: bit) and BER(γ) is the bit error rate for the given SINR value γ, which depends on the technology employed by the physical layer and the statistical characteristics of the channel.
FIG. 6 shows, for the multi-hop linear network topology with a packet error rate of 0.1 on every link, the relationship between the decoding rate of this example and the number of packets transmitted by the source node and the number of intermediate nodes. It can be seen that as the number of packets transmitted by the source node increases and as the number of intermediate nodes increases, the probability of successful decoding at the destination node improves. Furthermore, for the same number of packets received by the destination node, the larger K is, the lower the destination node's decoding probability. For K = 5, the overhead when the number of intermediate nodes (N) equals 2, 4, 6, 8 is 12.2%, 15.1%, 19.2% and 20.1%, respectively. For K = 10, the overhead when the number of intermediate nodes (N) equals 2, 4, 6, 8 is 2.5%, 4.2%, 4.5% and 5.2%, respectively. The more intermediate nodes there are, the more data packets travel over longer paths (more intermediate nodes) to reach the destination node and the larger the total packet loss rate becomes; some information packets cannot reach the destination node, so the source node needs to send considerable redundant information, and the number Nr of data packets finally received by the destination node (the numerator in the overhead formula) increases, which increases the overhead.
FIG. 7 shows, for the multi-hop parallel network topology with a packet error rate of 0.1 on the links between the source node and the intermediate nodes, 0.3 on the links between the intermediate nodes and the destination node, and 0.8 on the link between the source node and the destination node, the relationship between the decoding rate and the number of packets transmitted by the source node and the number of intermediate nodes. It can be seen that as the number of packets transmitted by the source node increases and as the number of intermediate nodes increases, the probability of successful decoding at the destination node improves. Furthermore, for the same number of packets received by the destination node, the larger K is, the lower the destination node's decoding probability. For K = 5, the overhead when the number of intermediate nodes (N) equals 2, 6, 10, 14 is 12.2%, 15.1%, 19.2% and 20.1%, respectively. For K = 10, the overhead when the number of intermediate nodes (N) equals 2, 6, 10, 14 is 4.8%, 4.1%, 3.8% and 3.1%, respectively.
FIG. 8 shows the generalization capability of the invention over different numbers of intermediate nodes in the linear topology with a packet error rate of 0.1 on every link. A DQN model, defined as Train_{N=1}, is first trained for this example with the number of intermediate nodes N = 1. The trained DQN model is then used to test the decoding rate for other numbers of intermediate nodes; these test results are defined as (Test_{N=i}, Train_{N=1}), i = 2, 4, 6, 8. Finally, these results are compared with the training-and-testing results obtained at the same number of intermediate nodes, defined as (Test_{N=i}, Train_{N=i}), i = 2, 4, 6, 8. It can be seen that the (Test_{N=i}, Train_{N=1}) results and the (Test_{N=i}, Train_{N=i}) results are quite consistent, with root mean square errors (RMSE) of 0.0034, 0.0072, 0.011 and 0.015 at N = 2, 4, 6, 8 respectively, which verifies the generalization capability of the method of the invention over different network scales.
FIG. 9 shows the generalization capability of the invention over different link qualities in the linear topology with the number of intermediate nodes N = 1. A DQN model is trained for this example with a packet error rate PER_{S-R1} = 0.3 on the link between the source S and the intermediate node R1 of FIG. 4 and a packet error rate PER_{R1-D} = 0.3 on the link between the intermediate node R1 and the destination node D; this model is defined as Train_{(0.3, 0.3)}. The trained DQN model is then used to test the decoding rate at other link qualities, (PER_{S-R1} = 0, PER_{R1-D} = 0), (PER_{S-R1} = 0.1, PER_{R1-D} = 0.3) and (PER_{S-R1} = 0.1, PER_{R1-D} = 0.5); a result (Test_{(u, v)}, Train_{(w, y)}) denotes testing, at link quality PER_{S-R1} = u, PER_{R1-D} = v, the DQN model trained at link quality PER_{S-R1} = w, PER_{R1-D} = y. Finally, these results are compared with the training-and-testing results at the same link quality. It can be seen that for the present invention the two sets of results are quite consistent, with root mean square errors (RMSE) of 0, 0.002 and 0.003 at the link qualities (PER_{S-R1} = 0, PER_{R1-D} = 0), (PER_{S-R1} = 0.1, PER_{R1-D} = 0.3) and (PER_{S-R1} = 0.1, PER_{R1-D} = 0.5) respectively, which verifies the generalization capability of the method of the invention over different link qualities.
Finally, the performance of the embodiment of the invention is evaluated on a real test platform. The source node coding coefficient optimization unit, intermediate node coding coefficient optimization unit, intelligent network coding unit and data packet forwarding unit are configured, and the experiments are carried out on Raspberry Pi 3B+ devices. The Raspberry Pi 3B+ has a 1.4 GHz ARM A53 processor, 1 GB of RAM, and integrated wireless and Bluetooth functionality. The DQN model trained in this example is deployed to the Raspberry Pi 3B+ using TensorFlow Lite. In this experiment, the example of the invention is compared with a conventional reference coding algorithm and an existing reinforcement learning based coding algorithm (RL-aided SNC: Dynamic Sparse Coded Multi-Hop Transmissions using Reinforcement Learning). In the reference coding algorithm, the source node uses a conventional fountain code while the intermediate nodes use a random network coding algorithm. Meanwhile, the decoding results obtained in the simulation environment are compared with the decoding results on the real test platform.
FIG. 10 shows the decoding rate of this example compared with the conventional reference coding algorithm and the existing reinforcement learning coding algorithm RL-aided SNC in the multi-hop linear topology with the packet error rate of each link equal to 0.1 and K = 5. It can be seen that, for the same number of intermediate nodes, the decoding rate of the present invention is higher. In addition, the simulation results are consistent with the results obtained on the real test platform: in the simulation environment and on the real test platform, the root mean square errors of the decoding results of the three coding algorithms are 0.0042, 0.0153 and 0.0379, respectively.
The experimental result of the embodiment shows that the intelligent network coding method based on the deep reinforcement learning has higher decoding rate and lower cost compared with the existing coding method.
It should be noted that, the steps in the foregoing embodiments are not necessary, and those skilled in the art may perform appropriate operations, substitutions, modifications and the like according to actual needs.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the invention has been described in detail with reference to the embodiments, those skilled in the art will understand that modifications and equivalents may be made thereto without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (12)

1. A network coding method, the network comprising a source node and an intermediate node, the method comprising:
the source node divides the information to be transmitted into K pieces x_1, x_2, …, x_K, K being an integer greater than 1, determines the coding coefficients g(x_1), g(x_2), …, g(x_K) according to a source node coding model, encodes the K pieces to generate an encoded packet P_S, and transmits the encoded packet P_S to the next-hop node, wherein the source node coding model is obtained by training a DQN network, wherein, during the source node coding model training process, the DQN of the source node is trained by randomly sampling experience from an experience replay memory, wherein the environmental state of each step, ss_k = (x_k, P_S(1), …, P_S(M)), is used as the training input and training is performed by minimizing a predetermined loss function, the output being the optimal cumulative reward value Q obtained after the selected behavior in the environmental state; ss_k is the environmental state of the k-th step, x_k is the k-th piece of the packet, and P_S(1), …, P_S(M) are the M most recently received encoded packets stored in the buffer of the source node's next-hop intermediate node, M being an integer greater than 1;
the intermediate node receives the encoded packets sent by the previous node, re-encodes the received encoded packet P_j over M steps, determines the coding coefficients g(P_j(1)), g(P_j(2)), …, g(P_j(M)) according to an intermediate node coding model, generates an encoded packet P_new, and transmits the encoded packet P_new to the next-hop node, wherein the intermediate node coding model is obtained by training a DQN network, wherein, during the intermediate node coding model training process, the DQN of the intermediate node is trained by randomly sampling experience from an experience replay memory, wherein the environmental state of each step, s_k = (P_new, P_j(k), P_i(1), …, P_i(M)), is used as the training input and training is performed by minimizing a predetermined loss function, the output being the optimal cumulative reward value Q obtained after the selected behavior in the environmental state; s_k is the environmental state of the k-th step, P_new is the current encoded packet, P_j(k) is the k-th encoded packet in the intermediate node buffer, and P_i(1), …, P_i(M) are the M most recently received encoded packets stored in the buffer of the intermediate node's next-hop node z.
2. The method of claim 1, wherein the source node coding model comprises a target network N_s and an execution network N_snode, and the training of the source node coding model comprises the following steps:
Step 110: training N_s by randomly sampling experience from the experience replay memory M_s;
Step 120: sending the trained DQN parameters of N_s to the source node to update N_snode; and/or
Step 130: at the source node, taking the environmental state ss_k as the input of the DQN model of N_snode, outputting the Q value corresponding to each behavior, selecting a behavior with greedy-strategy probability ε to determine the coding coefficients of the K pieces of the original information, collecting the experience of the source node interacting with the environment after execution, and storing the experience in the experience replay memory M_s.
3. The method of claim 1, wherein the intermediate node coding model comprises a target network N_R and an execution network N_Rnode, and the training of the intermediate node coding model includes:
Step 210: training N_R by randomly sampling experience from the experience replay memory M_R;
Step 220: sending the trained DQN parameters of N_R to each intermediate node to update N_Rnode; and/or
Step 230: at each intermediate node, taking the environmental state s_k as the input of the DQN model of N_Rnode, outputting the Q value corresponding to each behavior, selecting behaviors with greedy-strategy probability ε to determine the coding coefficients of the M packets in the intermediate node buffer, collecting the experience of the intermediate node interacting with the environment after execution, and storing the experience in the experience replay memory M_R.
4. The method of claim 2, wherein the training of N_s includes:
taking the coding network environmental state ss_k as the input of N_s and training the neural network by minimizing the loss function L(θ_k) = (Q_target − Q(ss_k, a_k; θ_k))², where k takes the values 1…K and Q_target is the target Q value calculated by N_s;
a_k represents the behavior of the k-th step;
r_k represents the reward after taking the action in the k-th step;
θ_k represents the network parameters of the DQN at step k;
ss_{k+1} represents the environmental state of the network coding at step k+1;
Q(ss_k, a_k; θ_k) represents the Q value after selecting behavior a_k in the environmental state ss_k.
5. The method according to claim 3, wherein the training of N_R includes:
taking the coding network environmental state s_k as the input of N_R and training the neural network by minimizing the loss function L(θ_k) = (Q_target − Q(s_k, a_k; θ_k))², where k takes the values 1…M;
Q_target is the target Q value calculated by N_R;
a_k represents the behavior of the k-th step;
r_k represents the reward after taking the action in the k-th step;
θ_k represents the network parameters of the DQN at step k;
s_{k+1} represents the environmental state of the network coding at step k+1;
Q(s_k, a_k; θ_k) represents the Q value after selecting behavior a_k in the environmental state s_k.
6. The method of claim 4, wherein for N_s:
a_k is the coding coefficient of the k-th slice x_k of the information, a_k ∈ A_S, where A_S = {0, 1, …, (q−1)} and q is the field size of the Galois field;
r_k is 1 when the encoded packet sent by the source node can increase the rank of the linear system formed by the encoded packets in the source node's next-hop intermediate node buffer; otherwise, r_k is 0.
7. The method of claim 5, wherein for N_R:
a_k is the coding coefficient of the k-th packet, a_k ∈ A_R, where A_R = {0, 1, …, (q−1)} and q is the field size of the Galois field;
r_k is 1 when the encoded packet sent by the intermediate node increases the rank of the linear system formed by the encoded packets in the intermediate node's next-hop node buffer; otherwise, r_k is 0.
8. The method of claim 1, wherein if the source node does not receive an ACK, the buffered-packet component P_S(1), …, P_S(M) of the source node's ss_k remains unchanged; if the intermediate node does not receive an ACK, the buffered-packet component P_i(1), …, P_i(M) of the intermediate node's s_k remains unchanged.
9. The method of claim 1, wherein the source node generates the encoded packet P_S by:
P_S = G_S · X, where X = [x_1, x_2, …, x_K] and G_S = [g(x_1), g(x_2), …, g(x_K)].
10. The method of claim 1, wherein the k-th of the M encodings of the intermediate node comprises:
when k = 1, P_new = P_j ⊕ (g(P_j(k)) · P_j(k));
when k > 1, P_new = P_new ⊕ (g(P_j(k)) · P_j(k));
where P_j(k) is the k-th encoded packet in the buffer of the intermediate node and k takes the values 1…M.
11. A computer readable storage medium, in which one or more computer programs are stored which, when executed, are adapted to carry out the method of any one of claims 1-10.
12. A network coded computing system, comprising
A storage device, and one or more processors;
wherein the storage means is for storing one or more computer programs which, when executed by the processor, are for implementing the method of any of claims 1-10.
CN202011344089.5A 2020-11-26 2020-11-26 Intelligent network coding method and equipment based on deep reinforcement learning Active CN112564712B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011344089.5A CN112564712B (en) 2020-11-26 2020-11-26 Intelligent network coding method and equipment based on deep reinforcement learning
PCT/CN2021/118099 WO2022110980A1 (en) 2020-11-26 2021-09-14 Intelligent network coding method and device based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011344089.5A CN112564712B (en) 2020-11-26 2020-11-26 Intelligent network coding method and equipment based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN112564712A CN112564712A (en) 2021-03-26
CN112564712B true CN112564712B (en) 2023-10-10

Family

ID=75045041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011344089.5A Active CN112564712B (en) 2020-11-26 2020-11-26 Intelligent network coding method and equipment based on deep reinforcement learning

Country Status (2)

Country Link
CN (1) CN112564712B (en)
WO (1) WO2022110980A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112564712B (en) * 2020-11-26 2023-10-10 中国科学院计算技术研究所 Intelligent network coding method and equipment based on deep reinforcement learning
CN116074891A (en) * 2021-10-29 2023-05-05 华为技术有限公司 Communication method and related device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102209079A (en) * 2011-06-22 2011-10-05 北京大学深圳研究生院 Transmission control protocol (TCP)-based adaptive network control transmission method and system
CN104079483A (en) * 2013-03-29 2014-10-01 南京邮电大学 Multistage security routing method for delay tolerant network and based on network codes
CN110519020A (en) * 2019-08-13 2019-11-29 中国科学院计算技术研究所 Unmanned systems network intelligence cross-layer data transmission method and system
WO2020215462A1 (en) * 2019-04-24 2020-10-29 香港中文大学(深圳) Batch encoding-based network communication method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9160687B2 (en) * 2012-02-15 2015-10-13 Massachusetts Institute Of Technology Method and apparatus for performing finite memory network coding in an arbitrary network
US9787614B2 (en) * 2015-06-03 2017-10-10 Aalborg Universitet Composite extension finite fields for low overhead network coding
CN111770546B (en) * 2020-06-28 2022-09-16 江西理工大学 Delay tolerant network random network coding method based on Q learning
CN112564712B (en) * 2020-11-26 2023-10-10 中国科学院计算技术研究所 Intelligent network coding method and equipment based on deep reinforcement learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102209079A (en) * 2011-06-22 2011-10-05 北京大学深圳研究生院 Transmission control protocol (TCP)-based adaptive network control transmission method and system
CN104079483A (en) * 2013-03-29 2014-10-01 南京邮电大学 Multistage security routing method for delay tolerant network and based on network codes
WO2020215462A1 (en) * 2019-04-24 2020-10-29 香港中文大学(深圳) Batch encoding-based network communication method and system
CN110519020A (en) * 2019-08-13 2019-11-29 中国科学院计算技术研究所 Unmanned systems network intelligence cross-layer data transmission method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-path coding routing protocol for event-driven sensor networks; 仝杰; 钱德沛; 刘轶; 李世晗; Journal of Xi'an Jiaotong University (Issue 06); full text *

Also Published As

Publication number Publication date
CN112564712A (en) 2021-03-26
WO2022110980A1 (en) 2022-06-02

Similar Documents

Publication Publication Date Title
CN112564712B (en) Intelligent network coding method and equipment based on deep reinforcement learning
KR101751497B1 (en) Apparatus and method using matrix network coding
CN107994971B (en) Coding transmission method and coding communication system for limited buffer area relay link
CN108092742B (en) A kind of communication means based on polarization code
CN103650399A (en) Adaptive generation of correction data units
CN112468265B (en) Wireless local area network modulation coding self-adaptive selection method based on reinforcement learning and wireless equipment
Wang et al. INCdeep: Intelligent network coding with deep reinforcement learning
Saxena et al. Deep learning for frame error probability prediction in BICM-OFDM systems
US9876608B2 (en) Encoding apparatus and encoding method
CN101431358B (en) Vertical layered space-time signal detection method based on M-elite evolution algorithm
CN113923743A (en) Routing method, device, terminal and storage medium for electric power underground pipe gallery
Zhang et al. Deep Deterministic Policy Gradient for End-to-End Communication Systems without Prior Channel Knowledge
WO2018157263A1 (en) Generalized polar codes
CN110505681B (en) Non-orthogonal multiple access scene user pairing method based on genetic method
CN109039531B (en) Method for adjusting LT code coding length based on machine learning
Zhao et al. Joint Computing Resource and Bandwidth Allocation for Semantic Communication Networks
CN109660317A (en) Quantum network transmission method based on self-dual quantum low-density parity check error correction
CN115378548A (en) Connectionless-oriented binary superposition determined linear network coding transmission method
CN117581493A (en) Link adaptation
Ao et al. Deep reinforcement learning based spinal code transmission strategy in long distance FSO communication
CN110190931B (en) Recursive chaotic channel coding method
TWI833065B (en) Network optimizer and network optimization method thereof
CN115515181B (en) Distributed computing method and system based on network coding in wireless environment
CN112332862A (en) Polarization code incremental redundancy hybrid retransmission method and device based on deep reinforcement learning
CN115208821B (en) Cross-network route forwarding method and device based on BP neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant