CN115102867A - Block chain fragmentation system performance optimization method combined with deep reinforcement learning - Google Patents
Block chain fragmentation system performance optimization method combined with deep reinforcement learning
- Publication number
- CN115102867A (application CN202210505118.4A)
- Authority
- CN
- China
- Prior art keywords: block chain, behavior, time, state, block
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- H04L41/142: Network analysis or design using statistical or mathematical methods
- H04L41/14: Network analysis or design
- H04L41/16: Arrangements for maintenance, administration or management of data switching networks using machine learning or artificial intelligence
- G06N3/02, G06N3/08: Neural networks; learning methods
- H04L67/10: Protocols in which an application is distributed across nodes in the network
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The performance optimization method for a blockchain sharding system combined with deep reinforcement learning formulates the blockchain shard-selection problem as a Markov decision process model consisting of four parts: system state, action, reward, and value function. Solving the model means continuously selecting the optimal action in a dynamic blockchain sharding environment so that the throughput of the sharding system is maximized. By continuously exploring and learning the complex relationship between the block size, the block generation time, the number of shards, and the sharding system, the BDQSB algorithm selects the most suitable sharding strategy according to the transmission rates between nodes, the computing capability of the nodes, the consensus history of the nodes, and the probability of malicious nodes, thereby improving the performance of the blockchain sharding system. Compared with other schemes, the invention improves the performance of the blockchain sharding system, solves the action-space explosion problem, and reduces the time cost of neural network training.
Description
Technical Field
The invention belongs to the technical field of data management and evidence storage, relates to intelligent control of blockchain system sharding, and in particular to a method for optimizing the performance of a blockchain sharding system in combination with deep reinforcement learning.
Background
Blockchain sharding means partitioning the nodes of a blockchain system into different shards. Transaction processing capability, and thus blockchain performance, is improved by having the nodes within each shard process transactions in parallel.
A blockchain can be sharded with a static optimization method, in which the sharding strategy stays fixed once the sharding technique is adopted. However, the state of a blockchain system changes constantly, so a static optimization method does not fit a dynamic blockchain environment.
Dynamic optimization methods are now applied to blockchain sharding, for example using a deep reinforcement learning algorithm to provide a sharding strategy dynamically. Based on the current state of the blockchain system, the reinforcement learning algorithm outputs the optimal sharding strategy for that state so that the throughput of the blockchain system is maximized.
A dynamic optimization method derives the sharding strategy from the dynamic blockchain environment and is therefore better suited to a dynamic blockchain system than a static method. Existing work that adds deep reinforcement learning to a blockchain sharding system mostly uses the DQN (Deep Q-Network) algorithm to overcome the shortcomings of static sharding strategies and the state-space explosion problem, but a DQN-based method cannot solve the action-space explosion caused by combining actions once the action dimensions grow.
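As an illustration of this combinatorial blow-up (a minimal sketch with hypothetical discretizations, not taken from the patent): a DQN must enumerate every joint action, while a branching architecture only needs one output head per action dimension.

```python
# Hypothetical discretizations of the three action dimensions (illustrative only).
block_sizes = 8          # candidate block sizes B
block_intervals = 8      # candidate block generation times TI
shard_counts = 8         # candidate numbers of shards K

# A plain DQN needs one Q-value output per joint action (B, TI, K).
dqn_outputs = block_sizes * block_intervals * shard_counts

# A branching network (BDQ) needs one output head per dimension,
# so the number of Q-value outputs grows additively, not multiplicatively.
bdq_outputs = block_sizes + block_intervals + shard_counts

print(f"DQN joint-action outputs: {dqn_outputs}")   # 512
print(f"BDQ branched outputs:     {bdq_outputs}")   # 24
```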
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a performance optimization method for a blockchain sharding system combined with deep reinforcement learning, which combines the BDQ algorithm with blockchain sharding in a dynamic blockchain environment to solve the action-space explosion caused by action combination after the action dimensions grow, and thereby address the problem of low blockchain throughput.
To achieve this purpose, the invention adopts the following technical solution:
The performance optimization method for the blockchain sharding system combined with deep reinforcement learning comprises the following steps:
The system state S_t at time t is defined as the set of transmission rates R_t between nodes, the set of computing capabilities C_t of the nodes, the set of node consensus histories H_t, and the probability P_t of malicious nodes;
The action space A comprises the block size B, the block generation time TI, and the number of blockchain shards K;
The reward R_{t+1} represents the reward obtained after the blockchain sharding system executes an action at time t, i.e., the return obtained by acting in system state S_t, namely the number of transactions processed by the blockchain per second;
The Markov decision process model is summarized as: in the system state S_t at any time t, the optimal action is selected so that the cumulative reward of the system is maximized, i.e.

max E[ Σ_{t=0}^{∞} γ_t · R_{t+1} ]

subject to a_t ∈ A,

where a_t is the action taken by the system at time t and γ_t is the attenuation factor applied to R_{t+1} at time t;
And step 3, the model is solved with the deep reinforcement learning BDQ algorithm: by continuously exploring and learning the complex relationship between the throughput of the blockchain system and the block size, block generation time, and number of shards, the system is finally sharded according to the number of shards, and the nodes within each shard process transactions in parallel according to the block size and block generation time, so that the number of transactions processed by the blockchain is maximized.
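A minimal sketch of how the state and action defined above could be represented in code (the container names and field types are assumptions for illustration, not part of the patent):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class ShardingState:
    """System state S_t = {R_t, C_t, H_t, P_t}."""
    link_rates: Dict[Tuple[int, int], float]  # R_t: transmission rate of link (i, j)
    compute: List[float]                      # C_t: computing capability of each node
    consensus_history: List[int]              # H_t: 1 = verification judged invalid, 0 = valid
    p_malicious: float                        # P_t: probability of malicious nodes

@dataclass
class ShardingAction:
    """Action a_t = (B, TI, K) drawn from the action space A."""
    block_size: float        # B, e.g. in MB
    block_interval: float    # TI, block generation time in seconds
    num_shards: int          # K, number of blockchain shards
```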
Compared with the prior art, the invention has the following beneficial effects:
the algorithm uses a deep reinforcement learning BDQ algorithm according to the dynamically changed block chain system environment to provide an optimal slicing strategy for the block chain system, and changes the original DQN algorithm into the BDQ algorithm. The invention can solve the problem that the neural network is difficult to train caused by behavior space explosion, and can reduce the time cost of neural network training. Compared with other schemes, the invention can improve the performance of the block chain fragmentation system, solve the problem of behavior space explosion and reduce the time cost of neural network training.
Drawings
Fig. 1 shows the structure of the blockchain sharding simulation system.
Fig. 2 shows the neural network structure of the BDQ algorithm.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
The invention is a performance improvement method for a blockchain sharding system combined with deep reinforcement learning. It formulates the blockchain sharding problem as a Markov decision process model, uses the deep reinforcement learning BDQ algorithm as the core of the sharding strategy selection algorithm, and designs an optimal blockchain sharding selection strategy based on deep reinforcement learning (BDQSB). The solution of the constructed model is to continuously select the optimal action over a sequence of system states so that the cumulative system reward is maximized and the blockchain throughput is ultimately improved. Compared with other schemes, the invention improves the performance of the blockchain sharding system, solves the action-space explosion problem, and reduces the time cost of neural network training.
Fig. 1 shows the structure of the blockchain sharding simulation system, which contains N nodes; there is a transmission rate between every pair of nodes, each node has a computing capability, and malicious nodes may be present among the nodes. The sharding process of the system is as follows. According to the number of shards K in the action, the nodes of the directory committee are selected first; the number of nodes in the directory committee is denoted C. The remaining N - C nodes outside the directory committee are then partitioned into shards: each node is assigned to a shard according to the last L bits of its node ID, where L = log2(K), and both node IDs and shard numbers are binary-coded strings. After sharding is complete, the blockchain system has K shards and transactions are distributed to different shards for processing. The nodes within a shard pack transactions into blocks of size B and broadcast the blocks to the other nodes in the shard for consensus; the consensus process produces the consensus history H. Each of the K shards sends its verified block to the directory committee, which packs the K blocks into a final block and broadcasts it to the other directory committee nodes for final consensus, again producing consensus history. The probability P of malicious nodes in the blockchain can be calculated from the intra-shard consensus history and the directory committee consensus history. After this process ends, the state of the blockchain system changes.
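A minimal sketch of the shard-assignment rule described above (the committee-selection rule and the helper name assign_shards are assumptions for illustration; the patent does not fix how the C committee nodes are chosen):

```python
import math

def assign_shards(node_ids, num_shards, committee_size):
    """Split nodes into a directory committee and K shards by the last L bits of the node ID."""
    L = int(math.log2(num_shards))                 # L = log2(K)
    committee = node_ids[:committee_size]          # assumed: first C nodes form the committee
    others = node_ids[committee_size:]             # the remaining N - C nodes are sharded
    shards = {k: [] for k in range(num_shards)}
    for nid in others:
        shard_no = nid & ((1 << L) - 1)            # last L bits of the binary-coded node ID
        shards[shard_no].append(nid)
    return committee, shards

# Example: 16 nodes, K = 4 shards, a 4-node directory committee.
committee, shards = assign_shards(list(range(16)), num_shards=4, committee_size=4)
print(committee)   # [0, 1, 2, 3]
print(shards)      # nodes 4..15 grouped by their last 2 bits
```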
Fig. 2 shows the neural network structure of the BDQ (Branching Dueling Q-Network) algorithm. Existing research that applies deep reinforcement learning to a blockchain sharding system mostly uses the DQN algorithm. Compared with the traditional DQN, the BDQ algorithm introduces a new neural network structure: the action space has one network branch per sub-action, plus a shared decision module. The BDQ algorithm gives each individual action dimension a degree of independence, which provides good scalability. The BDQ algorithm takes the blockchain state S_t = (R_t, C_t, H_t, P_t) as input, abstracts the state through the shared decision module (the hidden layers of the neural network), and splits the output into two kinds of branches: a state branch and the action branches. Each action branch outputs an advantage function for its sub-action, and the state branch outputs the state value function V(S_t). The advantage function of a sub-action is combined with the state value function to obtain the Q function of that sub-action; when the blockchain sharding system makes a decision, it selects each sub-action according to its output Q value.
The performance optimization method specifically comprises the following steps:
1. Establish a Markov decision process model for the blockchain sharding problem; the model consists of the following four parts:
State space: the state space is defined by the system state S_t, i.e., the set of transmission rates R_t between nodes, the set of computing capabilities C_t of the nodes, the set of node consensus histories H_t, and the probability P_t of malicious nodes:

S_t = {R_t, C_t, H_t, P_t}
where R_t = {R_{i,j}}, i, j ∈ N, and R_{i,j} is the transmission rate of the link between node i and node j; C_t = {C_i}, i ∈ N, where C_i is the computing resource of blockchain node i; H_t = {H_i}, where H_i is the consensus history of node i, H_i = 1 or H_i = 0: H_i = 1 indicates that node i's block verification is invalid, and H_i = 0 indicates that node i's block verification is valid; P_t is calculated from the consensus history.
The behavior space is as follows: the behavior space is represented as a, and includes a block size B, a block-out time TI, and a block chain fragmentation number K, and the formula is:
A={B,TI,K}
Reward function: the immediate reward R_{t+1} represents the reward obtained after the blockchain sharding system executes an action at time t, i.e., the return obtained by acting in system state S_t, namely the number of transactions processed by the blockchain per second:

R_{t+1} = K · (B - B_H) / (b · TI)

where B_H is the size of the block header and b is the average size of a transaction. (B - B_H) is the amount of transaction data carried by each block, and (B - B_H)/b, the transaction payload divided by the average transaction size, is the number of transactions processed in each shard. K · (B - B_H)/b is the total number of transactions processed by the K shards, and dividing this total by the block generation time TI gives the number of transactions processed per second by the blockchain, i.e., the transaction throughput.
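A quick numeric sketch of this reward (all parameter values are hypothetical, chosen only to illustrate the formula):

```python
def throughput(num_shards, block_size, header_size, avg_tx_size, block_interval):
    """R_{t+1} = K * (B - B_H) / (b * TI): transactions processed per second."""
    tx_per_block = (block_size - header_size) / avg_tx_size   # transactions per block
    return num_shards * tx_per_block / block_interval          # across K shards, per second

# Hypothetical values: K = 4 shards, 4 MB blocks, 0.05 MB header,
# 0.002 MB average transaction, one block every 8 seconds.
print(throughput(num_shards=4, block_size=4.0, header_size=0.05,
                 avg_tx_size=0.002, block_interval=8.0))   # ≈ 987.5 TPS
```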
Value function: the value function is defined as Q(S_t, a_t):

Q(S_t, a_t) = E[ Σ_{y=0}^{∞} γ^y · R_{t+y+1} | S_t, a_t ]

The value function Q(S_t, a_t) is also called the Q function. a_t ∈ A is the action taken by the system at time t, E[·] is the expectation, y is a future offset relative to time t, and R_{t+y+1} is the reward obtained after the system takes an action at time t + y. γ is the attenuation (discount) factor, 0 ≤ γ < 1, expressing how much weight future rewards receive when an action is taken in a given state, i.e., the influence of the environment; γ^y, the y-th power of γ, is the discount applied to R_{t+y+1}.
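A minimal sketch of this discounted cumulative reward for a finite trajectory (the reward sequence below is hypothetical):

```python
def discounted_return(rewards, gamma):
    """Sum_{y=0}^{T-1} gamma**y * rewards[y]: a finite-horizon estimate of Q(S_t, a_t)."""
    return sum((gamma ** y) * r for y, r in enumerate(rewards))

# Hypothetical per-step throughput rewards R_{t+1}, R_{t+2}, ... with gamma = 0.9.
print(discounted_return([900.0, 950.0, 1000.0, 980.0], gamma=0.9))
```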
Thus, the established Markov decision process model can be summarized as: in the system state S_t at any time t, select the optimal action so that the cumulative reward of the system is maximized; the model formula is

max E[ Σ_{t=0}^{∞} γ_t · R_{t+1} ]

subject to a_t ∈ A,

where γ_t is the attenuation factor applied to R_{t+1} at time t.
2. Model solution and solution algorithm
a. Solution of the model: the solution is obtained by computing the optimal value function; with the optimal value function, the optimal action that maximizes the cumulative reward can be selected in the system state S_t at any time t. The optimal value function is computed as

Q*(S_t, a_t) = E[ R_{t+1} + γ · max_{a_{t+1}} Q*(S_{t+1}, a_{t+1}) ]

and, at any time t, the optimal action is selected as

a_t* = argmax_{a_t} Q*(S_t, a_t)

where Q*(S_t, a_t) is the optimal value function, S_{t+1} is the system state at time t + 1, and a_{t+1} is any of the actions the system may take at time t + 1, i.e., an action in the action space A.
b. Solution algorithm: compute the optimal value function and select the optimal action at each decision so that the cumulative reward is maximized.
The invention selects the deep reinforcement learning BDQ algorithm as the solution algorithm. The algorithm accumulates sample records (S_t, a_t, R_{t+1}, S_{t+1}) through continuous decisions and uses them to train a neural network so that the network approximates the value function; the optimal action is then selected so that the cumulative reward of the model is maximized, where R_{t+1} is the reward obtained after the system takes action a_t and S_{t+1} is the system state at time t + 1.
During neural network training, the BDQ algorithm uses the new neural network structure shown in Fig. 2: the network has one branch per sub-action of the action space, i.e., the network branches correspond one-to-one to the sub-actions of the action space A, plus a shared decision module (the hidden layers of the neural network). The BDQ algorithm gives each independent action dimension a degree of independence and therefore scales well. The BDQ algorithm feeds the blockchain state S_t = (R_t, C_t, H_t, P_t) at time t into the neural network, abstracts the state through the shared decision module, and splits the output into two kinds of branches: a state branch and the action branches. Each action branch outputs the advantage function of its sub-action, namely the advantage function A_1(S_t, a_1) of the block size B, the advantage function A_2(S_t, a_2) of the block generation time TI, and the advantage function A_3(S_t, a_3) of the number of shards K; the state branch outputs the state value function V(S_t). The advantage function of a sub-action is combined with the state value function to obtain the value function of that sub-action, and when the blockchain sharding system makes a decision it selects each sub-action according to its output Q value.
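A minimal PyTorch sketch of such a branching network (the layer sizes, the class name BranchingQNetwork, and the use of PyTorch are assumptions for illustration; the patent does not specify the hidden dimensions):

```python
import torch
import torch.nn as nn

class BranchingQNetwork(nn.Module):
    """Shared decision module + one state-value branch + one advantage branch per sub-action."""
    def __init__(self, state_dim, branch_sizes, hidden=128):
        super().__init__()
        # Shared decision module (hidden layers) that abstracts the state S_t.
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        # State branch: outputs the state value V(S_t).
        self.value = nn.Linear(hidden, 1)
        # Action branches: one advantage head A_d(S_t, a_d) per sub-action (B, TI, K).
        self.advantages = nn.ModuleList(nn.Linear(hidden, n) for n in branch_sizes)

    def forward(self, state):
        h = self.shared(state)
        v = self.value(h)                                  # V(S_t), shape (batch, 1)
        # Q_d(S_t, a_d) = V(S_t) + A_d(S_t, a_d), the combination rule stated in the patent.
        return [v + adv(h) for adv in self.advantages]     # one Q vector per branch

# Example: a 10-dimensional state, with 8 block sizes, 8 block intervals, 8 shard counts.
net = BranchingQNetwork(state_dim=10, branch_sizes=[8, 8, 8])
q_per_branch = net(torch.randn(1, 10))
action = [q.argmax(dim=1).item() for q in q_per_branch]   # one sub-action index per branch
print(action)
```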
The neural network is updated by randomly sampling a minibatch of experience from the experience pool and updating the network parameters by gradient descent. The BDQ algorithm updates the network with the loss function

L = E[ (1/D) · Σ_d ( y_d - Q_d(S_t, a_d) )^2 ]

where D is the number of action branches and the target y_d is defined as

y_d = R_{t+1} + γ · Q_d^-( S_{t+1}, argmax_{a_d'} Q_d(S_{t+1}, a_d') )

that is, the online Q_d network selects, for state S_{t+1}, the sub-action with the largest Q value, and the target network Q_d^- then evaluates the Q value of that state-action pair. In the BDQ algorithm the value function is composed of the state value function V(S_t) and the advantage function A_d(S_t, a_d) of the action:

Q_d(S_t, a_d) = V(S_t) + A_d(S_t, a_d)
The BDQ algorithm uses two neural networks with the same structure: an online network that is updated in real time and a target network that is updated every C steps by copying the online network parameter values to the target network.
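A minimal PyTorch sketch of this per-branch target and loss (the tensor shapes and batch values are hypothetical; in practice the per-branch losses would be averaged over all D branches before the gradient step):

```python
import torch
import torch.nn.functional as F

def branch_loss(q_online_sa, q_online_next, q_target_next, reward, gamma):
    """y_d = R + gamma * Q_d^-(S', argmax_a' Q_d(S', a')); loss = MSE(Q_d(S, a_d), y_d)."""
    best_next = q_online_next.argmax(dim=1, keepdim=True)       # online net picks a_d'
    target_q = q_target_next.gather(1, best_next).squeeze(1)     # target net evaluates it
    y_d = reward + gamma * target_q                              # TD target for this branch
    return F.mse_loss(q_online_sa, y_d.detach())

# Hypothetical batch of 4 transitions for one 8-way branch.
q_online_sa   = torch.randn(4)        # Q_d(S_t, a_d) for the actions actually taken
q_online_next = torch.randn(4, 8)     # online Q_d(S_{t+1}, ·)
q_target_next = torch.randn(4, 8)     # target Q_d^-(S_{t+1}, ·)
reward        = torch.rand(4) * 1000  # throughput rewards R_{t+1}
print(branch_loss(q_online_sa, q_online_next, q_target_next, reward, gamma=0.99))
```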
3. By continuously exploring and learning the complex relationship between the throughput of the blockchain system and the block size, block generation time, and number of shards, the system is finally sharded according to the number of shards, and the nodes within each shard process transactions in parallel according to the block size and block generation time, so that the number of transactions processed by the blockchain is maximized and blockchain performance is improved.
The operating logic of the BDQ-based intelligent control for performance optimization is as follows:
1) Initialize an experience replay pool D of size N; the replay pool D stores the system state S_t of the blockchain sharding system at time t, the action a_t, the reward R_{t+1}, and the system state S_{t+1} of the blockchain system at the next time;
2) Initialize two networks with the same structure, an online network and a target network, whose weights are θ and θ^- respectively;
3) Set the initial time t = 0, and denote the time index of a sample record in the experience replay pool D by τ;
4) Set the initial exploration probability to ε, the amount by which the exploration rate decreases with t to δ_ε, and the minimum exploration probability to ε_min;
5) Start the loop;
6) Obtain the state S_t = {R_t, C_t, H_t, P_t} of the blockchain system at the current time t;
7) Select an action a_t with an ε-greedy strategy: with probability ε choose a random action from the action space A, otherwise choose a_t = argmax_a Q(S_t, a);
8) Execute action a_t: the blockchain system performs sharding, processes transactions in parallel within the shards, packs the transactions into blocks, reaches consensus on the blocks, and sends the blocks that pass consensus to the directory committee for final packing and consensus; obtain the next system state S_{t+1} of the blockchain and calculate R_{t+1};
9) Store (S_t, a_t, R_{t+1}, S_{t+1}) into the cache array;
10) Randomly draw Y sample records (S_τ, a_τ, R_{τ+1}, S_{τ+1}) from the cache array;
11) Compute the sample Q values from the Y records using Q_d(S_τ, a_d) = V(S_τ) + A_d(S_τ, a_d);
12) Update the neural network with the loss function L = E[ (1/D) · Σ_d ( y_d - Q_d(S_τ, a_d) )^2 ];
13) Update the exploration probability ε to max(ε - δ_ε, ε_min), i.e., decay it by δ_ε but never below ε_min;
14) If t mod C = 0, copy the online network parameters to the target network (θ^- ← θ);
15) Increase the time t by 1;
16) End the loop.
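A condensed, runnable sketch of this loop on a stand-in environment (the toy environment toy_env_step, the class BranchingQNet, the layer sizes, and all hyper-parameters are assumptions for illustration; only the control flow mirrors steps 1)-16)):

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, BRANCHES, GAMMA, C_STEPS = 8, [4, 4, 4], 0.99, 50

class BranchingQNet(nn.Module):
    """Shared hidden layer, a state-value head, and one advantage head per sub-action."""
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU())
        self.value = nn.Linear(64, 1)
        self.adv = nn.ModuleList(nn.Linear(64, n) for n in BRANCHES)

    def forward(self, s):
        h = self.shared(s)
        return [self.value(h) + a(h) for a in self.adv]    # Q_d = V + A_d per branch

def toy_env_step(state, action):
    """Stand-in for the sharding simulator: returns a random next state and reward."""
    return torch.rand(STATE_DIM), random.uniform(0.0, 1000.0)

online, target = BranchingQNet(), BranchingQNet()
target.load_state_dict(online.state_dict())
optimizer = torch.optim.Adam(online.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                               # experience replay pool D
eps, eps_min, eps_decay = 1.0, 0.05, 1e-3
state = torch.rand(STATE_DIM)

for t in range(2_000):
    # 7) epsilon-greedy selection of one sub-action per branch.
    with torch.no_grad():
        qs = online(state.unsqueeze(0))
    action = [random.randrange(n) if random.random() < eps else q.argmax(dim=1).item()
              for n, q in zip(BRANCHES, qs)]
    # 8)-9) act in the (toy) environment and store the transition.
    next_state, reward = toy_env_step(state, action)
    replay.append((state, action, reward, next_state))
    # 10)-12) sample a minibatch and take a gradient step on the branch-averaged loss.
    if len(replay) >= 64:
        batch = random.sample(replay, 64)
        s = torch.stack([b[0] for b in batch])
        a = torch.tensor([b[1] for b in batch])             # (64, 3) sub-action indices
        r = torch.tensor([b[2] for b in batch])
        s2 = torch.stack([b[3] for b in batch])
        q_now, q_next, q_next_t = online(s), online(s2), target(s2)
        loss = 0.0
        for d in range(len(BRANCHES)):
            q_sa = q_now[d].gather(1, a[:, d:d + 1]).squeeze(1)
            best = q_next[d].argmax(dim=1, keepdim=True)
            y_d = r + GAMMA * q_next_t[d].gather(1, best).squeeze(1)
            loss = loss + F.mse_loss(q_sa, y_d.detach())
        optimizer.zero_grad()
        (loss / len(BRANCHES)).backward()
        optimizer.step()
    # 13)-15) decay exploration, refresh the target network every C steps, advance time.
    eps = max(eps - eps_decay, eps_min)
    if t % C_STEPS == 0:
        target.load_state_dict(online.state_dict())
    state = next_state
```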
The invention provides a specific embodiment of blockchain sharding optimization as follows:
The blockchain sharding system has 200 nodes. The blockchain system state S_t at time t is input into the neural network of the BDQ algorithm, which outputs Q values for the three sub-actions B, TI, and K; the sub-action with the largest Q value in each branch is selected to form the action a_t to be executed by the blockchain sharding system. Suppose action a_t sets the number of shards to K = 4. According to this action, the blockchain sharding system first selects the nodes of the directory committee, whose size is C = 20, and then partitions the remaining N - C = 180 nodes into 4 shards of 45 nodes each. After sharding is complete, transactions are distributed to the different shards for processing, and the nodes within each shard pack transactions into blocks of size B = 4 according to the block generation time TI = 8. Because the blockchain nodes are divided into 4 different shards, the 4 shards process transactions in parallel and the throughput of the blockchain sharding system is improved. The BDQ algorithm provides the sharding strategy for the dynamic blockchain sharding system in real time and improves its performance. Compared with the DQN algorithm, the BDQ algorithm improves blockchain throughput.
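A quick check of the arithmetic in this embodiment (the committee size C = 20 follows from N - C = 180 stated above):

```python
# Worked numbers for the embodiment: N = 200 nodes, K = 4 shards, committee of C = 20 nodes.
N, K, C = 200, 4, 20
sharded_nodes = N - C            # 180 nodes outside the directory committee
per_shard = sharded_nodes // K   # 45 nodes in each of the 4 shards
print(sharded_nodes, per_shard)  # 180 45
```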
Claims (9)
1. A performance optimization method for a blockchain sharding system combined with deep reinforcement learning, characterized by comprising the following steps:
step 1, a blockchain simulation system comprises N nodes; there is a transmission rate between every pair of nodes, each node has a computing capability, and malicious nodes exist among the nodes;
step 2, establishing a Markov decision process model for the blockchain sharding problem, the model consisting of four parts: the system state S_t, the action space A, the reward R_{t+1}, and the value function Q(S_t, a_t);
the system state S_t at time t is defined as the set of transmission rates R_t between nodes, the set of computing capabilities C_t of the nodes, the set of node consensus histories H_t, and the probability P_t of malicious nodes;
the action space A comprises the block size B, the block generation time TI, and the number of blockchain shards K;
the reward R_{t+1} represents the reward obtained after the blockchain sharding system executes an action at time t, i.e., the return obtained by acting in system state S_t, namely the number of transactions processed by the blockchain per second;
the Markov decision process model is summarized as: in the system state S_t at any time t, the optimal action is selected so that the cumulative reward of the system is maximized, i.e.

max E[ Σ_{t=0}^{∞} γ_t · R_{t+1} ]

subject to a_t ∈ A,

where a_t is the action taken by the system at time t and γ_t is the attenuation factor applied to R_{t+1} at time t;
and step 3, solving the model with the deep reinforcement learning BDQ algorithm, continuously exploring and learning the complex relationship between the throughput of the blockchain system and the block size, block generation time, and number of shards, finally sharding according to the number of shards, and having the nodes within each shard process transactions in parallel according to the block size and block generation time, so that the number of transactions processed by the blockchain is maximized.
2. The performance optimization method for a blockchain sharding system combined with deep reinforcement learning of claim 1, characterized in that, in the system state S_t, R_t = {R_{i,j}}, i, j ∈ N, where R_{i,j} is the transmission rate of the link between node i and node j; C_t = {C_i}, i ∈ N, where C_i is the computing resource of node i; H_t = {H_i}, where H_i is the consensus history of node i, H_i = 1 or H_i = 0: H_i = 1 indicates that node i's block verification is invalid, and H_i = 0 indicates that node i's block verification is valid; and P_t is calculated from the consensus history.
3. The performance optimization method for a blockchain sharding system combined with deep reinforcement learning of claim 1, characterized in that the reward R_{t+1} is given by

R_{t+1} = K · (B - B_H) / (b · TI)

where B_H is the size of the block header and b is the average size of a transaction.
4. The performance optimization method for a blockchain sharding system combined with deep reinforcement learning of claim 1, characterized in that the value function Q(S_t, a_t) is given by

Q(S_t, a_t) = E[ Σ_{y=0}^{∞} γ^y · R_{t+y+1} | S_t, a_t ]

where E[·] is the expectation, y is a future offset relative to time t, R_{t+y+1} is the reward obtained after the system takes an action at time t + y, γ is the attenuation factor, 0 ≤ γ < 1, expressing how much weight future rewards receive when an action is taken in a given state, i.e., the influence of the environment, and γ^y is the discount applied to R_{t+y+1}.
5. The performance optimization method for a blockchain sharding system combined with deep reinforcement learning of claim 1, characterized in that, in step 2, the optimal value function is obtained through calculation, i.e., the optimal action that maximizes the cumulative reward is selected in the system state S_t at any time t according to the optimal value function, the optimal value function being computed as

Q*(S_t, a_t) = E[ R_{t+1} + γ · max_{a_{t+1}} Q*(S_{t+1}, a_{t+1}) ]

and, at any time t, the optimal action being selected as

a_t* = argmax_{a_t} Q*(S_t, a_t)

where Q*(S_t, a_t) is the optimal value function, S_{t+1} is the system state at time t + 1, and a_{t+1} is any of the actions the system may take at time t + 1, i.e., an action in the action space A.
6. The performance optimization method for a blockchain sharding system combined with deep reinforcement learning of claim 1, characterized in that the BDQ algorithm accumulates sample records (S_t, a_t, R_{t+1}, S_{t+1}) through continuous decisions and uses them to train the neural network, so that the neural network approximates the value function and the optimal action is then selected such that the cumulative reward of the model is maximized, where S_{t+1} represents the system state at time t + 1.
7. The performance optimization method for a blockchain sharding system combined with deep reinforcement learning of claim 1 or 6, characterized in that, in the neural network of the BDQ algorithm, the network branches correspond one-to-one to the sub-actions of the action space A, and there is a shared decision module, namely the hidden layers of the neural network; the system state S_t = {R_t, C_t, H_t, P_t} at time t is abstracted through the shared decision module, and the output is split into two kinds of branches, a state branch and the action branches; each action branch outputs the advantage function of its sub-action, namely the advantage function A_1(S_t, a_1) of the block size B, the advantage function A_2(S_t, a_2) of the block generation time TI, and the advantage function A_3(S_t, a_3) of the number of shards K; the state branch outputs the state value function V(S_t); the advantage function of a sub-action is combined with the state value function to obtain the value function of that sub-action, and when the blockchain sharding system makes a decision it selects each sub-action according to its output Q value.
8. The performance optimization method for a blockchain sharding system combined with deep reinforcement learning of claim 7, characterized in that the neural network is updated by randomly sampling a minibatch of experience from the experience pool and updating the network parameters by gradient descent, the BDQ algorithm updating the network with the loss function

L = E[ (1/D) · Σ_d ( y_d - Q_d(S_t, a_d) )^2 ]

where D is the number of action branches and y_d is defined as

y_d = R_{t+1} + γ · Q_d^-( S_{t+1}, argmax_{a_d'} Q_d(S_{t+1}, a_d') )

that is, the online Q_d network selects, for state S_{t+1}, the sub-action with the largest Q value, and the target network then evaluates the Q value of that state-action pair; the value function in the BDQ algorithm is composed of the state value function V(S_t) and the advantage function A_d(S_t, a_d) of the action:

Q_d(S_t, a_d) = V(S_t) + A_d(S_t, a_d)

The BDQ algorithm comprises two neural networks with the same structure: an online network that is updated in real time and a target network that is updated every C steps by assigning the online network parameter values to the target network.
9. The performance optimization method for a blockchain sharding system combined with deep reinforcement learning of claim 1, characterized in that the operating logic for performance optimization based on the BDQ algorithm is as follows:
1) Initialize an experience replay pool D of size N; the replay pool D stores the system state S_t of the blockchain sharding system at time t, the action a_t, the reward R_{t+1}, and the system state S_{t+1} of the blockchain system at the next time;
2) Initialize two networks with the same structure, an online network and a target network, whose weights are θ and θ^- respectively;
3) Set the initial time t = 0, and denote the time index of a sample record in the experience replay pool D by τ;
4) Set the initial exploration probability to ε, the amount by which the exploration rate decreases with t to δ_ε, and the minimum exploration probability to ε_min;
5) Start the loop;
6) Obtain the state S_t = {R_t, C_t, H_t, P_t} of the blockchain system at the current time t;
7) Select an action a_t with an ε-greedy strategy: with probability ε choose a random action from the action space A, otherwise choose a_t = argmax_a Q(S_t, a);
8) Execute action a_t: the blockchain system performs sharding, processes transactions in parallel within the shards, packs the transactions into blocks, reaches consensus on the blocks, and sends the blocks that pass consensus to the directory committee for final packing and consensus; obtain the next system state S_{t+1} of the blockchain and calculate R_{t+1};
9) Store (S_t, a_t, R_{t+1}, S_{t+1}) into the cache array;
10) Randomly draw Y sample records (S_τ, a_τ, R_{τ+1}, S_{τ+1}) from the cache array;
11) Compute the sample Q values from the Y records using Q_d(S_τ, a_d) = V(S_τ) + A_d(S_τ, a_d);
12) Update the neural network with the loss function L = E[ (1/D) · Σ_d ( y_d - Q_d(S_τ, a_d) )^2 ];
13) Update the exploration probability ε to max(ε - δ_ε, ε_min), i.e., decay it by δ_ε but never below ε_min;
14) If t mod C = 0, copy the online network parameters to the target network (θ^- ← θ);
15) Increase the time t by 1;
16) End the loop.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210505118.4A (CN115102867B) | 2022-05-10 | 2022-05-10 | Block chain slicing system performance optimization method combining deep reinforcement learning
Publications (2)

Publication Number | Publication Date
---|---
CN115102867A | 2022-09-23
CN115102867B | 2023-04-25
Family
ID=83287942
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2005064868A1 (en) * | 2003-11-25 | 2005-07-14 | Freescale Semiconductor, Inc. | Network message processing using pattern matching |
CN110389591A (en) * | 2019-08-29 | 2019-10-29 | 哈尔滨工程大学 | A kind of paths planning method based on DBQ algorithm |
CN111132175A (en) * | 2019-12-18 | 2020-05-08 | 西安电子科技大学 | Cooperative computing unloading and resource allocation method and application |
CN112261674A (en) * | 2020-09-30 | 2021-01-22 | 北京邮电大学 | Performance optimization method of Internet of things scene based on mobile edge calculation and block chain collaborative enabling |
EP3985579A1 (en) * | 2020-10-14 | 2022-04-20 | Bayerische Motoren Werke Aktiengesellschaft | Regional batching technique with reinforcement learning based decision controller for shared autonomous mobility fleet |
CN113361706A (en) * | 2021-05-18 | 2021-09-07 | 深圳大数点科技有限公司 | Data processing method and system combining artificial intelligence application and block chain |
CN113297310A (en) * | 2021-06-15 | 2021-08-24 | 广东工业大学 | Method for selecting block chain fragmentation verifier in Internet of things |
CN113570039A (en) * | 2021-07-22 | 2021-10-29 | 同济大学 | Optimized consensus block chain system based on reinforcement learning |
CN113645702A (en) * | 2021-07-30 | 2021-11-12 | 同济大学 | Internet of things system supporting block chain and optimized by strategy gradient technology |
Non-Patent Citations (4)
- HANG SHUAI et al.: "Branching Dueling Q-Network-Based Online Scheduling of a Microgrid With Distributed Energy Storage Systems"
- SHIJING YUAN et al.: "Sharding for Blockchain based Mobile Edge Computing System: A Deep Reinforcement Learning Approach"
- 宋琪杰, 陈铁明, 陈园, 马栋捷, 翁正秋: "Research on consensus mechanism optimization for blockchains oriented to the Internet of Things" (面向物联网区块链的共识机制优化研究)
- 曾晶晶: "Analysis and evaluation of several vulnerabilities of blockchain application systems" (区块链应用系统若干脆弱性分析与评测)
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116702583A (en) * | 2023-04-20 | 2023-09-05 | 北京科技大学 | Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning |
CN116702583B (en) * | 2023-04-20 | 2024-03-19 | 北京科技大学 | Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning |
CN116506444A (en) * | 2023-06-28 | 2023-07-28 | 北京科技大学 | Block chain stable slicing method based on deep reinforcement learning and reputation mechanism |
CN116506444B (en) * | 2023-06-28 | 2023-10-17 | 北京科技大学 | Block chain stable slicing method based on deep reinforcement learning and reputation mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN115102867B (en) | 2023-04-25 |
Legal Events

Date | Code | Title
---|---|---
| PB01 | Publication
| SE01 | Entry into force of request for substantive examination
| GR01 | Patent grant