CN115102867B - Block chain slicing system performance optimization method combining deep reinforcement learning - Google Patents

Block chain slicing system performance optimization method combining deep reinforcement learning

Info

Publication number
CN115102867B
CN115102867B
Authority
CN
China
Prior art keywords
time
block chain
behavior
slicing
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210505118.4A
Other languages
Chinese (zh)
Other versions
CN115102867A (en)
Inventor
万剑雄
姚冰冰
李雷孝
刘楚仪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN202210505118.4A priority Critical patent/CN115102867B/en
Publication of CN115102867A publication Critical patent/CN115102867A/en
Application granted Critical
Publication of CN115102867B publication Critical patent/CN115102867B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • H04L 41/142: Network analysis or design using statistical or mathematical methods
    • H04L 41/14: Network analysis or design
    • H04L 41/16: Network management using machine learning or artificial intelligence
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Pure & Applied Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Algebra (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The block chain slicing system performance optimization method combined with deep reinforcement learning establishes the block chain slice selection problem as a Markov decision process model consisting of four parts: system states, behaviors, rewards and a cost function. Solving the model means continuously selecting the optimal behavior in the dynamic block chain slicing system environment so as to maximize the throughput of the block chain slicing system. By continuously exploring and learning the complex relations between the block size, the block-out time, the number of block chain slices and the block chain slicing system, the BDQSB algorithm can select the most suitable slicing strategy according to the transmission rate among nodes, the computing capability of the nodes, the consensus history of the nodes and the probability of malicious nodes, and thereby improve the performance of the block chain slicing system. Compared with other schemes, the invention can further improve the performance of the block chain slicing system, solve the problem of behavior space explosion and reduce the training time cost of the neural network.

Description

Block chain slicing system performance optimization method combining deep reinforcement learning
Technical Field
The invention belongs to the technical field of data management and evidence storage, relates to intelligent control of blockchain system slicing, and particularly relates to a blockchain slicing system performance optimization method combined with deep reinforcement learning.
Background
Blockchain slicing means partitioning the nodes of a blockchain system into different slices. The transaction-processing capability of the blockchain is improved because the nodes within each slice process transactions in parallel, i.e., the performance of the blockchain is improved.
The blockchain can be sliced using a static optimization method, where "static" means that the slicing strategy used by the blockchain system is always fixed. However, the blockchain system varies over time, so a static optimization method does not fit the dynamic blockchain environment.
Currently, dynamic optimization methods are adopted for slicing the blockchain system; for example, a deep reinforcement learning algorithm dynamically provides a slicing strategy for the blockchain system. The reinforcement learning algorithm provides the optimal slicing strategy for the current system state of the blockchain, so that the throughput of the blockchain system is maximized.
The dynamic optimization method provides a slicing strategy according to the dynamic blockchain system environment and is therefore better suited to a dynamic blockchain system than a static optimization method. At present, when deep reinforcement learning is added to a blockchain slicing system, most research uses the DQN (Deep Q Network) algorithm to overcome the shortcomings of a static blockchain slicing strategy and the problem of state space explosion, but methods using the DQN algorithm cannot solve the problem of behavior space explosion caused by combining behaviors after the behavior dimensions expand.
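The scale of this behavior space explosion can be illustrated with a few lines of Python (the numbers of candidate values below are assumed for illustration): a DQN needs one output per joint combination of block size, block-out time and slice number, while a branching architecture such as BDQ needs only one output head per sub-behavior value.

# Hypothetical counts of candidate values for each behavior dimension; the
# actual discretisation is not specified at this point in the text.
n_block_sizes = 8      # candidate block sizes B
n_block_times = 8      # candidate block-out times TI
n_slice_counts = 8     # candidate slice numbers K

dqn_outputs = n_block_sizes * n_block_times * n_slice_counts   # 512 joint behaviors
bdq_outputs = n_block_sizes + n_block_times + n_slice_counts   # 24 branch outputs

print(dqn_outputs, bdq_outputs)   # multiplicative vs. additive growth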
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a method for optimizing the performance of a blockchain slicing system combined with deep reinforcement learning, which combines a blockchain slicing technology with a deep reinforcement learning BDQ algorithm in a dynamic blockchain environment so as to solve the problem of behavior space explosion caused by behavior combination after behavior dimension expansion and further solve the problem of low blockchain throughput.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the block chain slicing system performance optimization method combined with deep reinforcement learning comprises the following steps:
step 1, a block chain simulation system comprises N nodes, wherein all nodes have transmission rates, the nodes have computing power, and malicious nodes exist in the nodes;
step 2, establishing a Markov decision process model for the block chain slicing problem, wherein the model consists of four parts: the system state S_t, the behavior space A, the reward R_{t+1}, and the cost function Q(S_t, a_t);
the system state S_t at time t is defined as the set of transmission rates R_t between nodes, the set of computing capabilities C_t of the nodes, the set of consensus histories H_t of the nodes, and the probability P_t of malicious nodes;
the behavior space A comprises the block size B, the block-out time TI, and the number of blockchain slices K;
the reward R_{t+1} represents the reward obtained after the blockchain slicing system executes a behavior at time t, i.e., the benefit obtained by taking an action in the system state S_t at time t, namely the number of transactions processed by the blockchain per second;
the Markov decision process model is summarized as: in the system state S_t at any time t, the optimal behavior is selected so that the cumulative system reward is maximized, with the formula:

max E[ Σ_{t=0}^{∞} γ^t R_{t+1} ]

constrained to a_t ∈ A,

where a_t is the behavior taken by the system at time t and γ^t is the attenuation factor of R_{t+1} at time t;
step 3, adopting the deep reinforcement learning BDQ algorithm to solve the model: by continuously exploring and learning the complex relations between the throughput of the blockchain system and the block size, the block-out time and the number of blockchain slices, slicing is finally performed according to the number of blockchain slices, and the nodes within each slice process transactions in parallel according to the block size and the block-out time, so that the number of transactions processed by the blockchain is maximized.
Compared with the prior art, the invention has the beneficial effects that:
the algorithm provides an optimal slicing strategy for the block chain system by using a deep reinforcement learning BDQ algorithm according to the dynamically changed block chain system environment, and changes the original DQN algorithm into the BDQ algorithm. The invention can solve the problem that the neural network is difficult to train caused by the action space explosion, and can reduce the time cost of the neural network training. Compared with other schemes, the invention can further improve the performance of the block chain slicing system, solve the problem of behavior space explosion and reduce the training time cost of the neural network.
Drawings
FIG. 1 is a block chain slice simulation system block diagram.
Fig. 2 is a neural network structure diagram of the BDQ algorithm.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples.
The invention relates to a method for improving the performance of a blockchain slicing system combined with deep reinforcement learning. A Markov decision process model is established for the blockchain slicing problem, the deep reinforcement learning BDQ algorithm is provided as the core of the blockchain slicing strategy selection algorithm, and an optimal blockchain slicing selection strategy based on deep reinforcement learning (Branching Dueling Q-Network Shard-Based blockchain, BDQSB) is designed. The solution of the model constructed by the invention is to continuously select the optimal behavior over a series of system states, so that the cumulative reward of the system is maximized and the throughput of the blockchain is finally improved. Compared with other schemes, the invention can further improve the performance of the blockchain slicing system, solve the problem of behavior space explosion, and reduce the training time cost of the neural network.
FIG. 1 is the architecture diagram of the blockchain slicing simulation system, wherein the simulation system comprises N nodes, all nodes have transmission rates, the nodes have computing power, and malicious nodes exist among the nodes. The slicing process of the system is as follows: according to the number K of blockchain slices in the behavior, the nodes of the directory committee are selected first; the number of nodes in the directory committee is denoted C. The remaining N - C nodes outside the directory committee are then sliced: the nodes are divided into different slices according to the last L bits of the node ID, where L = log_2 K and the node ID and the slice number are binary-coded strings. After the blockchain slicing is completed, the blockchain system obtains K slices, and transactions are distributed to the different slices for processing; the nodes within each slice package the transactions into blocks of size B and broadcast the blocks to the other nodes in the slice for consensus, and a consensus history H is generated during the consensus process. The K slices send the blocks that pass verification within each slice to the directory committee, and the directory committee finally packages the K blocks into a final block and broadcasts it to the other nodes in the directory committee for final consensus, forming a consensus history. The probability P of malicious nodes in the blockchain can be calculated from the intra-slice consensus history and the consensus history in the directory committee. After the above process ends, the state of the blockchain system changes. A minimal sketch of this slicing step is given below.
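A minimal Python sketch of this slicing step (the committee-selection rule and the integer node-ID format are assumptions made for illustration; the text above only specifies that the remaining N - C nodes are assigned to slices by the last L = log_2 K bits of their binary node IDs):

import math
import random

def slice_nodes(node_ids, K, committee_size):
    """Assumed illustration: pick a directory committee, then assign every
    remaining node to one of K slices by the last L = log2(K) bits of its ID."""
    L = int(math.log2(K))
    committee = set(random.sample(node_ids, committee_size))   # selection rule assumed
    slices = {k: [] for k in range(K)}
    for nid in node_ids:
        if nid not in committee:
            slices[nid & ((1 << L) - 1)].append(nid)           # last L bits of the node ID
    return committee, slices

# Example with the numbers used in the embodiment below: N = 200 nodes, K = 4 slices,
# a 20-node directory committee, so 180 nodes are spread over the 4 slices.
committee, slices = slice_nodes(list(range(200)), K=4, committee_size=20)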
Fig. 2 is the neural network structure diagram of the BDQ (Branching Dueling Q-Network) algorithm. Existing research applying deep reinforcement learning algorithms to blockchain slicing systems mostly uses the DQN algorithm; compared with the traditional DQN algorithm, the BDQ algorithm provides a new neural network structure in which the behavior space has several sub-behaviors corresponding to several network branches and a shared decision module. The BDQ algorithm provides a certain degree of independence for each individual behavior dimension and has good scalability. The BDQ algorithm inputs the blockchain state S_t = (R_t, C_t, H_t, P_t) into the neural network; the state is abstracted by the shared decision module (i.e., the hidden layers of the neural network) and the output is split into two branches, a state branch and a behavior branch. The behavior branch outputs the dominance function of each sub-behavior, and the state branch outputs the state value function V(S_t). The dominance function and the state value function of each sub-behavior are combined to obtain the Q function of that sub-behavior, and when the blockchain slicing system makes a decision it selects the corresponding behavior according to the Q value of each sub-behavior.
The performance optimization method specifically comprises the following steps:
1. establishing a Markov decision process model for the block chain slicing problem, wherein the model comprises the following four parts:
State space: the state space is defined as the system state S_t, i.e., the set of transmission rates R_t between nodes, the set of computing capabilities C_t of the nodes, the set of consensus histories H_t of the nodes, and the probability P_t of malicious nodes, with the formula:

S_t = {R_t, C_t, H_t, P_t}

where R_t = {R_{i,j}}, i, j ∈ N, and R_{i,j} represents the transmission rate of the link between node i and node j; C_t = {C_i}, i ∈ N, and C_i is the computing resource of blockchain node i; H_t = {H_i}, where H_i is the consensus history of node i, H_i = 1 or H_i = 0, H_i = 1 indicating that node i's block verification is illegal and H_i = 0 indicating that node i's block verification is legal; P_t is calculated from the consensus history.
Behavior space: the behavior space is denoted A, which includes the block size B, the block-out time TI, and the number of blockchain slices K, with the formula:

A = {B, TI, K}

Reward function: the reward R_{t+1} represents the reward obtained after the blockchain slicing system executes a behavior at time t, i.e., the benefit obtained by taking an action in the system state S_t at time t, namely the number of transactions processed by the blockchain per second, with the formula:

R_{t+1} = K(B - B_H)/(b·TI)

where B_H is the size of the block header and b is the average size of a transaction. (B - B_H) represents the size of the transactions carried by each block, so (B - B_H)/b is the number of transactions per slice, and K(B - B_H)/b is the total number of transactions of the K slices. Dividing the total number of transactions by the block-out time gives the number of transactions processed by the blockchain per second, i.e., the transaction throughput.
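A small numeric sketch of this reward (the parameter values below are assumed for illustration only and are not taken from the description):

def throughput(K, B, TI, B_H, b):
    """Transactions per second: K slices, each block carries (B - B_H) worth of
    transactions of average size b, and a block is produced every TI seconds."""
    return K * (B - B_H) / (b * TI)

# Assumed example values: 4 slices, 4 MB blocks, 0.1 MB header,
# 0.002 MB average transaction size, 8 s block-out time.
print(throughput(K=4, B=4.0, TI=8.0, B_H=0.1, b=0.002))   # about 975 transactions per second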
Cost function: defined as Q(S_t, a_t), with the formula:

Q(S_t, a_t) = E[ Σ_{y=0}^{∞} γ^y R_{t+y+1} | S_t, a_t ]

The cost function Q(S_t, a_t) is also called the Q function; a_t ∈ A is the behavior taken by the system at time t; E denotes the expectation; y is the future time relative to time t; R_{t+y+1} represents the reward obtained after the system takes a behavior at time t+y; γ represents the attenuation factor, expressing how much taking a behavior in a certain state values the future rewards of the system, i.e., the environmental influence, with 0 ≤ γ < 1; γ^y, the y-th power of γ, is the attenuation factor of R_{t+y+1} at time t+y.
Thus, the established Markov decision process model can be summarized as: in the system state S_t at any time t, the optimal behavior is selected so that the cumulative system reward is maximized. The model formula is:

max E[ Σ_{t=0}^{∞} γ^t R_{t+1} ]

constrained to a_t ∈ A,

where γ^t is the attenuation factor of R_{t+1} at time t.
2. Model solution and solving algorithm
a. The solution of the model is to obtain the optimal cost function by calculation, i.e., according to the optimal cost function, the optimal behavior is selected in the system state S_t at any time t so that the cumulative reward is maximized. The optimal cost function is calculated as:

Q*(S_t, a_t) = E[ R_{t+1} + γ max_{a_{t+1}} Q*(S_{t+1}, a_{t+1}) | S_t, a_t ]

At any time t, the optimal behavior selection formula is:

a_t* = argmax_{a_t ∈ A} Q*(S_t, a_t)

where Q*(S_t, a_t) represents the optimal cost function, S_{t+1} represents the system state at time t+1, and a_{t+1} represents any of the behaviors the system may take at time t+1, i.e., a behavior in the behavior space A.

b. The optimal cost function is obtained by calculation and the optimal behavior is selected at each decision, so that the cumulative reward is maximized.
The solution algorithm of the invention is the deep reinforcement learning BDQ algorithm, which accumulates sample records (S_t, a_t, R_{t+1}, S_{t+1}) through continuous decisions and uses these sample records to train a neural network so that the neural network approximates the cost function, thereby selecting the optimal behavior and maximizing the cumulative reward of the model, where R_{t+1} is the reward obtained after the system takes a_t and S_{t+1} is the system state at time t+1.
In the neural network training process, the BDQ algorithm provides a new neural network structure, shown in FIG. 2. The behavior space has several sub-behaviors corresponding to several network branches, i.e., the network branches correspond one-to-one to the sub-behaviors of the behavior space A, and the BDQ algorithm has a shared decision module (the hidden layers of the neural network). The BDQ algorithm provides a certain degree of independence for each individual behavior dimension and has good scalability. The BDQ algorithm inputs the blockchain state S_t = (R_t, C_t, H_t, P_t) at time t into the neural network; the state is abstracted by the shared decision module and the output is split into two branches, a state branch and a behavior branch. The behavior branch outputs the dominance function of each sub-behavior, i.e., the dominance function A_1(S_t, a_1) of the block size B, the dominance function A_2(S_t, a_2) of the block-out time TI, and the dominance function A_3(S_t, a_3) of the number of blockchain slices K; the state branch outputs the state value function V(S_t). The dominance function and the state value function of each sub-behavior are combined to obtain the value function of that sub-behavior, and when the blockchain slicing system makes a decision it selects the corresponding behavior according to the output Q value of each sub-behavior.
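A minimal PyTorch sketch of such a branching dueling network (the layer widths, branch sizes and the mean-advantage aggregation are assumptions; the text above only fixes the overall structure: a shared decision module, a state-value branch, and one dominance branch per sub-behavior B, TI and K):

import torch
import torch.nn as nn

class BDQNetwork(nn.Module):
    """Shared decision module + state-value branch + one dominance (advantage)
    branch per sub-behavior, following the structure described for FIG. 2."""
    def __init__(self, state_dim, branch_sizes, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                    # V(S_t)
        self.advantages = nn.ModuleList(
            [nn.Linear(hidden, n) for n in branch_sizes])    # A_d(S_t, a_d) per branch

    def forward(self, state):
        h = self.shared(state)
        v = self.value(h)                                    # shape (batch, 1)
        # Q_d(S_t, a_d) = V(S_t) + A_d(S_t, a_d); subtracting the per-branch mean
        # advantage is an assumed stabilisation borrowed from the BDQ literature.
        return [v + a - a.mean(dim=1, keepdim=True)
                for a in (adv(h) for adv in self.advantages)]

# Example: the state (R_t, C_t, H_t, P_t) flattened to a 64-dimensional vector
# (dimension assumed); three branches for B, TI and K with 8 candidates each.
net = BDQNetwork(state_dim=64, branch_sizes=[8, 8, 8])
q_per_branch = net(torch.zeros(1, 64))    # list of three (1, 8) Q-value tensors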
The updating process of the neural network is to randomly extract minibatch-sized experiences from the experience pool and update the neural network parameters by gradient descent. The update formula of the BDQ algorithm for the loss function is:

L(θ) = E[ Σ_d ( y_d - Q_d(S_τ, a_d) )^2 ]

where y_d is defined as:

y_d = R_{τ+1} + γ Q_d^-( S_{τ+1}, argmax_{a_d'} Q_d(S_{τ+1}, a_d') )

that is, the sub-behavior a_d corresponding to the maximum Q value is selected in the online Q_d network according to the state S_{τ+1}, and the corresponding Q value is then taken from the target network Q_d^- according to that state-behavior pair. The cost function in the BDQ algorithm consists of the state value function V(S_t) and the dominance function A_d(S_t, a_d) of the behavior; the cost function is:

Q_d(S_t, a_d) = V(S_t) + A_d(S_t, a_d)
two neural networks with the same structure exist in the BDQ algorithm, wherein the online network is updated in real time, the target network is updated once every C steps, and the online network parameter value is assigned to the target network.
3. By continuously exploring and learning the complex relation between the throughput of the blockchain system and the block size, the block-out time and the number of blockchain slices, slicing is finally carried out according to the number of blockchain slices, and the nodes in each slice process transactions in parallel according to the block size and the block-out time, so that the number of transactions processed by the blockchain is maximized and the performance of the blockchain is improved.
The running logic for performing intelligent control of performance optimization based on the BDQ algorithm is as follows:
1) Initialize an experience replay pool D of size N; the pool D stores the system state S_t of the blockchain slicing system at time t, the behavior a_t, the reward R_{t+1}, and the system state S_{t+1} of the blockchain system at the next time;
2) Initialize two networks with the same structure, an online network and a target network, whose weights are θ and θ^- respectively;
3) Set the initial time t = 0, and denote the time of a sample record in the experience replay pool D as τ;
4) Initialize the exploration probability ε of behaviors, the amount Δ_ε by which the exploration rate decreases with t, and the minimum exploration probability ε_min;
5) Start the loop body;
6) Obtain the current blockchain system state at time t, S_t = {R_t, C_t, H_t, P_t};
7) Select a behavior using the ε-greedy policy:
a_t = a behavior chosen at random from A with probability ε, and a_t = argmax_a Q(S_t, a), i.e. the behavior whose sub-behaviors have the largest Q values, with probability 1 - ε;
8) Execute the behavior a_t: the blockchain system performs slicing, processes transactions in parallel within the slices, packages the transactions into blocks, performs consensus on the blocks, and sends the blocks that pass consensus to the directory committee for final packaging and consensus; then obtain the system state S_{t+1} of the blockchain system at the next time and calculate R_{t+1};
9) Store (S_t, a_t, R_{t+1}, S_{t+1}) into the cache array;
10) Randomly extract Y sample records (S_τ, a_τ, R_{τ+1}, S_{τ+1}) from the cache array;
11) Using the Y records, calculate the Q sample values as follows:
Q_d(S_τ, a_d) = V(S_τ) + A_d(S_τ, a_d)
12) Update the neural network using the following loss function:
L(θ) = E[ Σ_d ( y_d - Q_d(S_τ, a_d) )^2 ]
where
y_d = R_{τ+1} + γ Q_d^-( S_{τ+1}, argmax_{a_d'} Q_d(S_{τ+1}, a_d') );
13) Update the exploration probability ε to the larger of ε - Δ_ε and ε_min, so that ε decreases over time but never falls below ε_min;
14) If t mod C = 0, the target network copies the online network parameters;
15) Increase the time t by 1;
16) End the loop body.
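A condensed Python sketch of steps 1) to 16) (the environment interface env, the pool and batch sizes, and the optimizer are assumptions; BDQNetwork refers to the illustrative network sketched after FIG. 2 above, and the plain sum over branches in the loss is likewise an assumption):

import random
from collections import deque
import torch

def train_bdqsb(env, online, target, optimizer,
                steps=10_000, eps=1.0, d_eps=1e-4, eps_min=0.05,
                gamma=0.9, C=100, Y=32):
    """Assumed sketch: epsilon-greedy behavior selection, an experience replay
    pool, per-branch targets y_d, and a target network copied every C steps."""
    D = deque(maxlen=100_000)                     # experience replay pool
    S_t = env.reset()                             # S_t = {R_t, C_t, H_t, P_t}, flattened
    for t in range(steps):
        with torch.no_grad():
            qs = online(torch.as_tensor(S_t, dtype=torch.float32).unsqueeze(0))
        # Step 7): epsilon-greedy, one sub-behavior per branch (B, TI, K)
        a_t = [random.randrange(q.shape[1]) if random.random() < eps
               else int(q.argmax(dim=1)) for q in qs]
        S_next, R_next = env.step(a_t)            # step 8): slice, consensus, observe reward
        D.append((S_t, a_t, R_next, S_next))      # step 9)
        if len(D) >= Y:
            batch = random.sample(list(D), Y)     # step 10): Y random sample records
            loss = torch.zeros(())
            for s, a, r, s2 in batch:
                s = torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)
                s2 = torch.as_tensor(s2, dtype=torch.float32).unsqueeze(0)
                q_s, q_s2, q_s2_tgt = online(s), online(s2), target(s2)
                for d in range(len(a)):           # steps 11)-12): per-branch TD target y_d
                    a_star = int(q_s2[d].argmax(dim=1))
                    y_d = r + gamma * q_s2_tgt[d][0, a_star].detach()
                    loss = loss + (y_d - q_s[d][0, a[d]]) ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        eps = max(eps - d_eps, eps_min)           # step 13): decay exploration probability
        if t % C == 0:
            target.load_state_dict(online.state_dict())   # step 14): copy parameters
        S_t = S_next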
The invention provides a specific blockchain slicing optimization embodiment, which comprises the following steps:
The blockchain slicing system has 200 nodes. The state S_t of the blockchain system at time t is input into the neural network of the BDQ algorithm, which outputs the Q values of the three sub-behaviors B, TI and K; the sub-behavior with the largest Q value in each branch is selected to form the behavior a_t to be executed by the blockchain slicing system. Suppose that, according to the number of blockchain slices K = 4 in the behavior a_t, the blockchain slicing system selects the nodes of the directory committee, whose number is C = 20. The remaining N - C = 180 nodes outside the directory committee are then divided into 4 slices of 45 nodes each. After the blockchain is sliced, transactions are distributed to the different slices for processing, and the nodes within each slice package the transactions into blocks of size B = 4 according to the block-out time TI = 8. The blockchain nodes are divided into 4 different slices, so the 4 slices process transactions in parallel, which improves the throughput of the blockchain slicing system. The BDQ algorithm provides a slicing strategy for the dynamically changing blockchain slicing system in real time, improving its performance. Compared with using the DQN algorithm, the BDQ algorithm yields higher blockchain throughput.
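The node arithmetic of this embodiment can be checked in a few lines (the committee size of 20 follows from N - C = 180 with N = 200, and L = log_2 K = 2):

N, K = 200, 4          # nodes and number of blockchain slices in the embodiment
C = 20                 # directory-committee size, so that N - C = 180
L = 2                  # log2(K) bits of the node ID used for slice assignment
print((N - C) // K)    # 45 nodes per slice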

Claims (1)

1. The block chain slicing system performance optimization method combined with deep reinforcement learning is characterized by comprising the following steps of:
step 1, a block chain simulation system comprises N nodes, wherein all nodes have transmission rates, the nodes have computing power, and malicious nodes exist in the nodes;
step 2, establishing a Markov decision process model for the block chain slicing problem, wherein the model consists of four parts: the system state S_t, the behavior space A, the reward R_{t+1}, and the cost function Q(S_t, a_t);
the system state S_t at time t is defined as the set of transmission rates R_t between nodes, the set of computing capabilities C_t of the nodes, the set of consensus histories H_t of the nodes, and the probability P_t of malicious nodes; the formula is:
S_t = {R_t, C_t, H_t, P_t}
where R_t = {R_{i,j}}, i, j ∈ N, R_{i,j} representing the transmission rate of the link between node i and node j; C_t = {C_i}, i ∈ N, C_i being the computing resource of blockchain node i; H_t = {H_i}, H_i being the consensus history of node i, H_i = 1 or H_i = 0, H_i = 1 indicating that node i's block verification is illegal and H_i = 0 indicating that node i's block verification is legal; P_t is calculated from the consensus history;
the behavior space A comprises the block size B, the block-out time TI, and the number of blockchain slices K; the formula is:
A = {B, TI, K}
the reward R_{t+1} represents the reward obtained after the blockchain slicing system executes a behavior at time t, i.e., the benefit obtained by taking an action in the system state S_t at time t, namely the number of transactions processed by the blockchain per second; the formula is:
R_{t+1} = K(B - B_H)/(b·TI)
where B_H is the size of the block header, b is the average size of a transaction, (B - B_H) represents the size of the transactions carried by each block, (B - B_H)/b represents the number of transactions per slice, K(B - B_H)/b represents the total number of transactions of the K slices, and dividing the total number of transactions by the block-out time gives the number of transactions processed by the blockchain per second, i.e., the transaction throughput;
the cost function Q(S_t, a_t) has the formula:
Q(S_t, a_t) = E[ Σ_{y=0}^{∞} γ^y R_{t+y+1} | S_t, a_t ]
where a_t ∈ A is the behavior taken by the system at time t, E denotes the expectation, y is the future time relative to time t, R_{t+y+1} represents the reward obtained after the system takes a behavior at time t+y, γ represents the attenuation factor, expressing how much taking a behavior in a certain state values the future rewards of the system, i.e., the environmental influence, with 0 ≤ γ < 1, and γ^y, the y-th power of γ, is the attenuation factor of R_{t+y+1} at time t+y;
the Markov decision process model is summarized as: in the system state S_t at any time t, the optimal behavior is selected so that the cumulative system reward is maximized, with the formula:
max E[ Σ_{t=0}^{∞} γ^t R_{t+1} ]
constrained to a_t ∈ A,
where a_t is the behavior taken by the system at time t and γ^t is the attenuation factor of R_{t+1} at time t;
wherein: the optimal cost function is obtained by calculation, i.e., according to the optimal cost function, the optimal behavior is selected in the system state S_t at any time t so that the cumulative reward is maximized; the calculation formula of the optimal cost function is:
Q*(S_t, a_t) = E[ R_{t+1} + γ max_{a_{t+1}} Q*(S_{t+1}, a_{t+1}) | S_t, a_t ]
at any time t, the optimal behavior selection formula is:
a_t* = argmax_{a_t ∈ A} Q*(S_t, a_t)
where Q*(S_t, a_t) represents the optimal cost function, S_{t+1} represents the system state at time t+1, and a_{t+1} represents any of the behaviors the system may take at time t+1, i.e., a behavior in the behavior space A;
step 3, adopting the deep reinforcement learning BDQ algorithm to solve the model: by continuously exploring and learning the complex relations between the throughput of the blockchain system and the block size, the block-out time and the number of blockchain slices, slicing is finally performed according to the number of blockchain slices, and the nodes within each slice process transactions in parallel according to the block size and the block-out time, so that the number of transactions processed by the blockchain is maximized;
the running logic for performance optimization based on the BDQ algorithm is as follows:
1) Initialize an experience replay pool D of size N; the pool D stores the system state S_t of the blockchain slicing system at time t, the behavior a_t, the reward R_{t+1}, and the system state S_{t+1} of the blockchain system at the next time;
2) Initialize two networks with the same structure, an online network and a target network, whose weights are θ and θ^- respectively;
3) Set the initial time t = 0, and denote the time of a sample record in the experience replay pool D as τ;
4) Initialize the exploration probability ε of behaviors, the amount Δ_ε by which the exploration rate decreases with t, and the minimum exploration probability ε_min;
5) Start the loop body;
6) Obtain the current blockchain system state at time t, S_t = {R_t, C_t, H_t, P_t};
7) Select a behavior using the ε-greedy policy:
a_t = a behavior chosen at random from A with probability ε, and a_t = argmax_a Q(S_t, a), i.e. the behavior whose sub-behaviors have the largest Q values, with probability 1 - ε;
8) Execute the behavior a_t: the blockchain system performs slicing, processes transactions in parallel within the slices, packages the transactions into blocks, performs consensus on the blocks, and sends the blocks that pass consensus to the directory committee for final packaging and consensus; then obtain the system state S_{t+1} of the blockchain system at the next time and calculate R_{t+1};
9) Store (S_t, a_t, R_{t+1}, S_{t+1}) into the cache array;
10) Randomly extract Y sample records (S_τ, a_τ, R_{τ+1}, S_{τ+1}) from the cache array;
11) Using the Y records, calculate the Q sample values as follows:
Q_d(S_τ, a_d) = V(S_τ) + A_d(S_τ, a_d)
12) Update the neural network using the following loss function:
L(θ) = E[ Σ_d ( y_d - Q_d(S_τ, a_d) )^2 ]
where
y_d = R_{τ+1} + γ Q_d^-( S_{τ+1}, argmax_{a_d'} Q_d(S_{τ+1}, a_d') );
13) Update the exploration probability ε to the larger of ε - Δ_ε and ε_min;
14) If t mod C = 0, the target network copies the online network parameters;
15) Increase the time t by 1;
16) End the loop body;
the BDQ algorithm accumulates sample records (S_t, a_t, R_{t+1}, S_{t+1}) through continuous decisions and trains the neural network with these sample records so that the neural network approximates the cost function, thereby selecting the optimal behavior and maximizing the cumulative reward of the model, where S_{t+1} is the system state at time t+1;
the neural network of the BDQ algorithm has network branches in one-to-one correspondence with the sub-behaviors of the behavior space A and has a shared decision module, i.e., the hidden layers of the neural network; the system state S_t = {R_t, C_t, H_t, P_t} at time t is input into the neural network, the state is abstracted by the shared decision module, and the output is split into two branches, a state branch and a behavior branch; the behavior branch outputs the dominance function of each sub-behavior, i.e., the dominance function A_1(S_t, a_1) of the block size B, the dominance function A_2(S_t, a_2) of the block-out time TI, and the dominance function A_3(S_t, a_3) of the number of blockchain slices K, and the state branch outputs the state value function V(S_t); the dominance function and the state value function of each sub-behavior are combined to obtain the value function of that sub-behavior, and when the blockchain slicing system makes a decision it selects the corresponding behavior according to the output Q value of each sub-behavior;
the updating process of the neural network is to randomly extract minibatch-sized experiences from the experience pool and update the neural network parameters by gradient descent; the update formula of the BDQ algorithm for the loss function is:
L(θ) = E[ Σ_d ( y_d - Q_d(S_τ, a_d) )^2 ]
where y_d is defined as:
y_d = R_{τ+1} + γ Q_d^-( S_{τ+1}, argmax_{a_d'} Q_d(S_{τ+1}, a_d') )
that is, the sub-behavior a_d corresponding to the maximum Q value is selected in the online Q_d network according to the state S_{τ+1}, and the corresponding Q value is then taken from the target network Q_d^- according to that state-behavior pair; the cost function in the BDQ algorithm consists of the state value function V(S_t) and the dominance function A_d(S_t, a_d) of the behavior, and the cost function is:
Q_d(S_t, a_d) = V(S_t) + A_d(S_t, a_d)
Two neural networks with the same structure exist in the BDQ algorithm, wherein the online network is updated in real time, the target network is updated once every C steps, and the online network parameter value is assigned to the target network.
CN202210505118.4A 2022-05-10 2022-05-10 Block chain slicing system performance optimization method combining deep reinforcement learning Active CN115102867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210505118.4A CN115102867B (en) 2022-05-10 2022-05-10 Block chain slicing system performance optimization method combining deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210505118.4A CN115102867B (en) 2022-05-10 2022-05-10 Block chain slicing system performance optimization method combining deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115102867A CN115102867A (en) 2022-09-23
CN115102867B true CN115102867B (en) 2023-04-25

Family

ID=83287942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210505118.4A Active CN115102867B (en) 2022-05-10 2022-05-10 Block chain slicing system performance optimization method combining deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115102867B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702583B (en) * 2023-04-20 2024-03-19 北京科技大学 Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning
CN116506444B (en) * 2023-06-28 2023-10-17 北京科技大学 Block chain stable slicing method based on deep reinforcement learning and reputation mechanism

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112261674A (en) * 2020-09-30 2021-01-22 北京邮电大学 Performance optimization method of Internet of things scene based on mobile edge calculation and block chain collaborative enabling
CN113645702A (en) * 2021-07-30 2021-11-12 同济大学 Internet of things system supporting block chain and optimized by strategy gradient technology

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1860743A (en) * 2003-11-25 2006-11-08 飞思卡尔半导体公司 Network message processing using pattern matching
CN110389591A (en) * 2019-08-29 2019-10-29 哈尔滨工程大学 A kind of paths planning method based on DBQ algorithm
CN111132175B (en) * 2019-12-18 2022-04-05 西安电子科技大学 Cooperative computing unloading and resource allocation method and application
EP3985579A1 (en) * 2020-10-14 2022-04-20 Bayerische Motoren Werke Aktiengesellschaft Regional batching technique with reinforcement learning based decision controller for shared autonomous mobility fleet
CN113361706A (en) * 2021-05-18 2021-09-07 深圳大数点科技有限公司 Data processing method and system combining artificial intelligence application and block chain
CN113297310B (en) * 2021-06-15 2023-03-21 广东工业大学 Method for selecting block chain fragmentation verifier in Internet of things
CN113570039B (en) * 2021-07-22 2024-02-06 同济大学 Block chain system based on reinforcement learning optimization consensus

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112261674A (en) * 2020-09-30 2021-01-22 北京邮电大学 Performance optimization method of Internet of things scene based on mobile edge calculation and block chain collaborative enabling
CN113645702A (en) * 2021-07-30 2021-11-12 同济大学 Internet of things system supporting block chain and optimized by strategy gradient technology

Also Published As

Publication number Publication date
CN115102867A (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN115102867B (en) Block chain slicing system performance optimization method combining deep reinforcement learning
WO2021155713A1 (en) Weight grafting model fusion-based facial recognition method, and related device
CN112491818B (en) Power grid transmission line defense method based on multi-agent deep reinforcement learning
CN109388565B (en) Software system performance optimization method based on generating type countermeasure network
Mahfoud Finite Markov chain models of an alternative selection strategy for the genetic algorithm
CN112884149B (en) Random sensitivity ST-SM-based deep neural network pruning method and system
CN113691594B (en) Method for solving data imbalance problem in federal learning based on second derivative
CN113254719B (en) Online social network information propagation method based on status theory
CN109145107B (en) Theme extraction method, device, medium and equipment based on convolutional neural network
CN115374853A (en) Asynchronous federal learning method and system based on T-Step polymerization algorithm
CN106372101A (en) Video recommendation method and apparatus
CN116362329A (en) Cluster federation learning method and device integrating parameter optimization
CN115437795A (en) Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception
CN111626404A (en) Deep network model compression training method based on generation of antagonistic neural network
CN116975778A (en) Social network information propagation influence prediction method based on information cascading
CN116055209A (en) Network attack detection method based on deep reinforcement learning
CN115829029A (en) Channel attention-based self-distillation implementation method
CN108388942A (en) Information intelligent processing method based on big data
CN108417204A (en) Information security processing method based on big data
CN113342474B (en) Method, equipment and storage medium for predicting customer flow and training model
CN114611721A (en) Federal learning method, device, equipment and medium based on partitioned block chain
CN107256425B (en) Random weight network generalization capability improvement method and device
CN111210009A (en) Information entropy-based multi-model adaptive deep neural network filter grafting method, device and system and storage medium
CN113763167B (en) Blacklist mining method based on complex network
CN116702583B (en) Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant