CN115102867B - Block chain slicing system performance optimization method combining deep reinforcement learning - Google Patents

Block chain slicing system performance optimization method combining deep reinforcement learning

Info

Publication number
CN115102867B
CN115102867B
Authority
CN
China
Prior art keywords
time
block chain
behavior
slicing
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210505118.4A
Other languages
Chinese (zh)
Other versions
CN115102867A (en)
Inventor
万剑雄
姚冰冰
李雷孝
刘楚仪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University of Technology
Original Assignee
Inner Mongolia University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University of Technology filed Critical Inner Mongolia University of Technology
Priority to CN202210505118.4A priority Critical patent/CN115102867B/en
Publication of CN115102867A publication Critical patent/CN115102867A/en
Application granted Critical
Publication of CN115102867B publication Critical patent/CN115102867B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • H04L 41/142: Network analysis or design using statistical or mathematical methods
    • H04L 41/14: Network analysis or design
    • H04L 41/16: Network management using machine learning or artificial intelligence
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Pure & Applied Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Algebra (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The block chain slicing system performance optimization method combined with deep reinforcement learning establishes the block chain slice selection problem as a Markov decision process model consisting of four parts: system states, behaviors, rewards and a cost function. Solving the model means continuously selecting the optimal behavior in the dynamic block chain slicing system environment so as to maximize the throughput of the block chain slicing system. By continuously exploring and learning the complex relations between the block size, the block-out time, the number of block chain slices and the block chain slicing system, the BDQSB algorithm can select the most suitable slicing strategy according to the transmission rate among nodes, the computing capability of the nodes, the consensus history of the nodes and the probability of malicious nodes, and thereby improve the performance of the block chain slicing system. Compared with other schemes, the invention can further improve the performance of the block chain slicing system, solve the problem of behavior space explosion and reduce the training time cost of the neural network.

Description

Block chain slicing system performance optimization method combining deep reinforcement learning
Technical Field
The invention belongs to the technical field of data management and evidence storage, relates to intelligent control of blockchain system slicing, and particularly relates to a blockchain slicing system performance optimization method combined with deep reinforcement learning.
Background
Blockchain slicing means partitioning the nodes of a blockchain system into different slices. The transaction-processing capability of the blockchain is improved because the nodes within each slice process transactions in parallel, i.e., the performance of the blockchain is improved.
The blockchain can be sliced using a static optimization method, where "static" means that the slicing strategy used by the blockchain system is always fixed. However, the blockchain system varies over time, so a static optimization method does not fit the dynamic blockchain environment.
Currently, dynamic optimization methods are adopted for slicing the blockchain system; for example, a deep reinforcement learning algorithm dynamically provides a slicing strategy for the blockchain system. The reinforcement learning algorithm provides the optimal slicing strategy for the current system state of the blockchain, so that the throughput of the blockchain system is maximized.
The dynamic optimization method provides a slicing strategy according to the dynamic blockchain system environment and is therefore better suited to a dynamic blockchain system than a static optimization method. At present, when deep reinforcement learning is added to a blockchain slicing system, most research uses the DQN (Deep Q Network) algorithm to overcome the shortcomings of a static blockchain slicing strategy and the problem of state space explosion, but methods using the DQN algorithm cannot solve the problem of behavior space explosion caused by combining behaviors after the behavior dimensions expand.
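The scale of this behavior space explosion can be illustrated with a few lines of Python (the numbers of candidate values below are assumed for illustration): a DQN needs one output per joint combination of block size, block-out time and slice number, while a branching architecture such as BDQ needs only one output head per sub-behavior value.

# Hypothetical counts of candidate values for each behavior dimension; the
# actual discretisation is not specified at this point in the text.
n_block_sizes = 8      # candidate block sizes B
n_block_times = 8      # candidate block-out times TI
n_slice_counts = 8     # candidate slice numbers K

dqn_outputs = n_block_sizes * n_block_times * n_slice_counts   # 512 joint behaviors
bdq_outputs = n_block_sizes + n_block_times + n_slice_counts   # 24 branch outputs

print(dqn_outputs, bdq_outputs)   # multiplicative vs. additive growth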
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a method for optimizing the performance of a blockchain slicing system combined with deep reinforcement learning, which combines a blockchain slicing technology with a deep reinforcement learning BDQ algorithm in a dynamic blockchain environment so as to solve the problem of behavior space explosion caused by behavior combination after behavior dimension expansion and further solve the problem of low blockchain throughput.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the block chain slicing system performance optimization method combined with deep reinforcement learning comprises the following steps:
step 1, a block chain simulation system comprises N nodes, wherein all nodes have transmission rates, the nodes have computing power, and malicious nodes exist in the nodes;
step 2, establishing a Markov decision process model for the block chain slicing problem, wherein the model consists of four parts: the system state S_t, the behavior space A, the reward R_{t+1}, and the cost function Q(S_t, a_t);
the system state S_t at time t is defined as the set of transmission rates R_t between nodes, the set of computing capabilities C_t of the nodes, the set of consensus histories H_t of the nodes, and the probability P_t of malicious nodes;
the behavior space A comprises the block size B, the block-out time TI, and the number of blockchain slices K;
the reward R_{t+1} represents the reward obtained after the blockchain slicing system executes a behavior at time t, i.e., the benefit obtained by taking an action in the system state S_t at time t, namely the number of transactions processed by the blockchain per second;
the Markov decision process model is summarized as: in the system state S_t at any time t, the optimal behavior is selected so that the cumulative system reward is maximized, with the formula:

max E[ Σ_{t=0}^{∞} γ^t R_{t+1} ]

constrained to a_t ∈ A,

where a_t is the behavior taken by the system at time t and γ^t is the attenuation factor of R_{t+1} at time t;
step 3, adopting the deep reinforcement learning BDQ algorithm to solve the model: by continuously exploring and learning the complex relations between the throughput of the blockchain system and the block size, the block-out time and the number of blockchain slices, slicing is finally performed according to the number of blockchain slices, and the nodes within each slice process transactions in parallel according to the block size and the block-out time, so that the number of transactions processed by the blockchain is maximized.
Compared with the prior art, the invention has the beneficial effects that:
the algorithm provides an optimal slicing strategy for the block chain system by using a deep reinforcement learning BDQ algorithm according to the dynamically changed block chain system environment, and changes the original DQN algorithm into the BDQ algorithm. The invention can solve the problem that the neural network is difficult to train caused by the action space explosion, and can reduce the time cost of the neural network training. Compared with other schemes, the invention can further improve the performance of the block chain slicing system, solve the problem of behavior space explosion and reduce the training time cost of the neural network.
Drawings
FIG. 1 is a block chain slice simulation system block diagram.
Fig. 2 is a neural network structure diagram of the BDQ algorithm.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples.
The invention relates to a method for improving the performance of a blockchain slicing system combined with deep reinforcement learning. A Markov decision process model is established for the blockchain slicing problem, the deep reinforcement learning BDQ algorithm is provided as the core of the blockchain slicing strategy selection algorithm, and an optimal blockchain slicing selection strategy based on deep reinforcement learning (Branching Dueling Q-Network Shard-Based blockchain, BDQSB) is designed. The solution of the model constructed by the invention is to continuously select the optimal behavior over a series of system states, so that the cumulative reward of the system is maximized and the throughput of the blockchain is finally improved. Compared with other schemes, the invention can further improve the performance of the blockchain slicing system, solve the problem of behavior space explosion, and reduce the training time cost of the neural network.
FIG. 1 is the architecture diagram of the blockchain slicing simulation system, wherein the simulation system comprises N nodes, all nodes have transmission rates, the nodes have computing power, and malicious nodes exist among the nodes. The slicing process of the system is as follows: according to the number K of blockchain slices in the behavior, the nodes of the directory committee are selected first; the number of nodes in the directory committee is denoted C. The remaining N - C nodes outside the directory committee are then sliced: the nodes are divided into different slices according to the last L bits of the node ID, where L = log_2 K and the node ID and the slice number are binary-coded strings. After the blockchain slicing is completed, the blockchain system obtains K slices, and transactions are distributed to the different slices for processing; the nodes within each slice package the transactions into blocks of size B and broadcast the blocks to the other nodes in the slice for consensus, and a consensus history H is generated during the consensus process. The K slices send the blocks that pass verification within each slice to the directory committee, and the directory committee finally packages the K blocks into a final block and broadcasts it to the other nodes in the directory committee for final consensus, forming a consensus history. The probability P of malicious nodes in the blockchain can be calculated from the intra-slice consensus history and the consensus history in the directory committee. After the above process ends, the state of the blockchain system changes. A minimal sketch of this slicing step is given below.
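A minimal Python sketch of this slicing step (the committee-selection rule and the integer node-ID format are assumptions made for illustration; the text above only specifies that the remaining N - C nodes are assigned to slices by the last L = log_2 K bits of their binary node IDs):

import math
import random

def slice_nodes(node_ids, K, committee_size):
    """Assumed illustration: pick a directory committee, then assign every
    remaining node to one of K slices by the last L = log2(K) bits of its ID."""
    L = int(math.log2(K))
    committee = set(random.sample(node_ids, committee_size))   # selection rule assumed
    slices = {k: [] for k in range(K)}
    for nid in node_ids:
        if nid not in committee:
            slices[nid & ((1 << L) - 1)].append(nid)           # last L bits of the node ID
    return committee, slices

# Example with the numbers used in the embodiment below: N = 200 nodes, K = 4 slices,
# a 20-node directory committee, so 180 nodes are spread over the 4 slices.
committee, slices = slice_nodes(list(range(200)), K=4, committee_size=20)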
Fig. 2 is the neural network structure diagram of the BDQ (Branching Dueling Q-Network) algorithm. Existing research applying deep reinforcement learning algorithms to blockchain slicing systems mostly uses the DQN algorithm; compared with the traditional DQN algorithm, the BDQ algorithm provides a new neural network structure in which the behavior space has several sub-behaviors corresponding to several network branches and a shared decision module. The BDQ algorithm provides a certain degree of independence for each individual behavior dimension and has good scalability. The BDQ algorithm inputs the blockchain state S_t = (R_t, C_t, H_t, P_t) into the neural network; the state is abstracted by the shared decision module (i.e., the hidden layers of the neural network) and the output is split into two branches, a state branch and a behavior branch. The behavior branch outputs the dominance function of each sub-behavior, and the state branch outputs the state value function V(S_t). The dominance function and the state value function of each sub-behavior are combined to obtain the Q function of that sub-behavior, and when the blockchain slicing system makes a decision it selects the corresponding behavior according to the Q value of each sub-behavior.
The performance optimization method specifically comprises the following steps:
1. establishing a Markov decision process model for the block chain slicing problem, wherein the model comprises the following four parts:
State space: the state space is defined as the system state S_t, i.e., the set of transmission rates R_t between nodes, the set of computing capabilities C_t of the nodes, the set of consensus histories H_t of the nodes, and the probability P_t of malicious nodes, with the formula:

S_t = {R_t, C_t, H_t, P_t}

where R_t = {R_{i,j}}, i, j ∈ N, and R_{i,j} represents the transmission rate of the link between node i and node j; C_t = {C_i}, i ∈ N, and C_i is the computing resource of blockchain node i; H_t = {H_i}, where H_i is the consensus history of node i, H_i = 1 or H_i = 0, H_i = 1 indicating that node i's block verification is illegal and H_i = 0 indicating that node i's block verification is legal; P_t is calculated from the consensus history.
Behavior space: the behavior space is denoted A, which includes the block size B, the block-out time TI, and the number of blockchain slices K, with the formula:

A = {B, TI, K}

Reward function: the reward R_{t+1} represents the reward obtained after the blockchain slicing system executes a behavior at time t, i.e., the benefit obtained by taking an action in the system state S_t at time t, namely the number of transactions processed by the blockchain per second, with the formula:

R_{t+1} = K(B - B_H)/(b·TI)

where B_H is the size of the block header and b is the average size of a transaction. (B - B_H) represents the size of the transactions carried by each block, so (B - B_H)/b is the number of transactions per slice, and K(B - B_H)/b is the total number of transactions of the K slices. Dividing the total number of transactions by the block-out time gives the number of transactions processed by the blockchain per second, i.e., the transaction throughput.
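A small numeric sketch of this reward (the parameter values below are assumed for illustration only and are not taken from the description):

def throughput(K, B, TI, B_H, b):
    """Transactions per second: K slices, each block carries (B - B_H) worth of
    transactions of average size b, and a block is produced every TI seconds."""
    return K * (B - B_H) / (b * TI)

# Assumed example values: 4 slices, 4 MB blocks, 0.1 MB header,
# 0.002 MB average transaction size, 8 s block-out time.
print(throughput(K=4, B=4.0, TI=8.0, B_H=0.1, b=0.002))   # about 975 transactions per second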
Cost function: defined as Q(S_t, a_t), with the formula:

Q(S_t, a_t) = E[ Σ_{y=0}^{∞} γ^y R_{t+y+1} | S_t, a_t ]

The cost function Q(S_t, a_t) is also called the Q function; a_t ∈ A is the behavior taken by the system at time t; E denotes the expectation; y is the future time relative to time t; R_{t+y+1} represents the reward obtained after the system takes a behavior at time t+y; γ represents the attenuation factor, expressing how much taking a behavior in a certain state values the future rewards of the system, i.e., the environmental influence, with 0 ≤ γ < 1; γ^y, the y-th power of γ, is the attenuation factor of R_{t+y+1} at time t+y.
Thus, the established Markov decision process model can be summarized as: in the system state S_t at any time t, the optimal behavior is selected so that the cumulative system reward is maximized. The model formula is:

max E[ Σ_{t=0}^{∞} γ^t R_{t+1} ]

constrained to a_t ∈ A,

where γ^t is the attenuation factor of R_{t+1} at time t.
2. Model solution and solving algorithm
a. The solution of the model is to obtain the optimal cost function by calculation, i.e., according to the optimal cost function, the optimal behavior is selected in the system state S_t at any time t so that the cumulative reward is maximized. The optimal cost function is calculated as:

Q*(S_t, a_t) = E[ R_{t+1} + γ max_{a_{t+1}} Q*(S_{t+1}, a_{t+1}) | S_t, a_t ]

At any time t, the optimal behavior selection formula is:

a_t* = argmax_{a_t ∈ A} Q*(S_t, a_t)

where Q*(S_t, a_t) represents the optimal cost function, S_{t+1} represents the system state at time t+1, and a_{t+1} represents any of the behaviors the system may take at time t+1, i.e., a behavior in the behavior space A.

b. The optimal cost function is obtained by calculation and the optimal behavior is selected at each decision, so that the cumulative reward is maximized.
The solution algorithm of the invention is the deep reinforcement learning BDQ algorithm, which accumulates sample records (S_t, a_t, R_{t+1}, S_{t+1}) through continuous decisions and uses these sample records to train a neural network so that the neural network approximates the cost function, thereby selecting the optimal behavior and maximizing the cumulative reward of the model, where R_{t+1} is the reward obtained after the system takes a_t and S_{t+1} is the system state at time t+1.
In the neural network training process, the BDQ algorithm provides a new neural network structure, shown in FIG. 2. The behavior space has several sub-behaviors corresponding to several network branches, i.e., the network branches correspond one-to-one to the sub-behaviors of the behavior space A, and the BDQ algorithm has a shared decision module (the hidden layers of the neural network). The BDQ algorithm provides a certain degree of independence for each individual behavior dimension and has good scalability. The BDQ algorithm inputs the blockchain state S_t = (R_t, C_t, H_t, P_t) at time t into the neural network; the state is abstracted by the shared decision module and the output is split into two branches, a state branch and a behavior branch. The behavior branch outputs the dominance function of each sub-behavior, i.e., the dominance function A_1(S_t, a_1) of the block size B, the dominance function A_2(S_t, a_2) of the block-out time TI, and the dominance function A_3(S_t, a_3) of the number of blockchain slices K; the state branch outputs the state value function V(S_t). The dominance function and the state value function of each sub-behavior are combined to obtain the value function of that sub-behavior, and when the blockchain slicing system makes a decision it selects the corresponding behavior according to the output Q value of each sub-behavior.
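A minimal PyTorch sketch of such a branching dueling network (the layer widths, branch sizes and the mean-advantage aggregation are assumptions; the text above only fixes the overall structure: a shared decision module, a state-value branch, and one dominance branch per sub-behavior B, TI and K):

import torch
import torch.nn as nn

class BDQNetwork(nn.Module):
    """Shared decision module + state-value branch + one dominance (advantage)
    branch per sub-behavior, following the structure described for FIG. 2."""
    def __init__(self, state_dim, branch_sizes, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)                    # V(S_t)
        self.advantages = nn.ModuleList(
            [nn.Linear(hidden, n) for n in branch_sizes])    # A_d(S_t, a_d) per branch

    def forward(self, state):
        h = self.shared(state)
        v = self.value(h)                                    # shape (batch, 1)
        # Q_d(S_t, a_d) = V(S_t) + A_d(S_t, a_d); subtracting the per-branch mean
        # advantage is an assumed stabilisation borrowed from the BDQ literature.
        return [v + a - a.mean(dim=1, keepdim=True)
                for a in (adv(h) for adv in self.advantages)]

# Example: the state (R_t, C_t, H_t, P_t) flattened to a 64-dimensional vector
# (dimension assumed); three branches for B, TI and K with 8 candidates each.
net = BDQNetwork(state_dim=64, branch_sizes=[8, 8, 8])
q_per_branch = net(torch.zeros(1, 64))    # list of three (1, 8) Q-value tensors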
The updating process of the neural network is to randomly extract minibatch-sized experiences from the experience pool and update the neural network parameters by gradient descent. The update formula of the BDQ algorithm for the loss function is:

L(θ) = E[ Σ_d ( y_d - Q_d(S_τ, a_d) )^2 ]

where y_d is defined as:

y_d = R_{τ+1} + γ Q_d^-( S_{τ+1}, argmax_{a_d'} Q_d(S_{τ+1}, a_d') )

that is, the sub-behavior a_d corresponding to the maximum Q value is selected in the online Q_d network according to the state S_{τ+1}, and the corresponding Q value is then taken from the target network Q_d^- according to that state-behavior pair. The cost function in the BDQ algorithm consists of the state value function V(S_t) and the dominance function A_d(S_t, a_d) of the behavior; the cost function is:

Q_d(S_t, a_d) = V(S_t) + A_d(S_t, a_d)
two neural networks with the same structure exist in the BDQ algorithm, wherein the online network is updated in real time, the target network is updated once every C steps, and the online network parameter value is assigned to the target network.
3. By continuously exploring and learning the complex relation between the throughput of the blockchain system and the block size, the block-out time and the number of blockchain slices, slicing is finally carried out according to the number of blockchain slices, and the nodes in each slice process transactions in parallel according to the block size and the block-out time, so that the number of transactions processed by the blockchain is maximized and the performance of the blockchain is improved.
The running logic for performing intelligent control of performance optimization based on the BDQ algorithm is as follows:
1) Initialize an experience replay pool D of size N; the pool D stores the system state S_t of the blockchain slicing system at time t, the behavior a_t, the reward R_{t+1}, and the system state S_{t+1} of the blockchain system at the next time;
2) Initialize two networks with the same structure, an online network and a target network, whose weights are θ and θ^- respectively;
3) Set the initial time t = 0, and denote the time of a sample record in the experience replay pool D as τ;
4) Initialize the exploration probability ε of behaviors, the amount Δ_ε by which the exploration rate decreases with t, and the minimum exploration probability ε_min;
5) Start the loop body;
6) Obtain the current blockchain system state at time t, S_t = {R_t, C_t, H_t, P_t};
7) Select a behavior using the ε-greedy policy:
a_t = a behavior chosen at random from A with probability ε, and a_t = argmax_a Q(S_t, a), i.e. the behavior whose sub-behaviors have the largest Q values, with probability 1 - ε;
8) Execute the behavior a_t: the blockchain system performs slicing, processes transactions in parallel within the slices, packages the transactions into blocks, performs consensus on the blocks, and sends the blocks that pass consensus to the directory committee for final packaging and consensus; then obtain the system state S_{t+1} of the blockchain system at the next time and calculate R_{t+1};
9) Store (S_t, a_t, R_{t+1}, S_{t+1}) into the cache array;
10) Randomly extract Y sample records (S_τ, a_τ, R_{τ+1}, S_{τ+1}) from the cache array;
11) Using the Y records, calculate the Q sample values as follows:
Q_d(S_τ, a_d) = V(S_τ) + A_d(S_τ, a_d)
12) Update the neural network using the following loss function:
L(θ) = E[ Σ_d ( y_d - Q_d(S_τ, a_d) )^2 ]
where
y_d = R_{τ+1} + γ Q_d^-( S_{τ+1}, argmax_{a_d'} Q_d(S_{τ+1}, a_d') );
13) Update the exploration probability ε to the larger of ε - Δ_ε and ε_min, so that ε decreases over time but never falls below ε_min;
14) If t mod C = 0, the target network copies the online network parameters;
15) Increase the time t by 1;
16) End the loop body.
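A condensed Python sketch of steps 1) to 16) (the environment interface env, the pool and batch sizes, and the optimizer are assumptions; BDQNetwork refers to the illustrative network sketched after FIG. 2 above, and the plain sum over branches in the loss is likewise an assumption):

import random
from collections import deque
import torch

def train_bdqsb(env, online, target, optimizer,
                steps=10_000, eps=1.0, d_eps=1e-4, eps_min=0.05,
                gamma=0.9, C=100, Y=32):
    """Assumed sketch: epsilon-greedy behavior selection, an experience replay
    pool, per-branch targets y_d, and a target network copied every C steps."""
    D = deque(maxlen=100_000)                     # experience replay pool
    S_t = env.reset()                             # S_t = {R_t, C_t, H_t, P_t}, flattened
    for t in range(steps):
        with torch.no_grad():
            qs = online(torch.as_tensor(S_t, dtype=torch.float32).unsqueeze(0))
        # Step 7): epsilon-greedy, one sub-behavior per branch (B, TI, K)
        a_t = [random.randrange(q.shape[1]) if random.random() < eps
               else int(q.argmax(dim=1)) for q in qs]
        S_next, R_next = env.step(a_t)            # step 8): slice, consensus, observe reward
        D.append((S_t, a_t, R_next, S_next))      # step 9)
        if len(D) >= Y:
            batch = random.sample(list(D), Y)     # step 10): Y random sample records
            loss = torch.zeros(())
            for s, a, r, s2 in batch:
                s = torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)
                s2 = torch.as_tensor(s2, dtype=torch.float32).unsqueeze(0)
                q_s, q_s2, q_s2_tgt = online(s), online(s2), target(s2)
                for d in range(len(a)):           # steps 11)-12): per-branch TD target y_d
                    a_star = int(q_s2[d].argmax(dim=1))
                    y_d = r + gamma * q_s2_tgt[d][0, a_star].detach()
                    loss = loss + (y_d - q_s[d][0, a[d]]) ** 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        eps = max(eps - d_eps, eps_min)           # step 13): decay exploration probability
        if t % C == 0:
            target.load_state_dict(online.state_dict())   # step 14): copy parameters
        S_t = S_next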
The invention provides a specific blockchain slicing optimization embodiment, which comprises the following steps:
The blockchain slicing system has 200 nodes. The state S_t of the blockchain system at time t is input into the neural network of the BDQ algorithm, which outputs the Q values of the three sub-behaviors B, TI and K; the sub-behavior with the largest Q value in each branch is selected to form the behavior a_t to be executed by the blockchain slicing system. Suppose that, according to the number of blockchain slices K = 4 in the behavior a_t, the blockchain slicing system selects the nodes of the directory committee, whose number is C = 20. The remaining N - C = 180 nodes outside the directory committee are then divided into 4 slices of 45 nodes each. After the blockchain is sliced, transactions are distributed to the different slices for processing, and the nodes within each slice package the transactions into blocks of size B = 4 according to the block-out time TI = 8. The blockchain nodes are divided into 4 different slices, so the 4 slices process transactions in parallel, which improves the throughput of the blockchain slicing system. The BDQ algorithm provides a slicing strategy for the dynamically changing blockchain slicing system in real time, improving its performance. Compared with using the DQN algorithm, the BDQ algorithm yields higher blockchain throughput.
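The node arithmetic of this embodiment can be checked in a few lines (the committee size of 20 follows from N - C = 180 with N = 200, and L = log_2 K = 2):

N, K = 200, 4          # nodes and number of blockchain slices in the embodiment
C = 20                 # directory-committee size, so that N - C = 180
L = 2                  # log2(K) bits of the node ID used for slice assignment
print((N - C) // K)    # 45 nodes per slice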

Claims (1)

1. The block chain slicing system performance optimization method combined with deep reinforcement learning is characterized by comprising the following steps of:
step 1, a block chain simulation system comprises N nodes, wherein all nodes have transmission rates, the nodes have computing power, and malicious nodes exist in the nodes;
step 2, establishing a Markov decision process model for the block chain slicing problem, wherein the model consists of four parts: the system state S_t, the behavior space A, the reward R_{t+1}, and the cost function Q(S_t, a_t);
the system state S_t at time t is defined as the set of transmission rates R_t between nodes, the set of computing capabilities C_t of the nodes, the set of consensus histories H_t of the nodes, and the probability P_t of malicious nodes; the formula is:
S_t = {R_t, C_t, H_t, P_t}
where R_t = {R_{i,j}}, i, j ∈ N, R_{i,j} representing the transmission rate of the link between node i and node j; C_t = {C_i}, i ∈ N, C_i being the computing resource of blockchain node i; H_t = {H_i}, H_i being the consensus history of node i, H_i = 1 or H_i = 0, H_i = 1 indicating that node i's block verification is illegal and H_i = 0 indicating that node i's block verification is legal; P_t is calculated from the consensus history;
the behavior space A comprises the block size B, the block-out time TI, and the number of blockchain slices K; the formula is:
A = {B, TI, K}
the reward R_{t+1} represents the reward obtained after the blockchain slicing system executes a behavior at time t, i.e., the benefit obtained by taking an action in the system state S_t at time t, namely the number of transactions processed by the blockchain per second; the formula is:
R_{t+1} = K(B - B_H)/(b·TI)
where B_H is the size of the block header, b is the average size of a transaction, (B - B_H) represents the size of the transactions carried by each block, (B - B_H)/b represents the number of transactions per slice, K(B - B_H)/b represents the total number of transactions of the K slices, and dividing the total number of transactions by the block-out time gives the number of transactions processed by the blockchain per second, i.e., the transaction throughput;
the cost function Q(S_t, a_t) has the formula:
Q(S_t, a_t) = E[ Σ_{y=0}^{∞} γ^y R_{t+y+1} | S_t, a_t ]
where a_t ∈ A is the behavior taken by the system at time t, E denotes the expectation, y is the future time relative to time t, R_{t+y+1} represents the reward obtained after the system takes a behavior at time t+y, γ represents the attenuation factor, expressing how much taking a behavior in a certain state values the future rewards of the system, i.e., the environmental influence, with 0 ≤ γ < 1, and γ^y, the y-th power of γ, is the attenuation factor of R_{t+y+1} at time t+y;
the Markov decision process model is summarized as: in the system state S_t at any time t, the optimal behavior is selected so that the cumulative system reward is maximized, with the formula:
max E[ Σ_{t=0}^{∞} γ^t R_{t+1} ]
constrained to a_t ∈ A,
where a_t is the behavior taken by the system at time t and γ^t is the attenuation factor of R_{t+1} at time t;
wherein: the optimal cost function is obtained by calculation, i.e., according to the optimal cost function, the optimal behavior is selected in the system state S_t at any time t so that the cumulative reward is maximized; the calculation formula of the optimal cost function is:
Q*(S_t, a_t) = E[ R_{t+1} + γ max_{a_{t+1}} Q*(S_{t+1}, a_{t+1}) | S_t, a_t ]
at any time t, the optimal behavior selection formula is:
a_t* = argmax_{a_t ∈ A} Q*(S_t, a_t)
where Q*(S_t, a_t) represents the optimal cost function, S_{t+1} represents the system state at time t+1, and a_{t+1} represents any of the behaviors the system may take at time t+1, i.e., a behavior in the behavior space A;
step 3, adopting the deep reinforcement learning BDQ algorithm to solve the model: by continuously exploring and learning the complex relations between the throughput of the blockchain system and the block size, the block-out time and the number of blockchain slices, slicing is finally performed according to the number of blockchain slices, and the nodes within each slice process transactions in parallel according to the block size and the block-out time, so that the number of transactions processed by the blockchain is maximized;
the running logic for performance optimization based on the BDQ algorithm is as follows:
1) Initialize an experience replay pool D of size N; the pool D stores the system state S_t of the blockchain slicing system at time t, the behavior a_t, the reward R_{t+1}, and the system state S_{t+1} of the blockchain system at the next time;
2) Initialize two networks with the same structure, an online network and a target network, whose weights are θ and θ^- respectively;
3) Set the initial time t = 0, and denote the time of a sample record in the experience replay pool D as τ;
4) Initialize the exploration probability ε of behaviors, the amount Δ_ε by which the exploration rate decreases with t, and the minimum exploration probability ε_min;
5) Start the loop body;
6) Obtain the current blockchain system state at time t, S_t = {R_t, C_t, H_t, P_t};
7) Select a behavior using the ε-greedy policy:
a_t = a behavior chosen at random from A with probability ε, and a_t = argmax_a Q(S_t, a), i.e. the behavior whose sub-behaviors have the largest Q values, with probability 1 - ε;
8) Execute the behavior a_t: the blockchain system performs slicing, processes transactions in parallel within the slices, packages the transactions into blocks, performs consensus on the blocks, and sends the blocks that pass consensus to the directory committee for final packaging and consensus; then obtain the system state S_{t+1} of the blockchain system at the next time and calculate R_{t+1};
9) Store (S_t, a_t, R_{t+1}, S_{t+1}) into the cache array;
10) Randomly extract Y sample records (S_τ, a_τ, R_{τ+1}, S_{τ+1}) from the cache array;
11) Using the Y records, calculate the Q sample values as follows:
Q_d(S_τ, a_d) = V(S_τ) + A_d(S_τ, a_d)
12) Update the neural network using the following loss function:
L(θ) = E[ Σ_d ( y_d - Q_d(S_τ, a_d) )^2 ]
where
y_d = R_{τ+1} + γ Q_d^-( S_{τ+1}, argmax_{a_d'} Q_d(S_{τ+1}, a_d') );
13) Update the exploration probability ε to the larger of ε - Δ_ε and ε_min;
14) If t mod C = 0, the target network copies the online network parameters;
15) Increase the time t by 1;
16) End the loop body;
the BDQ algorithm accumulates sample records (S_t, a_t, R_{t+1}, S_{t+1}) through continuous decisions and trains the neural network with these sample records so that the neural network approximates the cost function, thereby selecting the optimal behavior and maximizing the cumulative reward of the model, where S_{t+1} is the system state at time t+1;
the neural network of the BDQ algorithm has network branches in one-to-one correspondence with the sub-behaviors of the behavior space A and has a shared decision module, i.e., the hidden layers of the neural network; the system state S_t = {R_t, C_t, H_t, P_t} at time t is input into the neural network, the state is abstracted by the shared decision module, and the output is split into two branches, a state branch and a behavior branch; the behavior branch outputs the dominance function of each sub-behavior, i.e., the dominance function A_1(S_t, a_1) of the block size B, the dominance function A_2(S_t, a_2) of the block-out time TI, and the dominance function A_3(S_t, a_3) of the number of blockchain slices K, and the state branch outputs the state value function V(S_t); the dominance function and the state value function of each sub-behavior are combined to obtain the value function of that sub-behavior, and when the blockchain slicing system makes a decision it selects the corresponding behavior according to the output Q value of each sub-behavior;
the updating process of the neural network is to randomly extract minibatch-sized experiences from the experience pool and update the neural network parameters by gradient descent; the update formula of the BDQ algorithm for the loss function is:
L(θ) = E[ Σ_d ( y_d - Q_d(S_τ, a_d) )^2 ]
where y_d is defined as:
y_d = R_{τ+1} + γ Q_d^-( S_{τ+1}, argmax_{a_d'} Q_d(S_{τ+1}, a_d') )
that is, the sub-behavior a_d corresponding to the maximum Q value is selected in the online Q_d network according to the state S_{τ+1}, and the corresponding Q value is then taken from the target network Q_d^- according to that state-behavior pair; the cost function in the BDQ algorithm consists of the state value function V(S_t) and the dominance function A_d(S_t, a_d) of the behavior, and the cost function is:
Q_d(S_t, a_d) = V(S_t) + A_d(S_t, a_d)
Two neural networks with the same structure exist in the BDQ algorithm, wherein the online network is updated in real time, the target network is updated once every C steps, and the online network parameter value is assigned to the target network.
CN202210505118.4A 2022-05-10 2022-05-10 Block chain slicing system performance optimization method combining deep reinforcement learning Active CN115102867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210505118.4A CN115102867B (en) 2022-05-10 2022-05-10 Block chain slicing system performance optimization method combining deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210505118.4A CN115102867B (en) 2022-05-10 2022-05-10 Block chain slicing system performance optimization method combining deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115102867A CN115102867A (en) 2022-09-23
CN115102867B true CN115102867B (en) 2023-04-25

Family

ID=83287942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210505118.4A Active CN115102867B (en) 2022-05-10 2022-05-10 Block chain slicing system performance optimization method combining deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115102867B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116702583B (en) * 2023-04-20 2024-03-19 北京科技大学 Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning
CN116506444B (en) * 2023-06-28 2023-10-17 北京科技大学 Block chain stable slicing method based on deep reinforcement learning and reputation mechanism

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112261674A (en) * 2020-09-30 2021-01-22 北京邮电大学 Performance optimization method of Internet of things scene based on mobile edge calculation and block chain collaborative enabling
CN113645702A (en) * 2021-07-30 2021-11-12 同济大学 Internet of things system supporting block chain and optimized by strategy gradient technology

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1860743A (en) * 2003-11-25 2006-11-08 飞思卡尔半导体公司 Network message processing using pattern matching
CN110389591A (en) * 2019-08-29 2019-10-29 哈尔滨工程大学 A kind of paths planning method based on DBQ algorithm
CN111132175B (en) * 2019-12-18 2022-04-05 西安电子科技大学 Cooperative computing unloading and resource allocation method and application
EP3985579A1 (en) * 2020-10-14 2022-04-20 Bayerische Motoren Werke Aktiengesellschaft Regional batching technique with reinforcement learning based decision controller for shared autonomous mobility fleet
CN113361706A (en) * 2021-05-18 2021-09-07 深圳大数点科技有限公司 Data processing method and system combining artificial intelligence application and block chain
CN113297310B (en) * 2021-06-15 2023-03-21 广东工业大学 Method for selecting block chain fragmentation verifier in Internet of things
CN113570039B (en) * 2021-07-22 2024-02-06 同济大学 Block chain system based on reinforcement learning optimization consensus

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112261674A (en) * 2020-09-30 2021-01-22 北京邮电大学 Performance optimization method of Internet of things scene based on mobile edge calculation and block chain collaborative enabling
CN113645702A (en) * 2021-07-30 2021-11-12 同济大学 Internet of things system supporting block chain and optimized by strategy gradient technology

Also Published As

Publication number Publication date
CN115102867A (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN115102867B (en) Block chain slicing system performance optimization method combining deep reinforcement learning
WO2021155713A1 (en) Weight grafting model fusion-based facial recognition method, and related device
CN112491818B (en) Power grid transmission line defense method based on multi-agent deep reinforcement learning
CN109388565B (en) Software system performance optimization method based on generating type countermeasure network
Mahfoud Finite Markov chain models of an alternative selection strategy for the genetic algorithm
CN112884149B (en) Random sensitivity ST-SM-based deep neural network pruning method and system
CN113691594B (en) Method for solving data imbalance problem in federal learning based on second derivative
CN113254719B (en) Online social network information propagation method based on status theory
CN109145107B (en) Theme extraction method, device, medium and equipment based on convolutional neural network
CN115374853A (en) Asynchronous federal learning method and system based on T-Step polymerization algorithm
CN106372101A (en) Video recommendation method and apparatus
CN116362329A (en) Cluster federation learning method and device integrating parameter optimization
CN115437795A (en) Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception
CN111626404A (en) Deep network model compression training method based on generation of antagonistic neural network
CN116975778A (en) Social network information propagation influence prediction method based on information cascading
CN116055209A (en) Network attack detection method based on deep reinforcement learning
CN115829029A (en) Channel attention-based self-distillation implementation method
CN108388942A (en) Information intelligent processing method based on big data
CN108417204A (en) Information security processing method based on big data
CN113342474B (en) Method, equipment and storage medium for predicting customer flow and training model
CN114611721A (en) Federal learning method, device, equipment and medium based on partitioned block chain
CN107256425B (en) Random weight network generalization capability improvement method and device
CN111210009A (en) Information entropy-based multi-model adaptive deep neural network filter grafting method, device and system and storage medium
CN113763167B (en) Blacklist mining method based on complex network
CN116702583B (en) Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant