CN116702583B - Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning - Google Patents

Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning

Info

Publication number
CN116702583B
CN116702583B
Authority
CN
China
Prior art keywords
block
nodes
internet
block chain
things
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310428183.6A
Other languages
Chinese (zh)
Other versions
CN116702583A (en)
Inventor
罗熊
马铃
李耀宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN202310428183.6A priority Critical patent/CN116702583B/en
Publication of CN116702583A publication Critical patent/CN116702583A/en
Application granted granted Critical
Publication of CN116702583B publication Critical patent/CN116702583B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • G06F18/295Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00Details relating to CAD techniques
    • G06F2111/08Probabilistic or stochastic CAD
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Hardware Design (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a device for optimizing blockchain performance under the Internet of Things based on deep reinforcement learning, and relates to the technical field of the Internet of Things. The method comprises: initializing a blockchain simulation system in an Internet of Things scenario; constructing a performance optimization model of the blockchain simulation system, wherein the performance optimization model is built as a Markov decision process model; and solving the performance optimization model with a deep reinforcement learning algorithm to obtain the optimal scalability configuration of the blockchain simulation system in the Internet of Things scenario. Based on the average transaction size, the computing resources of the nodes and the transmission rates between nodes, the invention uses a double deep Q-network algorithm to dynamically adjust the number of shards, the block size and the block interval, and obtains the optimal scalability configuration of the blockchain system without sacrificing other necessary performance indicators. Blockchain technology is introduced into the Internet of Things, performance optimization of the blockchain system is realized based on a deep reinforcement learning algorithm, and the requirements of the Internet of Things for high security and high efficiency are met.

Description

Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning
Technical Field
The invention relates to the technical field of the Internet of things, in particular to a method and a device for optimizing the performance of a blockchain under the Internet of things based on deep reinforcement learning.
Background
The Internet of Things extends and expands the traditional Internet: it is a vast network formed by connecting various information-sensing devices to the Internet. It is widely applied in traditional industries such as logistics and industrial production as well as in emerging fields such as smart homes and smart healthcare. The traditional Internet of Things mostly consists of distributed devices and centralized data-processing nodes. However, with the rapid development of mobile communication technologies represented by 5G, the surge of computing tasks in Internet of Things scenarios exposes data to security risks and incurs higher network latency and operating costs.
The rise of blockchain technology provides an effective solution to these problems. The first application of blockchain was Bitcoin, which ensures data security and efficiency by enabling anonymous, trusted transactions without intermediaries. Its impact, however, extends far beyond cryptocurrency. A blockchain is essentially a distributed storage system consisting of a sequence of time-stamped blocks maintained as a P2P (peer-to-peer) ledger. It therefore has the characteristics of decentralization, anonymity, security and tamper resistance. In Internet of Things scenarios, storing data securely and reliably on a blockchain guarantees the data security and processing efficiency of the whole system and enables more efficient data storage, exchange and management. An efficient consensus algorithm is key to applying blockchain technology to the Internet of Things. The PoW (Proof of Work) consensus algorithm is the earliest and most secure public-chain consensus algorithm; it is decentralized and highly secure and can meet the high-security requirement of the Internet of Things. However, PoW has a significant drawback: every node in the network must compute hash values over the block header, which wastes a great deal of resources. To solve this problem, the PoDL (Proof of Deep Learning) consensus algorithm was proposed, which replaces the meaningless hash collision with a deep learning task and thus avoids wasting resources. How to provide the scalability necessary to meet the high transaction throughput requirements of the Internet of Things nevertheless remains a challenge.
Currently, methods for improving the scalability of blockchain systems can be divided into on-chain and off-chain approaches. The first on-chain approach is sharding, which divides blockchain nodes into different shards so that each shard can process transactions in parallel. Another on-chain approach is parameter tuning, which improves system performance by adjusting parameters such as the block size and the block interval. Off-chain methods mainly adopt multi-chain techniques, reducing the load and redundancy of the main chain by migrating part of its workload to side chains. However, off-chain methods rely on an incompletely distributed, locally offline system in which malicious nodes can easily collude and link erroneous blocks into the system, thereby reducing the security and performance of the blockchain.
Blockchain systems are subject to the well-known trilemma: a blockchain system can simultaneously possess only two of the three properties of decentralization, security and scalability. The Bitcoin system, based on the PoW consensus algorithm, prioritizes decentralization and security and therefore sacrifices scalability. Conversely, most blockchain platforms in Internet of Things scenarios enhance scalability only by sacrificing performance indicators such as security and latency.
Blockchain applications in Internet of Things scenarios are dynamic and high-dimensional, and DRL (Deep Reinforcement Learning) algorithms have natural advantages in solving such complex optimization and decision problems. Deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning, and directly controls an agent's behavior by learning from high-dimensional perceptual input. Currently, most common optimization strategies employ the DQN (Deep Q Network) algorithm. However, DQN suffers from overestimation, i.e., the estimated value is larger than the true value. Therefore, finding the optimal scalability configuration of a blockchain system with a suitable DRL algorithm, without sacrificing other performance indicators, is of great significance.
Disclosure of Invention
The invention addresses the problem of finding the optimal scalability configuration of a blockchain system with a suitable DRL algorithm without sacrificing other performance indicators.
In order to solve the technical problems, the invention provides the following technical scheme:
In one aspect, the invention provides a method for optimizing the performance of a blockchain under the Internet of Things based on deep reinforcement learning, which is implemented by an electronic device and comprises the following steps:
s1, initializing a block chain simulation system in an Internet of things scene.
S2, constructing a performance optimization model of the block chain simulation system according to the block chain simulation system; wherein the performance optimization model is built as a Markov decision process model.
And S3, solving the performance optimization model by adopting a deep reinforcement learning algorithm to obtain the optimal expandability configuration of the blockchain simulation system in the scene of the Internet of things.
Optionally, the initializing the blockchain simulation system in the internet of things scene in S1 includes:
setting the total number N of nodes, the number F of malicious nodes and the average transaction size X of a blockchain simulation system in the scene of the Internet of things.
The N nodes all have computing resources, and a data path exists among all the N nodes.
The N nodes are divided into K slices, each of the K slices containing one full node to generate a block.
Optionally, the Markov decision process in S2 is a five-tuple (S, A, P, R, γ).
Wherein S is the set of states, and the state at decision time t is s_t = [X, C, D]_t, in which X denotes the average transaction size, C = {C_i} denotes the computing resources of node i, and D = {D_{i,j}} denotes the data transmission rate between node i and node j.
A is the set of actions, and the action at decision time t is a_t = [K, S_B, T_B]_t, in which K denotes the number of shards, S_B denotes the block size, and T_B denotes the block interval.
P is the state transition matrix, R is the reward function, and γ ∈ [0, 1] is the discount factor.
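For concreteness, the state and action defined above can be encoded as plain containers. The following Python sketch is illustrative only; the field names are assumptions made for readability and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class State:
    """State s_t = [X, C, D]_t of the performance-optimization MDP (illustrative)."""
    avg_tx_size: float                       # X: average transaction size
    compute: Dict[int, float]                # C = {C_i}: computing resource of node i
    rate: Dict[Tuple[int, int], float]       # D = {D_ij}: transmission rate between nodes i and j

@dataclass
class Action:
    """Action a_t = [K, S_B, T_B]_t: the scalability configuration applied at time t."""
    num_shards: int        # K: number of shards
    block_size: int        # S_B: block size (bytes per block)
    block_interval: float  # T_B: average time between blocks
```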
Optionally, the objective function of the performance optimization model of the blockchain simulation system in S2 is the expected cumulative discounted reward, as shown in the following formula (1):
max E[ Σ_t γ^t · r_t(s_t, a_t) ]   (1)
wherein E is the expectation operator, γ^t is the discount factor at decision time t, r_t is the reward generated by selecting action a_t in state s_t, s_t is the state at decision time t, and a_t is the action at decision time t.
Optionally, solving the performance optimization model in S3 by using a deep reinforcement learning algorithm includes:
s31, initializing an experience playback pool B, a current network and a target network.
S32, initializing the parameters of the deep reinforcement learning algorithm; wherein the parameters include the exploration probability ε and the maximum number of rounds T.
S33, starting the loop and initializing the state s_t.
S34, feeding the state s_t into the current network as input and selecting action a_t with an ε-greedy policy.
S35, executing action a_t in state s_t to obtain the new state s_{t+1} and the reward r_t.
S36, storing the quadruple (s_t, a_t, r_t, s_{t+1}) in the experience replay pool B.
S37, randomly sampling a batch of experience tuples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool B for learning, calculating the target value y_i, and updating the parameter ω of the current network by gradient back-propagation.
S38, setting a fixed interval C and, after every C iterations, copying the parameter ω of the current network to the target network to update the target network parameter ω⁻.
S39, repeating steps S33 to S38 until the maximum number of rounds T is reached, then ending the loop.
Optionally, the current network and the target network in S31 have the same network structure.
The parameters of the current network and the target network are ω and ω⁻, respectively.
Optionally, executing action a_t in state s_t in S35 to obtain the new state s_{t+1} and the reward r_t further comprises:
completing the consensus verification with the S-PoDL (Separate Proof of Deep Learning, a sharded deep-learning proof) consensus algorithm.
Optionally, storing the quadruple (s_t, a_t, r_t, s_{t+1}) in the experience replay pool B in S36 further comprises:
when the experience information stored in the experience replay pool B reaches the maximum capacity and new experience information arrives, popping and deleting the experience information that entered the pool first so that the new experience information can be recorded.
Optionally, the target value y_i calculated in S37 is given by the following formula (2):
y_i = r_i + γ · Q(s_{i+1}, argmax_a Q(s_{i+1}, a; ω); ω⁻)   (2)
wherein r_i is the reward generated by selecting action a_i in state s_i, γ is the discount factor, s_{i+1} is the state at decision time i+1, and a_i is the action at decision time i.
In another aspect, the invention provides a device for optimizing the performance of a blockchain under the Internet of Things based on deep reinforcement learning, which is applied to implement the above method and comprises:
and the initialization module is used for initializing the blockchain simulation system in the scene of the Internet of things.
The building module is used for building a performance optimization model of the block chain simulation system according to the block chain simulation system; wherein the performance optimization model is built as a Markov decision process model.
And the output module is used for solving the performance optimization model by adopting a deep reinforcement learning algorithm to obtain the optimal expandability configuration of the blockchain simulation system in the scene of the Internet of things.
Optionally, the initialization module is further configured to:
setting the total number of nodes N, the number of malicious nodes F and the average transaction size X of the blockchain simulation system in the Internet of Things scenario.
The N nodes all have computing resources, and a data path exists among all the N nodes.
The N nodes are divided into K slices, each of the K slices containing one full node to generate a block.
Optionally, the Markov decision process is a five-tuple (S, A, P, R, γ).
Wherein S is the set of states, and the state at decision time t is s_t = [X, C, D]_t, in which X denotes the average transaction size, C = {C_i} denotes the computing resources of node i, and D = {D_{i,j}} denotes the data transmission rate between node i and node j.
A is the set of actions, and the action at decision time t is a_t = [K, S_B, T_B]_t, in which K denotes the number of shards, S_B denotes the block size, and T_B denotes the block interval.
P is the state transition matrix, R is the reward function, and γ ∈ [0, 1] is the discount factor.
Optionally, the objective function of the performance optimization model of the blockchain simulation system is the expected cumulative discounted reward, as shown in the following formula (1):
max E[ Σ_t γ^t · r_t(s_t, a_t) ]   (1)
wherein E is the expectation operator, γ^t is the discount factor at decision time t, r_t is the reward generated by selecting action a_t in state s_t, s_t is the state at decision time t, and a_t is the action at decision time t.
Optionally, the output module is further configured to:
s31, initializing an experience playback pool B, a current network and a target network.
S32, initializing the parameters of the deep reinforcement learning algorithm; wherein the parameters include the exploration probability ε and the maximum number of rounds T.
S33, starting the loop and initializing the state s_t.
S34, feeding the state s_t into the current network as input and selecting action a_t with an ε-greedy policy.
S35, executing action a_t in state s_t to obtain the new state s_{t+1} and the reward r_t.
S36, storing the quadruple (s_t, a_t, r_t, s_{t+1}) in the experience replay pool B.
S37, randomly sampling a batch of experience tuples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool B for learning, calculating the target value y_i, and updating the parameter ω of the current network by gradient back-propagation.
S38, setting a fixed interval C and, after every C iterations, copying the parameter ω of the current network to the target network to update the target network parameter ω⁻.
S39, repeating steps S33 to S38 until the maximum number of rounds T is reached, then ending the loop.
Optionally, the network structures of the current network and the target network are the same.
The parameters of the current network and the target network are ω and ω⁻, respectively.
Optionally, the output module is further configured to:
completing the consensus verification by adopting the S-PoDL consensus algorithm.
Optionally, the output module is further configured to:
when the experience information stored in the experience replay pool B reaches the maximum capacity and new experience information arrives, popping and deleting the experience information that entered the pool first so that the new experience information can be recorded.
Optionally, the target value y_i is calculated as shown in the following formula (2):
y_i = r_i + γ · Q(s_{i+1}, argmax_a Q(s_{i+1}, a; ω); ω⁻)   (2)
wherein r_i is the reward generated by selecting action a_i in state s_i, γ is the discount factor, s_{i+1} is the state at decision time i+1, and a_i is the action at decision time i.
In one aspect, an electronic device is provided, which includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the above-mentioned deep reinforcement learning-based method for optimizing the performance of a blockchain under the internet of things.
In one aspect, a computer readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the above-mentioned method for optimizing the performance of a blockchain under the internet of things based on deep reinforcement learning.
Compared with the prior art, the technical scheme has at least the following beneficial effects:
according to the scheme, the block chain system performance optimization method based on deep reinforcement learning in the Internet of things scene is provided. Specifically, the invention quantifies the performance of the blockchain system in the scene of the Internet of things from three aspects of expandability, safety and time delay, and obtains a more comprehensive optimization scheme. Then, the performance of the block chain system is improved by adopting a slicing mechanism and a parameter adjustment technology, and the high expandability requirement of the Internet of things system is met. In order to obtain optimal expandability configuration without sacrificing other necessary performance indexes, the invention adopts a DDQN (Double Deep Q Network, double-depth Q network) algorithm to dynamically optimize the performance of the system, and the algorithm uses different networks to calculate target values, decouples the selection and evaluation of actions, and solves the inherent overestimation problem of the DQN.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a block chain performance optimization method under the Internet of things based on deep reinforcement learning provided by the embodiment of the invention;
fig. 2 is a schematic diagram of a block chain system in an internet of things scenario according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an architecture of an S-PoDL consensus algorithm according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a deep reinforcement learning algorithm according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a performance optimization method according to an embodiment of the present invention;
FIG. 6 is a block diagram of a block chain performance optimization device based on deep reinforcement learning under the Internet of things, provided by an embodiment of the invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without creative efforts, based on the described embodiments of the present invention fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the invention provides a method for optimizing blockchain performance under the Internet of Things based on deep reinforcement learning. As illustrated by the flow chart in Fig. 1, the processing flow of the method can comprise the following steps:
Fig. 2 shows that the blockchain system in the Internet of Things scenario of the present invention comprises an Internet of Things network and a blockchain system. In the Internet of Things, intelligent devices such as sensors, monitoring devices and personal terminals are responsible for collecting data and connecting to other devices for data sharing. The hierarchical structure of the Internet of Things can be divided into three layers from bottom to top: the perception layer, the network layer and the application layer. The perception layer is mainly responsible for data collection, and the network layer uses wireless or wired networks to store and share the data information from the perception layer. The application layer processes the obtained data through a cloud computing platform and provides users with data-based applications. Consequently, many kinds of transactions exist in the Internet of Things network, such as data storage, processing and sharing.
With the explosive growth of Internet of Things data, blockchain systems that process transactions securely and reliably are widely used. In the blockchain system, sharding is adopted to process a large number of transactions in parallel, which improves the processing efficiency of Internet of Things data. When a transaction generated by the Internet of Things is securely transmitted to the blockchain system and stored in the distributed ledger, the blockchain completes the transaction request through the following steps. First, all nodes in the blockchain system are partitioned into different shards, each containing one full node to generate blocks. Second, after each shard completes consensus verification using the S-PoDL consensus algorithm, the new block is linked into the blockchain network.
As shown in fig. 3, the S-PoDL consensus algorithm of the present invention is split into two phases.
Stage 1: First, the model requester publishes multiple deep learning models and training sets to the nodes in different shards. The purpose of the model requester is to obtain an optimal training model, so the present invention assumes that the model requester is honest. Next, all nodes start training the model without having obtained the test set, which effectively prevents overfitting. After a node completes training, it generates a block header according to the rules of the underlying blockchain system. Finally, the node submits the block header to the corresponding full node.
Stage 2: First, the model requester publishes the test set to the nodes in different shards; each node calculates the accuracy of its trained model and generates a block. Next, the node submits the block, containing the accuracy, together with the trained model to the full node. The full node then verifies the validity of the block by comparing the hash values submitted in the two stages; in this process, the full node ignores any model for which no block header was received in Stage 1. The eligible blocks are then sorted in descending order of accuracy. Finally, the full node verifies the submitted accuracies in that order and accepts the first block that passes verification.
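To make the two stages concrete, the following Python sketch outlines the control flow described above for a single shard. It is only an illustration under assumed interfaces: `train_fn`, `test_fn` and `header_hash` are hypothetical placeholders, and a real implementation would involve networking, signatures and the block rules of the underlying blockchain.

```python
import hashlib
import pickle
from typing import Callable, Dict, List, Tuple

def header_hash(node: str, model: object) -> str:
    """Hypothetical block-header hash over the node id and its trained model."""
    return hashlib.sha256(node.encode() + pickle.dumps(model)).hexdigest()

def s_podl_round(shard_nodes: List[str],
                 train_fn: Callable[[str], object],
                 test_fn: Callable[[object], float]) -> Tuple[str, float]:
    """One S-PoDL consensus round inside a single shard (illustrative only)."""
    # Stage 1: every node trains on the published training set (no test set yet)
    # and submits a block header to the shard's full node.
    models: Dict[str, object] = {}
    stage1_headers: Dict[str, str] = {}
    for node in shard_nodes:
        models[node] = train_fn(node)
        stage1_headers[node] = header_hash(node, models[node])

    # Stage 2: the test set is released; each node reports its accuracy and a block.
    # The full node keeps only blocks whose hash matches the Stage-1 submission.
    candidates = []
    for node in shard_nodes:
        accuracy = test_fn(models[node])
        if stage1_headers.get(node) != header_hash(node, models[node]):
            continue  # no (or mismatched) Stage-1 header: ignore this model
        candidates.append((accuracy, node))

    # Sort eligible blocks by accuracy in descending order and accept the first
    # one whose claimed accuracy the full node can verify.
    for accuracy, node in sorted(candidates, reverse=True):
        if abs(test_fn(models[node]) - accuracy) < 1e-9:  # full node re-checks the claim
            return node, accuracy
    raise RuntimeError("no valid block produced in this round")
```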
Referring to fig. 4 and 5, the method for optimizing the performance of the blockchain under the internet of things based on the deep reinforcement learning provided by the invention specifically comprises the following steps:
s1, initializing a block chain simulation system in an Internet of things scene.
Optionally, the step S1 may include:
setting the total number N of nodes, the number F of malicious nodes and the average transaction size X of a blockchain simulation system in the scene of the Internet of things.
The N nodes all have computing resources, and a data path exists among all the N nodes.
The N nodes are divided into K slices, each of the K slices containing one full node to generate a block.
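As a concrete illustration of this initialization, a toy simulation setup might look like the sketch below; the numeric ranges for computing resources and transmission rates are assumptions made for the simulation and are not specified by the patent.

```python
import random
from typing import Dict, List, Tuple

def init_simulation(n_nodes: int, n_malicious: int, avg_tx_size: float, k_shards: int) -> dict:
    """Initialize a toy blockchain simulation: N nodes, F malicious nodes, K shards."""
    nodes = list(range(n_nodes))
    malicious = set(random.sample(nodes, n_malicious))

    # Every node has some computing resource, and a data path exists between all node pairs.
    compute: Dict[int, float] = {i: random.uniform(1.0, 10.0) for i in nodes}
    rate: Dict[Tuple[int, int], float] = {(i, j): random.uniform(1.0, 100.0)
                                          for i in nodes for j in nodes if i != j}

    # Divide the N nodes into K shards; the first node of each shard acts as its full node.
    shards: List[List[int]] = [nodes[s::k_shards] for s in range(k_shards)]
    full_nodes = [shard[0] for shard in shards]

    return {"avg_tx_size": avg_tx_size, "malicious": malicious, "compute": compute,
            "rate": rate, "shards": shards, "full_nodes": full_nodes}

# Example: 100 nodes, 10 of them malicious, 512-byte average transactions, 4 shards.
cfg = init_simulation(n_nodes=100, n_malicious=10, avg_tx_size=512.0, k_shards=4)
```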
S2, constructing a performance optimization model of the block chain simulation system according to the block chain simulation system.
Wherein the performance optimization model is built as a Markov decision process model.
Optionally, the Markov decision process in S2 is a five-tuple (S, A, P, R, γ).
Wherein S is the set of states, and the state at decision time t is s_t = [X, C, D]_t, in which X denotes the average transaction size, C = {C_i} denotes the computing resources of node i, and D = {D_{i,j}} denotes the data transmission rate between node i and node j.
A is the set of actions, and the action at decision time t is a_t = [K, S_B, T_B]_t, in which K denotes the number of shards, S_B denotes the block size (the number of bytes contained in each block), and T_B denotes the block interval (the average time required to generate a new block).
P is the state transition matrix, satisfying P(t) = Pr[D_{i,j}(t+1) = D_d | D_{i,j}(t) = D_c], with D_c, D_d ∈ D.
R is the reward function; at decision time t, r_t denotes the reward generated by selecting action a_t in state s_t. At decision time t, the reward of a shard configuration that satisfies the constraints is defined by the following equation (1):
γ is the discount factor, with γ ∈ [0, 1].
Further, the action decisions in the Markov decision process have the Markov property, i.e., the next state s_{t+1} depends only on the current state s_t and is independent of the history states, satisfying formula (2):
P(s_{t+1} | s_t) = P(s_{t+1} | s_t, s_{t-1}, ..., s_1, s_0)   (2)
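The transition matrix P can be simulated by treating each link's transmission rate D_{i,j} as a finite-state Markov chain. The sketch below shows one way to sample the next rate level; the rate levels and the transition matrix values are illustrative assumptions only.

```python
import numpy as np

# Assumed discrete rate levels D = {D_1, D_2, D_3} and a row-stochastic matrix P,
# where P[c, d] = Pr[D_ij(t+1) = D_d | D_ij(t) = D_c].
rate_levels = np.array([10.0, 50.0, 100.0])
P = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

def next_rate_index(current: int, rng: np.random.Generator) -> int:
    """Sample the index of the next rate level for one link."""
    return int(rng.choice(len(rate_levels), p=P[current]))

rng = np.random.default_rng(0)
idx = 1                                   # D_ij(t) = 50.0
idx = next_rate_index(idx, rng)           # D_ij(t+1) drawn from row P[1, :]
print(rate_levels[idx])
```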
Optionally, the optimization objective of the model is to maximize the scalability of the blockchain system without sacrificing security and latency, and the objective function is set to the following formula (3):
max_{a∈A} Q(s, a)   (3)
wherein Q(s, a) is the action-value function, which can be expressed as the following formula (4):
Q(s, a) = E[ Σ_t γ^t · r_t(s_t, a_t) | s_0 = s, a_0 = a ]   (4)
wherein E is the expectation operator, γ^t is the discount factor at decision time t, r_t is the reward generated by selecting action a_t in state s_t, s_t is the state at decision time t, and a_t is the action at decision time t.
And S3, solving the performance optimization model by adopting a deep reinforcement learning algorithm to obtain the optimal expandability configuration of the blockchain simulation system in the scene of the Internet of things.
In one possible implementation, a deep reinforcement learning algorithm is used to solve the model according to the optimization objective. Currently, most of the common optimization strategies employ DQN algorithms. However, the DQN algorithm has a problem of overestimation, i.e., the estimated value is larger than the true value. Therefore, the invention adopts the DDQN algorithm to dynamically optimize the performance of the system, the algorithm uses different networks to calculate the target value, the selection and the evaluation of the action are decoupled, and the inherent overestimation problem of the DQN is solved. Solving the model using the deep reinforcement learning algorithm may include the following steps S31-S39:
s31, initializing an experience playback pool B, a current network and a target network.
Optionally, the current network and the target network in S31 have the same network structure.
The parameters of the current network and the target network are ω and ω⁻, respectively.
S32, initializing parameters of a deep reinforcement learning algorithm; wherein the parameters include the exploration probability epsilon and the maximum round number T.
S33, starting the loop and initializing the state s_t.
S34, feeding the state s_t into the current network as input and selecting action a_t with an ε-greedy policy.
In a possible implementation, the blockchain environment in the Internet of Things scenario provides the current network with the state s_t ∈ S at decision time t, i.e., the average transaction size, the computing resources of the nodes and the data transmission rates between the nodes. The current network outputs all possible actions and their corresponding values, and the ε-greedy strategy of formula (5) selects action a_t: with probability 1 − ε the action with the largest estimated value, a_t = argmax_a Q(s_t, a; ω), is chosen, and with probability ε a random action is chosen for exploration.
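A minimal sketch of this ε-greedy selection, assuming the current network is available as a callable `q_network(state)` that returns one estimated value per discrete candidate action (an assumed interface, not part of the patent):

```python
import numpy as np

def select_action(q_network, state, epsilon: float, n_actions: int,
                  rng: np.random.Generator) -> int:
    """epsilon-greedy selection over a discretized (K, S_B, T_B) action space."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: pick a random configuration
    q_values = q_network(state)               # exploit: estimated value of every action
    return int(np.argmax(q_values))
```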
S35, executing action a_t in state s_t to obtain the new state s_{t+1} and the reward r_t.
In a possible implementation, executing action a_t means selecting the number of shards, the block size and the block interval. After action a_t is executed, the S-PoDL consensus algorithm is adopted to complete the consensus verification, and its multi-stage collaborative strategy avoids wasting resources.
S36, storing the quadruple (s_t, a_t, r_t, s_{t+1}) in the experience replay pool B.
In a possible implementation, the experience information stored in the experience replay pool B consists of the state s_t, the action a_t, the reward r_t and the new state s_{t+1} at decision time t. When the stored experience information reaches the maximum capacity of the pool and new experience information arrives, the experience information that entered the pool first is popped and deleted so that the new experience information can be recorded.
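This first-in-first-out behaviour is exactly what a bounded deque provides; a minimal Python sketch of such an experience replay pool follows (capacity and field layout are illustrative).

```python
import random
from collections import deque

class ReplayPool:
    """Experience replay pool B with FIFO eviction once its capacity is reached."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)   # oldest quadruple is dropped automatically

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        """Randomly draw a batch of experience quadruples for learning (step S37)."""
        return random.sample(self.buffer, batch_size)

pool = ReplayPool(capacity=10_000)
```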
S37, randomly sampling a batch of experience tuples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool B for learning, calculating the target value y_i, and updating the parameter ω of the current network by gradient back-propagation.
Optionally, the target value y_i calculated in S37 is given by the following formula (6):
y_i = r_i + γ · Q(s_{i+1}, argmax_a Q(s_{i+1}, a; ω); ω⁻)   (6)
wherein r_i is the reward generated by selecting action a_i in state s_i, γ is the discount factor, s_{i+1} is the state at decision time i+1, and a_i is the action at decision time i.
In a possible implementation, the parameter ω of the current network is updated in a gradient back-propagation manner, and the loss function is defined as the following formula (7):
L = || y_i − Q(s_i, a_i; ω) ||²   (7)
S38, setting a fixed interval C and, after every C iterations, copying the parameter ω of the current network to the target network to update the target network parameter ω⁻.
S39, repeating steps S33 to S38 until the maximum number of rounds T is reached, then ending the loop.
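The following PyTorch sketch ties steps S34 to S38 together for one update: it computes the DDQN target of formula (6), the loss of formula (7), and the periodic copy of ω into the target network. Network sizes, the learning rate and the batch handling are illustrative assumptions, not values taken from the patent.

```python
import copy
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Maps a state vector to one Q-value per discrete (K, S_B, T_B) combination."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, x):
        return self.net(x)

def ddqn_update(current, target, optimizer, batch, gamma: float) -> float:
    """One gradient step using the DDQN target (formula (6)) and squared loss (formula (7))."""
    states, actions, rewards, next_states = batch            # mini-batch sampled from pool B
    q_sa = current(states).gather(1, actions.unsqueeze(1)).squeeze(1)      # Q(s_i, a_i; w)
    with torch.no_grad():
        best_a = current(next_states).argmax(dim=1, keepdim=True)          # argmax_a Q(s_{i+1}, a; w)
        y = rewards + gamma * target(next_states).gather(1, best_a).squeeze(1)  # evaluated with w-
    loss = nn.functional.mse_loss(q_sa, y)                    # mean of ||y_i - Q(s_i, a_i; w)||^2
    optimizer.zero_grad()
    loss.backward()                                           # gradient back-propagation on w
    optimizer.step()
    return loss.item()

# Illustrative setup: the target network starts as a copy of the current network.
state_dim, n_actions = 8, 27
current = QNet(state_dim, n_actions)
target = copy.deepcopy(current)
optimizer = torch.optim.Adam(current.parameters(), lr=1e-3)

C = 100  # every C iterations (step S38), copy the current parameters w into the target network
def sync_target(step: int) -> None:
    if step % C == 0:
        target.load_state_dict(current.state_dict())
```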
The invention provides a method for optimizing blockchain system performance based on deep reinforcement learning in Internet of Things scenarios; it adopts the DDQN algorithm to search for the optimal scalability configuration of the blockchain system in the Internet of Things scenario while taking necessary performance indicators such as system security and latency into account, thereby meeting the high-security and high-efficiency requirements of Internet of Things applications.
The embodiment of the invention provides a method for optimizing blockchain system performance based on deep reinforcement learning in Internet of Things scenarios. Specifically, the invention quantifies the performance of the blockchain system in the Internet of Things scenario from three aspects, namely scalability, security and latency, yielding a more comprehensive optimization scheme. A sharding mechanism and parameter tuning are then adopted to improve the performance of the blockchain system and meet the high-scalability requirement of the Internet of Things system. To obtain the optimal scalability configuration without sacrificing other necessary performance indicators, the invention uses the DDQN algorithm to dynamically optimize system performance; the algorithm uses different networks to compute the target value, decoupling action selection from action evaluation and resolving the overestimation problem inherent in DQN.
As shown in fig. 6, an embodiment of the present invention provides a device 600 for optimizing performance of a blockchain under the internet of things based on deep reinforcement learning, where the device 600 is applied to implement a method for optimizing performance of a blockchain under the internet of things based on deep reinforcement learning, and the device 600 includes:
an initialization module 610 is configured to initialize a blockchain simulation system in an internet of things scenario.
A construction module 620, configured to construct a performance optimization model of the blockchain simulation system according to the blockchain simulation system; wherein the performance optimization model is built as a Markov decision process model.
And the output module 630 is configured to solve the performance optimization model by using a deep reinforcement learning algorithm, so as to obtain an optimal extensibility configuration of the blockchain simulation system in the scene of the internet of things.
Optionally, the initialization module 610 is further configured to:
setting the total number N of nodes, the number F of malicious nodes and the average transaction size X of a blockchain simulation system in the scene of the Internet of things.
The N nodes all have computing resources, and a data path exists among all the N nodes.
The N nodes are divided into K slices, each of the K slices containing one full node to generate a block.
Optionally, the Markov decision process is a five-tuple (S, A, P, R, γ).
Wherein S is the set of states, and the state at decision time t is s_t = [X, C, D]_t, in which X denotes the average transaction size, C = {C_i} denotes the computing resources of node i, and D = {D_{i,j}} denotes the data transmission rate between node i and node j.
A is the set of actions, and the action at decision time t is a_t = [K, S_B, T_B]_t, in which K denotes the number of shards, S_B denotes the block size, and T_B denotes the block interval.
P is the state transition matrix, R is the reward function, and γ ∈ [0, 1] is the discount factor.
Optionally, the objective function of the performance optimization model of the blockchain simulation system is the expected cumulative discounted reward, as shown in the following formula (1):
max E[ Σ_t γ^t · r_t(s_t, a_t) ]   (1)
wherein E is the expectation operator, γ^t is the discount factor at decision time t, r_t is the reward generated by selecting action a_t in state s_t, s_t is the state at decision time t, and a_t is the action at decision time t.
Optionally, the output module 630 is further configured to:
s31, initializing an experience playback pool B, a current network and a target network.
S32, initializing the parameters of the deep reinforcement learning algorithm; wherein the parameters include the exploration probability ε and the maximum number of rounds T.
S33, starting the loop and initializing the state s_t.
S34, feeding the state s_t into the current network as input and selecting action a_t with an ε-greedy policy.
S35, executing action a_t in state s_t to obtain the new state s_{t+1} and the reward r_t.
S36, storing the quadruple (s_t, a_t, r_t, s_{t+1}) in the experience replay pool B.
S37, randomly sampling a batch of experience tuples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool B for learning, calculating the target value y_i, and updating the parameter ω of the current network by gradient back-propagation.
S38, setting a fixed interval C and, after every C iterations, copying the parameter ω of the current network to the target network to update the target network parameter ω⁻.
S39, repeating steps S33 to S38 until the maximum number of rounds T is reached, then ending the loop.
Optionally, the network structures of the current network and the target network are the same.
The parameters of the current network and the target network are ω and ω⁻, respectively.
Optionally, the output module 630 is further configured to:
completing the consensus verification by adopting the slice deep learning proof (S-PoDL) consensus algorithm.
Optionally, the output module 630 is further configured to:
when the experience information stored in the experience replay pool B reaches the maximum capacity and new experience information arrives, popping and deleting the experience information that entered the pool first so that the new experience information can be recorded.
Optionally, the target value y_i is calculated as shown in the following formula (2):
y_i = r_i + γ · Q(s_{i+1}, argmax_a Q(s_{i+1}, a; ω); ω⁻)   (2)
wherein r_i is the reward generated by selecting action a_i in state s_i, γ is the discount factor, s_{i+1} is the state at decision time i+1, and a_i is the action at decision time i.
The embodiment of the invention provides a method for optimizing blockchain system performance based on deep reinforcement learning in Internet of Things scenarios. Specifically, the invention quantifies the performance of the blockchain system in the Internet of Things scenario from three aspects, namely scalability, security and latency, yielding a more comprehensive optimization scheme. A sharding mechanism and parameter tuning are then adopted to improve the performance of the blockchain system and meet the high-scalability requirement of the Internet of Things system. To obtain the optimal scalability configuration without sacrificing other necessary performance indicators, the invention uses the DDQN algorithm to dynamically optimize system performance; the algorithm uses different networks to compute the target value, decoupling action selection from action evaluation and resolving the overestimation problem inherent in DQN.
Fig. 7 is a schematic structural diagram of an electronic device 700 according to an embodiment of the present invention, where the electronic device 700 may have a relatively large difference due to different configurations or performances, and may include one or more processors (central processing units, CPU) 701 and one or more memories 702, where at least one instruction is stored in the memories 702, and the at least one instruction is loaded and executed by the processors 701 to implement the following method for optimizing the performance of the blockchain under the internet of things based on deep reinforcement learning:
s1, initializing a block chain simulation system in an Internet of things scene.
S2, constructing a performance optimization model of the block chain simulation system according to the block chain simulation system; wherein the performance optimization model is built as a Markov decision process model.
And S3, solving the performance optimization model by adopting a deep reinforcement learning algorithm to obtain the optimal expandability configuration of the blockchain simulation system in the scene of the Internet of things.
In an exemplary embodiment, a computer readable storage medium, such as a memory including instructions executable by a processor in a terminal to perform the above-described deep reinforcement learning based method of blockchain performance optimization under the internet of things, is also provided. For example, the computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (2)

1. The method for optimizing the performance of the block chain under the Internet of things based on deep reinforcement learning is characterized by comprising the following steps:
s1, constructing a block chain system based on a slice deep learning proof S-PoDL consensus algorithm in an Internet of things scene; wherein, the performance evaluation index system of the block chain system comprises: scalability, security, and latency;
s2, constructing a Markov performance optimization model according to the block chain system;
s3, solving the performance optimization model by adopting a double-depth Q network DDQN algorithm to obtain the optimal expandability configuration of the block chain system in the Internet of things scene;
the slice deep learning in S1 proves an S-PoDL consensus algorithm, which comprises the following steps:
s11, dividing the N nodes in the blockchain system into K shards, wherein each shard contains one full node to generate blocks, and the number of nodes in each shard is N/K;
s12, distributing the training tasks of the deep learning model to the nodes of each shard;
s13, verifying the validity and accuracy of the block by the full node, and receiving the first verified block;
the step S12 of assigning the training task of the deep learning model to the node of each patch includes:
the model requester issues a plurality of deep learning models and training sets to nodes in different segments for obtaining an optimal deep learning model, wherein the model requester is set as honest; n nodes in the blockchain system start training the model under the condition that the test set is not obtained; after the node finishes training, generating a block head according to the rule of the bottom layer block chain system; the node submits the block header to the corresponding full node;
the full node in S13 verifies the validity and accuracy of the block and accepts the first verified block, including:
the model requester issues the test set to nodes in different fragments, and each node calculates the precision of the deep learning model; the nodes submit blocks and training models containing precision to the whole nodes; the full node verifies the validity of the block by comparing the hash values submitted in the two stages; sequencing the blocks with effectiveness according to the descending order of precision; the full node sequentially verifies the precision submitted by the blocks and receives the first verified block;
wherein the two phases comprise: distributing training tasks of the deep learning model to nodes of each patch and verifying validity and accuracy of the block by all nodes;
and in the step S2, constructing a Markov performance optimization model according to the blockchain system, wherein the method comprises the following steps of:
the Markov performance optimization model comprises: state space S (t), action space a (t), reward function R;
the state space S (t) includes an average transaction size, the computing resource c= { C of the node i Data transmission rate d= { D between nodes } i,j -the state space S (t), as shown in the following formula (1):
S(t) = [X, C, D]_t   (1)
the action space A(t) comprises the number of shards K, the block size S_B and the block interval T_B; the action space A(t) is represented by the following formula (2):
A(t) = [K, S_B, T_B]_t   (2)
the reward function is represented by the following formula (3):
wherein r_t(s_t, a_t) represents the reward generated by selecting action a_t in state s_t, S_B represents the block size, i.e., the number of bytes contained in each block, which determines how many transactions a block contains, T_B represents the block interval, i.e., the average time required by the block producer to generate a new block, which reflects the block release rate, and X represents the average transaction size;
the step S3 of solving the performance optimization model by adopting a double-depth Q network DDQN algorithm comprises the following steps:
S31, initializing the parameter ω of the current Q network, initializing the parameter ω' = ω of the target Q network, and emptying the experience replay pool B;
S32, initializing the exploration probability ε, the time interval C and the maximum number of rounds T;
S33, setting the initial time slot t = 0 and initializing the state s_t;
S34, feeding the state s_t into the current Q network as input and selecting action a_t with an ε-greedy policy;
S35, executing the action a_t in the state s_t to obtain the new state s_{t+1} and the reward r_t;
S36, storing the quadruple (s_t, a_t, r_t, s_{t+1}) in the experience replay pool B;
S37, randomly sampling q experience tuples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool B for learning, calculating the target Q value y_i, and updating the parameter ω of the current network by gradient back-propagation, where y_i is given by the following formula (4):
y_i = r_i + γ · Q(s_{i+1}, argmax_a Q(s_{i+1}, a; ω); ω⁻)   (4)
S38, after every C iterations, setting the parameter ω' = ω of the target Q network;
S39, setting s_t = s_{t+1} and t = t + 1, and returning to step S34 until the maximum number of rounds T is reached.
2. A device for optimizing the performance of a blockchain under the Internet of Things based on deep reinforcement learning, characterized in that the device comprises:
the initialization module is used for constructing a blockchain system based on a slice deep learning proof S-PoDL consensus algorithm in the scene of the Internet of things; wherein, the performance evaluation index system of the block chain system comprises: scalability, security, and latency;
the building module is used for building a Markov performance optimization model according to the block chain system;
the output module is used for solving the performance optimization model by adopting a double-depth Q network DDQN algorithm to obtain the optimal expandability configuration of the block chain system in the scene of the Internet of things;
the slice deep learning proves an S-PoDL consensus algorithm, which comprises the following steps:
s11, dividing the N nodes in the blockchain system into K shards, wherein each shard contains one full node to generate blocks, and the number of nodes in each shard is N/K;
s12, distributing the training tasks of the deep learning model to the nodes of each shard;
s13, verifying the validity and accuracy of the block by the full node, and receiving the first verified block;
the step S12 of assigning the training task of the deep learning model to the node of each patch includes:
the model requester issues a plurality of deep learning models and training sets to nodes in different segments for obtaining an optimal deep learning model, wherein the model requester is set as honest; n nodes in the blockchain system start training the model under the condition that the test set is not obtained; after the node finishes training, generating a block head according to the rule of the bottom layer block chain system; the node submits the block header to the corresponding full node;
the full node in S13 verifies the validity and accuracy of the block and accepts the first verified block, including:
the model requester issues the test set to nodes in different fragments, and each node calculates the precision of the deep learning model; the nodes submit blocks and training models containing precision to the whole nodes; the full node verifies the validity of the block by comparing the hash values submitted in the two stages; sequencing the blocks with effectiveness according to the descending order of precision; the full node sequentially verifies the precision submitted by the blocks and receives the first verified block;
wherein the two phases comprise: distributing training tasks of the deep learning model to nodes of each patch and verifying validity and accuracy of the block by all nodes;
the establishing a Markov performance optimization model according to the blockchain system comprises the following steps:
the Markov performance optimization model comprises: state space S (t), action space a (t), reward function R;
the state space S (t) includes an average transaction size, the computing resource c= { C of the node i Data transmission rate d= { D between nodes } i,j -the state space S (t), as shown in the following formula (1):
S(t) = [X, C, D]_t   (1)
the action space A(t) comprises the number of shards K, the block size S_B and the block interval T_B; the action space A(t) is represented by the following formula (2):
A(t) = [K, S_B, T_B]_t   (2)
the reward function is represented by the following formula (3):
wherein r_t(s_t, a_t) represents the reward generated by selecting action a_t in state s_t, S_B represents the block size, i.e., the number of bytes contained in each block, which determines how many transactions a block contains, T_B represents the block interval, i.e., the average time required by the block producer to generate a new block, which reflects the block release rate, and X represents the average transaction size;
the method for solving the performance optimization model by adopting the double-depth Q network DDQN algorithm comprises the following steps:
S31, initializing the parameter ω of the current Q network, initializing the parameter ω' = ω of the target Q network, and emptying the experience replay pool B;
S32, initializing the exploration probability ε, the time interval C and the maximum number of rounds T;
S33, setting the initial time slot t = 0 and initializing the state s_t;
S34, feeding the state s_t into the current Q network as input and selecting action a_t with an ε-greedy policy;
S35, executing the action a_t in the state s_t to obtain the new state s_{t+1} and the reward r_t;
S36, storing the quadruple (s_t, a_t, r_t, s_{t+1}) in the experience replay pool B;
S37, randomly sampling q experience tuples (s_i, a_i, r_i, s_{i+1}) from the experience replay pool B for learning, calculating the target Q value y_i, and updating the parameter ω of the current network by gradient back-propagation, where y_i is given by the following formula (4):
y_i = r_i + γ · Q(s_{i+1}, argmax_a Q(s_{i+1}, a; ω); ω⁻)   (4)
S38, after every C iterations, setting the parameter ω' = ω of the target Q network;
S39, setting s_t = s_{t+1} and t = t + 1, and returning to step S34 until the maximum number of rounds T is reached.
CN202310428183.6A 2023-04-20 2023-04-20 Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning Active CN116702583B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310428183.6A CN116702583B (en) 2023-04-20 2023-04-20 Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310428183.6A CN116702583B (en) 2023-04-20 2023-04-20 Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN116702583A CN116702583A (en) 2023-09-05
CN116702583B true CN116702583B (en) 2024-03-19

Family

ID=87830071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310428183.6A Active CN116702583B (en) 2023-04-20 2023-04-20 Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116702583B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115102867A (en) * 2022-05-10 2022-09-23 内蒙古工业大学 Block chain fragmentation system performance optimization method combined with deep reinforcement learning
CN115935442A (en) * 2022-12-09 2023-04-07 湖南天河国云科技有限公司 Block chain performance optimization method based on multi-agent deep reinforcement learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115102867A (en) * 2022-05-10 2022-09-23 内蒙古工业大学 Block chain fragmentation system performance optimization method combined with deep reinforcement learning
CN115935442A (en) * 2022-12-09 2023-04-07 湖南天河国云科技有限公司 Block chain performance optimization method based on multi-agent deep reinforcement learning

Also Published As

Publication number Publication date
CN116702583A (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN115935442A (en) Block chain performance optimization method based on multi-agent deep reinforcement learning
EP4350572A1 (en) Method, apparatus and system for generating neural network model, devices, medium and program product
CN113485826B (en) Load balancing method and system for edge server
CN112491818B (en) Power grid transmission line defense method based on multi-agent deep reinforcement learning
CN114626547A (en) Group collaborative learning method based on block chain
CN111800274B (en) Verifiable calculation energy consumption optimization method based on block chain
CN116541106B (en) Computing task unloading method, computing device and storage medium
CN115102867B (en) Block chain slicing system performance optimization method combining deep reinforcement learning
CN114330754A (en) Strategy model training method, device and equipment
CN114760308B (en) Edge calculation unloading method and device
CN115481441A (en) Difference privacy protection method and device for federal learning
TWI763120B (en) Computer-implemented method of an execution device, system for performing a software-implementated application and apparatus for generating an action selection policy for a software-implementated application
TWI770671B (en) Method for generating action selection policies, system and device for generating action selection policies for software-implemented application
CN116702583B (en) Method and device for optimizing performance of block chain under Internet of things based on deep reinforcement learning
CN111340623A (en) Data storage method and device
CN112312299A (en) Service unloading method, device and system
CN111461188A (en) Target service control method, device, computing equipment and storage medium
CN116486192A (en) Federal learning method and system based on deep reinforcement learning
CN114997400A (en) Neural network acceleration reasoning method
TWI757971B (en) Determining action selection policies of an execution device
CN114995157A (en) Anti-synchronization optimization control method of multi-agent system under cooperative competition relationship
CN106033434A (en) Virtual asset data replica processing method based on data size and popularity
CN116506444B (en) Block chain stable slicing method based on deep reinforcement learning and reputation mechanism
CN117812564B (en) Federal learning method, device, equipment and medium applied to Internet of vehicles
CN115190135B (en) Distributed storage system and copy selection method thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant