CN112419064A - Energy transaction method, device and equipment based on deep reinforcement learning and alliance chain

Info

Publication number: CN112419064A (application CN202011420188.7A; granted as CN112419064B)
Authority: CN (China)
Inventors: 吴嘉婧, 张如筱, 郑子彬
Original assignee: National Sun Yat-sen University
Current assignee: Guangzhou Huihui Intelligent Technology Co., Ltd.
Other languages: Chinese (zh)
Legal status: Granted; Active

Classifications

    • G06Q 40/04 Finance: trading; exchange, e.g. stocks, commodities, derivatives or currency exchange
    • G06F 16/27 Information retrieval: replication, distribution or synchronisation of data between databases or within a distributed database system
    • G06F 21/64 Security arrangements: protecting data integrity, e.g. using checksums, certificates or signatures
    • G06N 3/045 Neural networks: combinations of networks
    • G06N 3/084 Neural network learning methods: backpropagation, e.g. using gradient descent
    • G06Q 20/065 Payment architectures: private payment circuits using e-cash
    • G06Q 20/10 Payment architectures specially adapted for electronic funds transfer [EFT] or home banking systems
    • G06Q 50/06 ICT specially adapted for specific business sectors: energy or water supply


Abstract

The invention relates to an energy transaction method, device and equipment based on deep reinforcement learning and an alliance chain. A first state matrix is formed by collecting the N state vectors affecting buyers and sellers in an energy trading field; the state matrix is processed and analyzed in a neural network model to obtain an action matrix, a second state matrix and a reward matrix; and the neural network model is trained with the first state matrix, the action matrix, the second state matrix and the reward matrix to obtain a neural network training model. Applied to P2P electricity trading between electric vehicles on the basis of this trained model and an alliance chain, the method maximizes the long-term income of the electric vehicles participating in the transactions, and the introduced alliance chain ensures the privacy and safety of electric-vehicle electricity trading, thereby solving the technical problem of enabling buyers and sellers to obtain the maximum long-term benefit in alliance-chain-based P2P electricity trading.

Description

Energy transaction method, device and equipment based on deep reinforcement learning and alliance chain
Technical Field
The invention relates to the technical field of the Internet of Vehicles, and in particular to an energy transaction method, device and equipment based on deep reinforcement learning and an alliance chain.
Background
With the continued electrification of automobiles, new electric-vehicle makers in China have risen to prominence, traditional automobile enterprises are shifting to new energy, and user acceptance of electric vehicles is steadily increasing. However, the growing number of electric vehicles poses considerable challenges for large-scale charging. First, charging stations for electric vehicles are in short supply. In addition, electric vehicles are generally charged at night, and too many vehicles charging in the same period easily causes excessive power loss, voltage drops and overload in the supply grid.
To address these problems, a mechanism for peer-to-peer (P2P) electricity trading between electric vehicles has been proposed. In P2P electricity trading, the participating electric vehicles are regarded as "prosumers": according to their own situation they can directly purchase the electricity other electric vehicles need or sell their surplus electricity, and once buyer and seller negotiate and agree on a price, the electricity transfer is realized through the smart grid. The P2P electricity trading mechanism not only relieves the grid load at peak consumption, but also reduces the cost of the electricity-buying vehicle and increases the income of the electricity-selling vehicle. However, P2P electricity trading also exposes participants to the risk of privacy disclosure. A blockchain is a distributed shared ledger and database with characteristics such as decentralization, openness, independence, security and anonymity. Blockchain-based P2P electricity trading has therefore emerged: it resolves the information asymmetry between buyer and seller and creates a trustworthy trading environment for the participants, and it also allows participants to trade anonymously, protecting traders' privacy to the maximum extent. However, the security of blockchains is built on heavy computational work, while electric vehicles generally lack sufficient computing power, so many researchers choose an efficient, scalable alliance chain instead. The alliance chain is a special kind of blockchain: unlike the consensus mechanism in which every node of the blockchain acts as a bookkeeper, only a number of preselected authoritative nodes in the alliance chain serve as bookkeepers, collecting and managing the local transaction records at moderate cost, while the other access nodes can participate in transactions without taking part in bookkeeping, so that no consensus mechanism consuming large amounts of computing power and extra time is required.
Currently, alliance-chain-based P2P electricity trading mechanisms for charging vehicles only discuss maximizing the current income of the charging vehicle or the parking lot, without taking the long-term income of the charging vehicle into account. In real life, the number of vehicles in a parking lot changes dynamically; if a seller vehicle sells its surplus electricity at the current moment, it may lose the opportunity to trade with a higher-bidding vehicle at a later moment and thus forgo a higher profit. Therefore, in alliance-chain-based P2P electricity trading, how to provide the charging vehicle with the buyer-seller matching strategy that maximizes long-term income is a problem to be solved urgently.
Disclosure of Invention
The embodiments of the invention provide an energy transaction method, device and equipment based on deep reinforcement learning and an alliance chain, which are used to solve the technical problem of enabling buyers and sellers to obtain the maximum long-term benefit in alliance-chain-based P2P electricity trading.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
An energy transaction method based on deep reinforcement learning and an alliance chain, applied to electricity trading between electric vehicles, comprises the following steps (a schematic sketch follows the list of steps):
S10, collecting transaction characteristics of an energy trading field, forming the transaction characteristics into state vectors, and forming a first state matrix from the N state vectors in the energy trading field at time t; the transaction characteristics comprise the remaining time the electric vehicle stays parked in the energy trading field, a buy/sell label, the transaction energy and the transaction price;
S20, inputting the first state matrix into a deep reinforcement learning neural network model and outputting an action matrix;
S30, computing a second state matrix and a reward matrix at time t+1 from the action matrix and the first state matrix through a state transfer function and a reward function; the first state matrix, the action matrix, the second state matrix and the reward matrix form a training tuple, which is stored in the replay pool of the neural network model;
S40, every Δt interval, sampling m training tuples from the replay pool of the neural network model to train the neural network model until the loss function of the neural network model converges or the maximum number of iterations is reached, obtaining a trained neural network training model;
and S50, in the energy trading field, inputting the state matrix formed from the transaction characteristics of the buyers and sellers that need to trade into the neural network training model to obtain the energy traded between buyers and sellers.
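By way of illustration only, the following Python sketch shows one way steps S10 to S50 could be organized as a training loop. The helper names (collect_states, field, agent, transition, reward) are hypothetical stand-ins for the components specified in the embodiments, not part of the claimed method.

```python
# Hypothetical skeleton of steps S10-S50 (names are illustrative assumptions).
import numpy as np

N, FEATURES = 20, 4            # vehicles per trading field; (delta, e, p, z)
REPLAY_CAPACITY, BATCH = 10000, 64

replay_pool = []               # stores (S_t, A_t, S_t1, R_t) training tuples

def collect_states(field):
    """S10: one state vector per vehicle -> first state matrix (N x 4)."""
    return np.stack([v.state_vector() for v in field.vehicles])

def training_loop(field, agent, transition, reward, max_steps):
    S_t = collect_states(field)
    for step in range(max_steps):
        A_t = agent.act(S_t)                      # S20: action matrix
        S_t1 = transition(S_t, A_t)               # S30: state transfer
        R_t = reward(S_t, A_t)                    # S30: reward matrix
        replay_pool.append((S_t, A_t, S_t1, R_t))
        if len(replay_pool) > REPLAY_CAPACITY:
            replay_pool.pop(0)
        if step % 10 == 0 and len(replay_pool) >= BATCH:   # every delta-t
            idx = np.random.choice(len(replay_pool), BATCH, replace=False)
            agent.train([replay_pool[i] for i in idx])     # S40
        S_t = S_t1
    return agent                                  # S50: deploy trained model
```

The sketch only fixes the data flow; the concrete networks, state transfer function and reward function are specified in the embodiments below.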
Preferably, before storing the training tuple into the replay pool of the neural network model, the method further includes performing abnormal-value processing on the first state matrix, the action matrix and the second state matrix:
deleting the values of the first state matrix and the second state matrix that are not within the preset value range and filling them with 0;
and deleting the element values in the action matrix for which the buyer's price is less than the seller's price and filling them with 0.
Preferably, outputting the action matrix specifically includes: cutting the vector output by the neural network model into N vectors, each vector comprising N elements, the N vectors constituting the N × N action matrix; wherein each vector gives the energy one electric vehicle trades with each of the other N-1 vehicles.
Preferably, the state transfer function $f(S_t, A_t)$ yields, for each electric vehicle $i$, the second state vector

$$s_i^{t+1} = \left(\delta_i^{t+1},\, e_i^{t+1},\, p_i^{t+1},\, z_i^{t+1}\right),$$

where $\delta_i^{t+1}$, $e_i^{t+1}$, $p_i^{t+1}$ and $z_i^{t+1}$ denote the remaining parking time, the transaction electric quantity, the transaction price and the buy/sell label of electric vehicle $i$ at time $t+1$, respectively.

The transaction electric quantity required by electric vehicle $i$ at time $t+1$ is

$$e_i^{t+1} = e_i^t - \sum_{j=1}^{N} a_{ij}^t,$$

where $e_i^t$ is the transaction electric quantity required by electric vehicle $i$ at time $t$, $z_i^t$ is the buy/sell label of electric vehicle $i$ at time $t$, and $a_{ij}^t$ is the element of the action matrix giving the energy purchased by electric vehicle $i$ from electric vehicle $j$ at time $t$.

The transaction price $p_i^{t+1}$ quoted by electric vehicle $i$ at time $t+1$ is

$$p_i^{t+1} = p_i^t + x, \qquad x \sim \mathcal{N}\!\left(\mu_1, \sigma_1^2\right),$$

where $\mu_1$ and $\sigma_1^2$ are the mean and variance of the normal distribution satisfied by the random variable $x$.

When $e_i^{t+1} \neq 0$, the remaining parking time and the buy/sell label of electric vehicle $i$ at time $t+1$ are

$$\delta_i^{t+1} = \delta_i^t - 1, \qquad z_i^{t+1} = z_i^t,$$

where $\delta_i^t$ is the remaining parking time of electric vehicle $i$ at time $t$.

When $e_i^{t+1} = 0$, the expressions for the remaining parking time, the transaction electric quantity and the buy/sell label of electric vehicle $i$ at time $t+1$ are

$$\delta_i^{t+1} \sim \mathcal{N}\!\left(\mu_2, \sigma_2^2\right), \qquad e_i^{t+1} \sim \mathcal{N}\!\left(\mu_3, \sigma_3^2\right), \qquad z_i^{t+1} = \begin{cases} -1, & x < 0.5 \\ +1, & x \geq 0.5 \end{cases} \quad x \sim U(0, 1),$$

where $\mu_2$ and $\sigma_2^2$ are the mean and variance of the normal distribution satisfied by the stay time $\delta$ of electric vehicles in the energy trading field, and $\mu_3$ and $\sigma_3^2$ are the mean and variance of the normal distribution satisfied by the energy $e$ the electric vehicles need to trade.
Preferably, the reward matrix $R_t$ is formed from the rewards $r_i^t$ of the individual electric vehicles, where the reward $r_i^t$ of electric vehicle $i$ at time $t$ is computed from the constants $k_1, k_2, k_3, k_4$, a penalty factor $s_t$ at time $t$, the action-matrix elements $a_{ij}^t$ (the energy purchased by electric vehicle $i$ from electric vehicle $j$ at time $t$), the energy $e_i^t$ required by electric vehicle $i$ at time $t$, the energy price $p_i^t$ quoted by electric vehicle $i$ at time $t$, and the value $z_i^t$ of the buy/sell label of electric vehicle $i$ at time $t$.
Preferably, the training of the neural network model specifically includes: iteratively updating the parameters of a Critic network and an Actor network until the loss function converges or the maximum number of iterations is reached, wherein the Critic network comprises a Critic evaluation network and a Critic target network, and the Actor network comprises an Actor evaluation network and an Actor target network;

wherein the loss function of the Critic network is:

$$L_2 = \frac{1}{m}\sum_{k=1}^{m}\left(r_k + \gamma\, q'_k - q_k\right)^2$$

and the loss function of the Actor network is:

$$L_1 = -\frac{1}{m}\sum_{k=1}^{m} q_k$$

where $L_1$ is the loss of the Actor network; $L_2$ is the loss of the Critic network; $\gamma$ is the discount coefficient; $q_k$ is the output of the Critic evaluation network, representing the Q value corresponding to sample $k$; $q'_k$ is the output of the Critic target network, representing the Q value at the next moment of sample $k$; and $k \in \{1, 2, \dots, m\}$. The mathematical expressions of $q_k$ and $q'_k$ are:

$$q_k = \mathrm{ReLU}\left(W_2 S_k + W_3 A_k + b_2\right)$$
$$q'_k = \mathrm{ReLU}\left(W'_2 S_{k+1} + W'_3\, \mu'(S_{k+1}) + b'_2\right)$$

where $W_2$ and $W_3$ are the weight matrices of the Critic evaluation network's output layer, $b_2$ is the bias vector of the Critic evaluation network's output layer, $W'_2$ and $W'_3$ are the weight matrices of the Critic target network's output layer, $b'_2$ is the bias vector of the Critic target network's output layer, and $\mu'(S_{k+1})$ is the output obtained by feeding $S_{k+1}$ into the Actor target network, representing the optimal action matrix corresponding to state $S_{k+1}$;

the back-propagation expression for iteratively updating the parameters of the Actor evaluation network and the Critic evaluation network is:

$$W \leftarrow W - \alpha\, \nabla_W L$$

and the soft-update expression for iteratively updating the parameters of the Actor target network and the Critic target network is:

$$W' \leftarrow \tau W + (1 - \tau)\, W'$$

where $W$ denotes the parameters $W_2, W_3, b_2$ of the Critic evaluation network and the Actor evaluation network, $\nabla_W L$ is the gradient of the two loss functions $L_1, L_2$ with respect to $W$, $W'$ denotes the parameters of the Critic target network and the Actor target network, $\alpha$ is a training factor with value range $[0, 1)$, and $\tau$ is a coefficient controlling the influence of the old target-network parameters $W'$ and the evaluation-network parameters $W$ on the target network.
Preferably, obtaining the matched trades between buyers and sellers in the energy trading field specifically further includes: at a given moment, inputting a state matrix formed from the transaction characteristics of the buyers and sellers that need to trade into the neural network training model, and the neural network training model outputs the action matrix of the trade.
The invention also provides an energy transaction device based on deep reinforcement learning and an alliance chain, which comprises a data acquisition module, a first processing module, a second processing module, a training module and an output module;
the data acquisition module is used for collecting transaction characteristics of the energy trading field and forming them into state vectors, the N state vectors in the energy trading field at time t forming a first state matrix; the transaction characteristics comprise the remaining time the electric vehicle stays parked in the energy trading field, a buy/sell label, the transaction energy and the transaction price;
the first processing module is used for inputting the first state matrix into a deep reinforcement learning neural network model and outputting an action matrix;
the second processing module is used for computing a second state matrix and a reward matrix at time t+1 from the action matrix and the first state matrix through a state transfer function and a reward function; the first state matrix, the action matrix, the second state matrix and the reward matrix form a training tuple, which is stored in the replay pool of the neural network model;
the training module is used for sampling m training tuples from the replay pool of the neural network model every Δt interval to train the neural network model until the loss function of the neural network model converges or the maximum number of iterations is reached, obtaining a trained neural network training model;
and the output module is used for inputting, in the energy trading field, the state matrix formed from the transaction characteristics of the buyers and sellers that need to trade into the neural network training model to obtain the energy traded between buyers and sellers.
The present invention also provides a computer-readable storage medium for storing computer instructions that, when run on a computer, cause the computer to perform the above energy trading method based on deep reinforcement learning and an alliance chain.
The invention also provides terminal equipment, which comprises a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is used for executing the energy trading method based on the deep reinforcement learning and the alliance chain according to instructions in the program code.
According to the technical scheme, the embodiments of the invention have the following advantages. The energy trading method, device and equipment based on deep reinforcement learning and an alliance chain form a first state matrix by collecting the N state vectors affecting buyers and sellers in an energy trading field; the state matrix is processed and analyzed in the neural network model to obtain an action matrix, a second state matrix and a reward matrix; and the neural network model is trained with the first state matrix, the action matrix, the second state matrix and the reward matrix to obtain a neural network training model. Applied to P2P electricity trading between electric vehicles on the basis of this trained model and the alliance chain, the method maximizes the long-term income of the electric vehicles participating in the transactions, and the introduced alliance chain ensures the privacy and safety of electric-vehicle electricity trading, thereby solving the technical problem of enabling buyers and sellers to obtain the maximum long-term benefit in alliance-chain-based P2P electricity trading.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of the steps of an energy transaction method based on deep reinforcement learning and an alliance chain according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of the neural network model of an energy trading method based on deep reinforcement learning and an alliance chain according to an embodiment of the present invention.
Fig. 3 is a framework diagram of the alliance chain of an energy trading method based on deep reinforcement learning and an alliance chain according to an embodiment of the present invention.
Fig. 4 is a block diagram of an energy transaction apparatus based on deep reinforcement learning and an alliance chain according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described below are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The embodiments of the application provide an energy transaction method, device and equipment based on deep reinforcement learning and an alliance chain, used to solve the technical problem of enabling buyers and sellers to obtain the maximum long-term benefit in alliance-chain-based P2P electricity trading. In the embodiments, a parking lot serves as the energy trading field, and the traded energy is the electric quantity exchanged between the electric vehicles in the parking lot.
Example one:
fig. 1 is a flowchart illustrating steps of an energy transaction method based on deep reinforcement learning and a federation chain according to an embodiment of the present invention, and fig. 2 is a schematic structural diagram illustrating a neural network model of the energy transaction method based on deep reinforcement learning and a federation chain according to an embodiment of the present invention.
As shown in fig. 1 and fig. 2, an embodiment of the present invention provides an energy transaction method based on deep reinforcement learning and an alliance chain, comprising the following steps:
S10, collecting transaction characteristics of an energy trading field, forming the transaction characteristics into state vectors, and forming a first state matrix from the N state vectors in the energy trading field at time t; the transaction characteristics comprise the remaining time the electric vehicle stays parked in the energy trading field, a buy/sell label, the transaction energy and the transaction price;
S20, inputting the first state matrix into a deep reinforcement learning neural network model and outputting an action matrix;
S30, computing a second state matrix and a reward matrix at time t+1 from the action matrix and the first state matrix through a state transfer function and a reward function; the first state matrix, the action matrix, the second state matrix and the reward matrix form a training tuple, which is stored in the replay pool of the neural network model;
S40, every Δt interval, sampling m training tuples from the replay pool of the neural network model to train the neural network model until the loss function of the neural network model converges or the maximum number of iterations is reached, obtaining a trained neural network training model;
and S50, in the energy trading field, inputting the state matrix formed from the transaction characteristics of the buyers and sellers that need to trade into the neural network training model to obtain the energy traded between buyers and sellers.
In step S10 of the embodiment of the present invention, the characteristics of the electric vehicles in the parking lot are collected, including the remaining parking time δ of each electric vehicle, the buy/sell label z (the values of z for a buyer and a seller are -1 and +1, respectively), the electric quantity e to be traded, and the transaction price p. These characteristics are combined into the state vector $s_i^t = (\delta_i^t, e_i^t, p_i^t, z_i^t)$ of one electric vehicle, and at time $t$ the state vectors of all N electric vehicles in the parking lot form the first state matrix $S_t$, i.e.

$$S_t = \left(s_1^t, s_2^t, \dots, s_N^t\right)^{\top} \in \mathbb{R}^{N \times 4}.$$
It should be noted that step S10 mainly captures the factors affecting the long-term profit of both parties in the energy trading field. A concrete toy example follows.
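As a toy illustration of step S10 (the numeric values are invented for the example), the first state matrix can be assembled from per-vehicle feature tuples as follows:

```python
import numpy as np

# One row per electric vehicle: (delta, e, p, z)
# delta: remaining parking time, e: electric quantity to trade,
# p: transaction price, z: buy/sell label (-1 buyer, +1 seller).
vehicles = [
    (5.0, 12.0, 0.42, -1),   # buyer needing 12 kWh
    (3.0,  8.0, 0.38, +1),   # seller offering 8 kWh
    (7.0,  6.0, 0.40, +1),
]
S_t = np.array(vehicles, dtype=np.float32)   # first state matrix, shape (N, 4)
print(S_t.shape)                             # (3, 4)
```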
In steps S20 to S40 of the embodiment of the present invention, the features in the state matrix are processed and learned by the neural network model to obtain the action matrix, and the action matrix and the first state matrix $S_t$ are processed by the state transfer function and the reward function to obtain the state matrix at time t+1 (i.e., the second state matrix $S_{t+1}$) and the reward matrix $R_t$. The first state matrix $S_t$, action matrix $A_t$, second state matrix $S_{t+1}$ and reward matrix $R_t$ form a training tuple $(S_t, A_t, S_{t+1}, R_t)$ used as training data for the neural network model of the energy trading field. Training yields a neural network training model that learns the dynamic changes of the parking lot (energy trading field) state and the optimal action matrix adapted to those changes, so that the long-term income of the trading electric vehicles under dynamic change reaches its maximum.
It should be noted that, as shown in fig. 2, the neural network model includes a Critic evaluation network, a Critic target network, an Actor evaluation network and an Actor target network. The Critic evaluation network and the Critic target network have the same structure and are mainly used to compute the Q values of states and actions; the Actor evaluation network and the Actor target network have the same structure and are mainly responsible for selecting the action with the highest Q value according to the state and outputting the action matrix. The state data obtained by the neural network model is input into the Actor evaluation network, which outputs an action; the action output by the Actor evaluation network interacts with the environment to obtain the next state data and the reward data; these data are stored; and once the data volume is sufficient, the data are randomly sampled to update the network parameters of the Critic evaluation network, the Critic target network, the Actor evaluation network and the Actor target network. The Critic evaluation network and the Actor evaluation network are updated with a gradient descent algorithm; the Critic target network and the Actor target network are updated with a soft-update algorithm. A minimal sketch of these four networks follows.
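A minimal PyTorch sketch of the four networks, assuming illustrative layer sizes that are not specified by the embodiment:

```python
# Sketch of the Critic/Actor evaluation and target networks (layer sizes
# are illustrative assumptions, not the patented architecture).
import copy
import torch
import torch.nn as nn

N, F = 20, 4                       # vehicles, features per state vector

class Actor(nn.Module):
    """Maps a flattened state matrix to an N x N action matrix."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N * F, 256), nn.ReLU(),   # hidden layers use ReLU
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, N * N),
        )
    def forward(self, s):
        logits = self.net(s).view(-1, N, N)
        return torch.softmax(logits, dim=-1)    # output layer uses Softmax

class Critic(nn.Module):
    """Maps (state matrix, action matrix) to a scalar Q value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N * F + N * N, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )
    def forward(self, s, a):
        return self.net(torch.cat([s, a.flatten(1)], dim=-1))

actor_eval, critic_eval = Actor(), Critic()
actor_target = copy.deepcopy(actor_eval)       # target nets share structure
critic_target = copy.deepcopy(critic_eval)     # with their evaluation nets
```

Keeping separate evaluation and target copies of each network is what allows the soft update described later to stabilize training.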
In step S40 of the embodiment of the present invention, steps S10 to S30 are repeated to collect a sufficient amount of training data to train the neural network model, so as to improve the accuracy of the results output by the neural network training model.
In step S50 of the embodiment of the present invention, the obtained neural network training model is combined with the alliance chain: the state matrix of the buyers and sellers that need to trade is input into the neural network training model, the model outputs the energy each matched buyer-seller pair trades, the two parties carry out the energy transaction according to that quantity and price, and the monetary settlement is then performed on the alliance chain, which ensures the privacy of the buyers' and sellers' transactions.
According to the trading energy output by the neural network training model, the electric vehicles transfer the corresponding electric quantity through the charging/discharging piles of the parking lot, completing the electricity transfer of the transaction. The buyer electric vehicle then makes a currency payment to the seller vehicle on the alliance chain, and the seller vehicle signs to confirm after receiving the income for the traded electricity, completing the transaction. At regular intervals, the different parking lots package the transaction records of that interval into blocks, take turns acting as the bookkeeper, and append each block to the end of the blockchain after the other parking lots confirm it by signature. If the states of the charging vehicles in the parking lot change at time t+1, the neural network training model acquires the new state data of the state matrix and a new round of electric-vehicle electricity trading begins.
The invention provides an energy trading method based on deep reinforcement learning and an alliance chain. A first state matrix is formed by collecting the N state vectors affecting buyers and sellers in the energy trading field; the state matrix is processed and analyzed in the neural network model to obtain an action matrix, a second state matrix and a reward matrix; and the neural network model is trained with the first state matrix, the action matrix, the second state matrix and the reward matrix to obtain a neural network training model. Applied to P2P electricity trading between electric vehicles on the basis of this trained model and the alliance chain, the method maximizes the long-term income of the electric vehicles participating in the transactions, and the introduced alliance chain ensures the privacy and safety of electric-vehicle electricity trading, thereby solving the technical problem of enabling buyers and sellers to obtain the maximum long-term benefit in alliance-chain-based P2P electricity trading.
In an embodiment of the present invention, before the training tuple is stored into the replay pool of the neural network model, the method further includes performing abnormal-value processing on the first state matrix $S_t$, the action matrix $A_t$ and the second state matrix $S_{t+1}$:

for the first state matrix $S_t$ and the second state matrix $S_{t+1}$, values outside the preset value range are deleted and replaced with 0;

for the action matrix $A_t$, element values for which the buyer's price is less than the seller's price are deleted and replaced with 0.

It should be noted that the abnormal-value processing is performed before the training tuple $(S_t, A_t, S_{t+1}, R_t)$ is stored into the replay pool of the neural network model. This includes deleting values of the state matrices $S_t$ and $S_{t+1}$ that fall outside the preset ranges and replacing them with 0; for example, if a value of z in the state matrix is neither -1 nor 1, or a price p in the state matrix is not within the preset price range required by the buyer, the value is deleted and replaced with 0. It further includes deleting every element $a_{ij}^t$ of the action matrix $A_t$ that does not satisfy $p_i^t \geq p_j^t$ (where $p_i^t$ is the price at which buyer vehicle $i$ purchases energy at time $t$ and $a_{ij}^t$ is the electric quantity purchased by vehicle $i$ from vehicle $j$ at time $t$) and replacing it with 0. A sketch of this filtering follows.
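A sketch of this abnormal-value filtering, assuming the state columns (δ, e, p, z) and an illustrative preset price range:

```python
# Hedged sketch of the abnormal-value processing (ranges are assumptions).
import numpy as np

def clean_state(S, p_min=0.0, p_max=1.0):
    """Zero out state entries outside the preset ranges (columns: delta, e, p, z)."""
    S = S.copy()
    bad_z = ~np.isin(S[:, 3], (-1.0, 1.0))        # label must be -1 or +1
    bad_p = (S[:, 2] < p_min) | (S[:, 2] > p_max) # price must be in preset range
    S[bad_z, 3] = 0.0
    S[bad_p, 2] = 0.0
    return S

def clean_action(A, p):
    """Zero out trades a_ij where buyer i's price is below seller j's price."""
    A = A.copy()
    buyer_price = p[:, None]     # p_i broadcast over rows
    seller_price = p[None, :]    # p_j broadcast over columns
    A[buyer_price < seller_price] = 0.0
    return A
```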
In one embodiment of the invention, outputting the action matrix $A_t$ specifically includes: cutting the vector $a$ output by the neural network model into N vectors, each comprising N elements, the N vectors constituting the N × N action matrix $A_t$; wherein each element $a_{ij}^t$ is the energy electric vehicle $i$ trades with electric vehicle $j$ among the other N-1 vehicles.

The vector $a$ is output by the Actor network of the neural network model; the hidden layers compute

$$\alpha_{h+1} = \mathrm{Act}_1\left(W_{h+1}\,\alpha_h + b_{h+1}\right),$$

where $\alpha_h$ is the output of the h-th hidden layer of the Actor network and also the input of the (h+1)-th hidden layer, $W_{h+1}$ and $b_{h+1}$ are the weight matrix and bias of the (h+1)-th hidden layer, and $\mathrm{Act}_1(\cdot)$ is the activation function of the Actor network's hidden layers. The hidden-layer activation function $\mathrm{Act}_1(\cdot)$ uses the ReLU function:

$$\mathrm{Act}_1(x) = \max(0, x),$$

where $x$ is the function variable. The ReLU function is the hidden-layer activation function of the neural network model; compared with other activation functions such as sigmoid, the ReLU function activates only a few neurons at a time, guaranteeing the sparsity of the neural network, making computation efficient, accelerating the convergence of the network weights, and alleviating the vanishing-gradient problem. If the neural network model suffers from dying neurons, the ReLU function can be replaced by the similar LeakyReLU function:

$$\mathrm{LeakyReLU}(x) = \max(Cx, x),$$

where C is a small constant.

The expression of the output layer of the Actor network is:

$$a = \mathrm{Act}_2\left(W_{H+1}\,\alpha_H + b_{H+1}\right),$$

where H is the total number of hidden layers, $\alpha_H$ is the output of the last hidden layer and the input of the output layer, $W_{H+1}$ and $b_{H+1}$ are the weight matrix and bias of the output layer, and $\mathrm{Act}_2(\cdot)$ is the activation function of the Actor network's output layer. The activation function $\mathrm{Act}_2(\cdot)$ uses the Softmax function:

$$\mathrm{Act}_2(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}},$$

where $x_i$ is the input to the i-th node of the Actor network's output layer. Compared with other activation functions of a neural network model, the outputs of the Softmax function sum to 1, so in the energy transaction method based on deep reinforcement learning and the alliance chain they can be interpreted directly as the percentages of the total required electric quantity that the buyer electric vehicle purchases from each of the other seller electric vehicles, requiring no additional computation and therefore efficient. A sketch of this reshaping follows.
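The reshaping and Softmax interpretation can be sketched as follows; scaling each row by the buyer's total required quantity $e_i$ is one way to read the percentages back as electric quantities (an assumption consistent with, but not spelled out verbatim in, the description):

```python
# Illustrative reshaping of the Actor output vector into the N x N action
# matrix; rows are Softmax fractions scaled by each buyer's required quantity.
import numpy as np

def to_action_matrix(a_vec, e):
    N = e.shape[0]
    A = a_vec.reshape(N, N)                                 # cut into N vectors of N elements
    A = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)    # Softmax per row
    return A * e[:, None]                                   # fraction -> electric quantity

a_vec = np.random.randn(3 * 3).astype(np.float32)   # raw (pre-Softmax) output
e = np.array([12.0, 0.0, 0.0])                      # only vehicle 0 is buying
print(to_action_matrix(a_vec, e).round(2))
```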
In one embodiment of the invention, the state transfer function $f(S_t, A_t)$ yields, for each electric vehicle $i$, the second state vector

$$s_i^{t+1} = \left(\delta_i^{t+1},\, e_i^{t+1},\, p_i^{t+1},\, z_i^{t+1}\right),$$

where $\delta_i^{t+1}$, $e_i^{t+1}$, $p_i^{t+1}$ and $z_i^{t+1}$ denote the remaining parking time, the transaction electric quantity, the transaction price and the buy/sell label of electric vehicle $i$ at time $t+1$, respectively.

The transaction electric quantity required by electric vehicle $i$ at time $t+1$ is

$$e_i^{t+1} = e_i^t - \sum_{j=1}^{N} a_{ij}^t,$$

where $e_i^t$ is the transaction electric quantity required by electric vehicle $i$ at time $t$, $z_i^t$ is the buy/sell label of electric vehicle $i$ at time $t$, and $a_{ij}^t$ is the element of the action matrix giving the energy purchased by electric vehicle $i$ from electric vehicle $j$ at time $t$.

The transaction price $p_i^{t+1}$ quoted by electric vehicle $i$ at time $t+1$ is

$$p_i^{t+1} = p_i^t + x, \qquad x \sim \mathcal{N}\!\left(\mu_1, \sigma_1^2\right),$$

where $\mu_1$ and $\sigma_1^2$ are the mean and variance of the normal distribution satisfied by the random variable $x$.

When $e_i^{t+1} \neq 0$, the remaining parking time and the buy/sell label of electric vehicle $i$ at time $t+1$ are

$$\delta_i^{t+1} = \delta_i^t - 1, \qquad z_i^{t+1} = z_i^t,$$

where $\delta_i^t$ is the remaining parking time of electric vehicle $i$ at time $t$.

When $e_i^{t+1} = 0$, the expressions for the remaining parking time, the transaction electric quantity and the buy/sell label of electric vehicle $i$ at time $t+1$ are

$$\delta_i^{t+1} \sim \mathcal{N}\!\left(\mu_2, \sigma_2^2\right), \qquad e_i^{t+1} \sim \mathcal{N}\!\left(\mu_3, \sigma_3^2\right), \qquad z_i^{t+1} = \begin{cases} -1, & x < 0.5 \\ +1, & x \geq 0.5 \end{cases} \quad x \sim U(0, 1),$$

where $\mu_2$ and $\sigma_2^2$ are the mean and variance of the normal distribution satisfied by the stay time $\delta$ of electric vehicles in the energy trading field, and $\mu_3$ and $\sigma_3^2$ are the mean and variance of the normal distribution satisfied by the energy $e$ the electric vehicles need to trade.

In addition, when the state matrix $S_t$ transfers to the state matrix $S_{t+1}$, the remaining stay time δ and the electric quantity e to be traded among the state-matrix features are determined by the action matrix $A_t$ taken at the transition, the buy/sell label z is related to the vehicle's residual electric quantity, and the transaction price p of the electricity fluctuates randomly. Specifically, the relation between the remaining stay time δ and the trade quantity e at time t+1 and their values at time t is embodied in the deterministic part of the state transfer function, while the transaction price p is determined by the stochastic part of $f(S_t, A_t)$ and is therefore not deterministic. The state transfer function $f(S_t, A_t)$ computes differently according to whether the buy/sell label z changes, how the transaction price p fluctuates over time, and whether an electric vehicle leaves after completing its electricity trade and a new electric vehicle joins. For example, the first case above applies to electric vehicles whose transactions are incomplete, i.e. $e_i^{t+1} \neq 0$, while the second case applies to electric vehicles that have completed their transactions, i.e. $e_i^{t+1} = 0$. The variable x is a random variable; $U(0, 1)$ denotes that x satisfies a uniform distribution on the interval (0, 1), i.e., x is a random number in (0, 1). For example, at time t+1 the probability that a replaced electric vehicle becomes a buyer or a seller is 0.5. A sketch of this transition follows.
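A sketch of the state transfer function as reconstructed above; the distribution parameters μ1, σ1, μ2, σ2, μ3, σ3 are illustrative assumptions:

```python
# Sketch of the state transfer function f(S_t, A_t); the distribution
# parameters below are illustrative assumptions, not the patented values.
import numpy as np

rng = np.random.default_rng(0)
MU1, SG1 = 0.0, 0.02     # price fluctuation
MU2, SG2 = 6.0, 2.0      # stay time of a newly arriving vehicle
MU3, SG3 = 10.0, 3.0     # quantity a new vehicle needs to trade

def transition(S, A):
    """Map (S_t, A_t) to S_{t+1}; columns of S are (delta, e, p, z)."""
    S1 = S.copy()
    traded = A.sum(axis=1)                              # sum_j a_ij per vehicle i
    S1[:, 1] = S[:, 1] - traded                         # remaining quantity to trade
    S1[:, 2] = S[:, 2] + rng.normal(MU1, SG1, len(S))   # price random walk
    done = np.isclose(S1[:, 1], 0.0)
    S1[~done, 0] -= 1.0                                 # unfinished: countdown, keep z
    n_new = int(done.sum())                             # finished: replaced by arrivals
    S1[done, 0] = np.maximum(rng.normal(MU2, SG2, n_new), 0.0)
    S1[done, 1] = np.maximum(rng.normal(MU3, SG3, n_new), 0.0)
    S1[done, 3] = np.where(rng.uniform(size=n_new) < 0.5, -1.0, 1.0)
    return S1
```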
In one embodiment of the invention, the reward matrix $R_t$ is formed from the rewards $r_i^t$ of the individual electric vehicles. The reward $r_i^t$ of electric vehicle $i$ at time $t$ is computed from the constants $k_1, k_2, k_3, k_4$, a penalty factor $s_t$ at time $t$, the action-matrix elements $a_{ij}^t$ (the energy purchased by electric vehicle $i$ from electric vehicle $j$ at time $t$), the energy $e_i^t$ required by electric vehicle $i$ at time $t$, the energy price $p_i^t$ quoted by electric vehicle $i$ at time $t$, and the value $z_i^t$ of the buy/sell label of electric vehicle $i$ at time $t$. An illustrative sketch of a reward of this shape follows.
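The closed form of $r_i^t$ appears only as an equation image in the original publication. Purely as an illustration of one reward built from the quantities listed above (the weighting and the penalty form $s_t = k_3 + k_4 t$ are assumptions, not the patented formula):

```python
# Illustrative reward of the shape described above; k1..k4 and the penalty
# form are assumptions, not the patented formula.
import numpy as np

K1, K2, K3, K4 = 1.0, 0.5, 0.1, 0.01

def reward(S, A, t):
    """Per-vehicle reward r_i^t from the (delta, e, p, z) columns of S and A."""
    e, p, z = S[:, 1], S[:, 2], S[:, 3]
    traded = A.sum(axis=1)                   # sum_j a_ij for vehicle i
    s_t = K3 + K4 * t                        # assumed time-growing penalty factor
    revenue = K1 * z * p * traded            # sellers earn, buyers pay
    shortfall = K2 * s_t * np.maximum(e - traded, 0.0)   # unmet-demand penalty
    return revenue - shortfall               # reward matrix R_t (length-N vector)
```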
In an embodiment of the present invention, the training of the neural network model specifically includes: iteratively updating the parameters of a Critic network and an Actor network until the loss function converges or the maximum number of iterations is reached, wherein the Critic network comprises a Critic evaluation network and a Critic target network, and the Actor network comprises an Actor evaluation network and an Actor target network;

wherein the loss function of the Critic network is:

$$L_2 = \frac{1}{m}\sum_{k=1}^{m}\left(r_k + \gamma\, q'_k - q_k\right)^2$$

and the loss function of the Actor network is:

$$L_1 = -\frac{1}{m}\sum_{k=1}^{m} q_k$$

where $L_1$ is the loss of the Actor network; $L_2$ is the loss of the Critic network; $\gamma$ is the discount coefficient; $q_k$ is the output of the Critic evaluation network, representing the Q value corresponding to sample $k$; $q'_k$ is the output of the Critic target network, representing the Q value at the next moment of sample $k$; and $k \in \{1, 2, \dots, m\}$. The mathematical expressions of $q_k$ and $q'_k$ are:

$$q_k = \mathrm{ReLU}\left(W_2 S_k + W_3 A_k + b_2\right)$$
$$q'_k = \mathrm{ReLU}\left(W'_2 S_{k+1} + W'_3\, \mu'(S_{k+1}) + b'_2\right)$$

where $W_2$ and $W_3$ are the weight matrices of the Critic evaluation network's output layer, $b_2$ is the bias vector of the Critic evaluation network's output layer, $W'_2$ and $W'_3$ are the weight matrices of the Critic target network's output layer, $b'_2$ is the bias vector of the Critic target network's output layer, and $\mu'(S_{k+1})$ is the output obtained by feeding $S_{k+1}$ into the Actor target network, representing the optimal action matrix corresponding to state $S_{k+1}$;

the back-propagation expression for iteratively updating the parameters of the Actor evaluation network and the Critic evaluation network is:

$$W \leftarrow W - \alpha\, \nabla_W L$$

and the soft-update expression for iteratively updating the parameters of the Actor target network and the Critic target network is:

$$W' \leftarrow \tau W + (1 - \tau)\, W'$$

where $W$ denotes the parameters $W_2, W_3, b_2$ of the Critic evaluation network and the Actor evaluation network, $\nabla_W L$ is the gradient of the two loss functions $L_1, L_2$ with respect to $W$, $W'$ denotes the parameters of the Critic target network and the Actor target network, $\alpha$ is a training factor with value range $[0, 1)$, and $\tau$ is a coefficient controlling the influence of the old target-network parameters $W'$ and the evaluation-network parameters $W$ on the target network.
Note that m is a natural number. For the loss function $L_2$ of the Critic network, the discount coefficient γ may take values such as 0.95, 0.98 or 0.999; it mainly controls the influence of the Q value at time t+1 on the action at the current time t. In this embodiment, γ = 0.95.
In the embodiment of the invention, the parameters of the Actor evaluation network and the Critic evaluation network are updated using the back-propagation algorithm. The training factor α may take values such as 0.5, 0.1 or 0.01; it mainly controls the training speed and effect of the neural network model, and is preferably 0.01.
In the embodiment of the invention, each time the parameters of the Critic evaluation network and the Actor evaluation network have been updated k times, the soft-update algorithm is used to update the parameters of the Critic target network and the Actor target network. Unlike other DRL algorithms, which directly copy the evaluation network's parameters to the target network, the soft-update algorithm blends the old target-network parameters with the new evaluation-network parameters, so that training yields a more accurate neural network training model. A compact sketch of one such update step follows.
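A compact sketch of one update step matching the reconstructed losses $L_1$ and $L_2$ above, with γ = 0.95, α = 0.01 and an illustrative τ (network sizes are assumptions):

```python
# One DDPG-style update step (a sketch; sizes and tau are illustrative).
import copy
import torch
import torch.nn as nn

S_DIM, A_DIM = 80, 400                        # N*F and N*N for N=20, F=4
actor = nn.Sequential(nn.Linear(S_DIM, 128), nn.ReLU(), nn.Linear(128, A_DIM))
critic = nn.Sequential(nn.Linear(S_DIM + A_DIM, 128), nn.ReLU(), nn.Linear(128, 1))
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)

opt_a = torch.optim.SGD(actor.parameters(), lr=0.01)    # alpha = 0.01
opt_c = torch.optim.SGD(critic.parameters(), lr=0.01)
GAMMA, TAU = 0.95, 0.005

def update(S, A, S1, R):
    """Train on a sampled batch of (S_t, A_t, S_{t+1}, R_t), flattened per row."""
    with torch.no_grad():                     # TD target uses the target networks
        q_next = critic_t(torch.cat([S1, actor_t(S1)], dim=-1))
        target = R + GAMMA * q_next
    L2 = ((target - critic(torch.cat([S, A], dim=-1))) ** 2).mean()
    opt_c.zero_grad(); L2.backward(); opt_c.step()        # back-propagation

    L1 = -critic(torch.cat([S, actor(S)], dim=-1)).mean() # maximize Q
    opt_a.zero_grad(); L1.backward(); opt_a.step()

    for net, tgt in ((actor, actor_t), (critic, critic_t)):
        for w, w_t in zip(net.parameters(), tgt.parameters()):
            w_t.data.copy_(TAU * w.data + (1 - TAU) * w_t.data)   # soft update
```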
Fig. 3 is a framework diagram of the alliance chain of an energy trading method based on deep reinforcement learning and an alliance chain according to an embodiment of the present invention.
In an embodiment of the present invention, obtaining the matched trades between buyers and sellers in the energy trading field specifically further includes: at a given moment, inputting the state matrix formed from the transaction characteristics of the buyers and sellers that need to trade into the neural network training model, and the neural network training model outputs the action matrix of the trade.
It should be noted that in the energy transaction method based on deep reinforcement learning and the alliance chain, the final transaction is realized through the alliance chain in order to guarantee the privacy and security of the transaction and to ensure that the transaction results are public, transparent and tamper-proof. As shown in fig. 3, the alliance-chain model is composed of the electric vehicles and the energy trading field (i.e., the parking lot). After an electric vehicle enters the parking lot, it initiates a registration application to the energy trading field and joins the alliance chain. At each moment, the electric vehicles upload their states to the energy trading field; the energy trading field collects the states of all charging vehicles and inputs them into the neural network training model to obtain the optimal action, namely the buyer-seller pairings and the specific traded electric quantities, and then returns the result and the corresponding electronic-wallet addresses to the electric vehicles of both parties. The trading electric vehicles transfer electricity through the charging/discharging piles in the parking lot according to the obtained matching result. At the next moment, the buyer electric vehicle transfers money to the electronic wallet of the seller electric vehicle, and the seller electric vehicle obtains the payment for the electricity actually sold from its electronic wallet. Finally, at regular intervals, the energy trading fields of the different parking lots package the transaction records of that interval into a block of the alliance chain, take turns acting as the bookkeeper, and append the block to the end of the alliance chain's blockchain after the other energy trading fields confirm it by signature. The energy trading field is also referred to as the intermediary. A schematic sketch of this bookkeeping follows.
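A schematic sketch of the rotating-bookkeeper packaging described above; the record fields and the use of a bare SHA-256 hash in place of real signatures are simplifications:

```python
# Simplified sketch of alliance-chain bookkeeping (fields are assumptions;
# real deployments would use signatures from the preselected bookkeepers).
import hashlib
import json
import time

def pack_block(records, prev_hash, bookkeeper):
    """A parking lot acting as bookkeeper packs an interval's trade records."""
    body = {
        "timestamp": time.time(),
        "bookkeeper": bookkeeper,            # parking lots take turns
        "records": records,                  # [(buyer, seller, kWh, price), ...]
        "prev_hash": prev_hash,              # links the block to the chain's tail
    }
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": digest}

chain = [pack_block([], "0" * 64, "lot-A")]              # genesis block
trades = [("EV-buyer-1", "EV-seller-2", 6.0, 0.40)]
chain.append(pack_block(trades, chain[-1]["hash"], "lot-B"))
```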
In the embodiment of the invention, the energy trading method based on deep reinforcement learning and the alliance chain handles changes in the electric-vehicle P2P electricity trading scenario through the state, action, state transfer function and reward function of a deep reinforcement learning neural network model, so that the state matrix $S_t$, action matrix $A_t$ and reward matrix $R_t$ in the obtained neural network training model conform to the P2P electricity trading of charging electric vehicles in the parking lot.
Example two:
fig. 4 is a block diagram of an energy transaction apparatus based on deep reinforcement learning and federation chain according to an embodiment of the present invention.
As shown in fig. 4, an embodiment of the present invention further provides an energy trading device based on deep reinforcement learning and an alliance chain, comprising a data acquisition module 10, a first processing module 20, a second processing module 30, a training module 40 and an output module 50:
the data acquisition module 10 is used for collecting transaction characteristics of the energy trading field and forming them into state vectors, the N state vectors in the energy trading field at time t forming a first state matrix; the transaction characteristics comprise the remaining time the electric vehicle stays parked in the energy trading field, a buy/sell label, the transaction energy and the transaction price;
the first processing module 20 is used for inputting the first state matrix into a deep reinforcement learning neural network model and outputting an action matrix;
the second processing module 30 is used for computing a second state matrix and a reward matrix at time t+1 from the action matrix and the first state matrix through a state transfer function and a reward function; the first state matrix, the action matrix, the second state matrix and the reward matrix form a training tuple, which is stored in the replay pool of the neural network model;
the training module 40 is used for sampling m training tuples from the replay pool of the neural network model every Δt interval to train the neural network model until the loss function of the neural network model converges or the maximum number of iterations is reached, obtaining a trained neural network training model;
and the output module 50 is used for inputting, in the energy trading field, the state matrix formed from the transaction characteristics of the buyers and sellers that need to trade into the neural network training model to obtain the energy traded between buyers and sellers.
It should be noted that, the modules in the apparatus according to the second embodiment correspond to the steps in the method according to the first embodiment, the steps of the method have been described in detail in the first embodiment, and the contents of the modules are not described in detail in the second embodiment.
Example three:
embodiments of the present invention provide a computer-readable storage medium for storing computer instructions that, when executed on a computer, cause the computer to perform the above-described deep reinforcement learning and federation chain-based energy trading method.
Example four:
the embodiment of the invention provides terminal equipment, which comprises a processor and a memory;
a memory for storing the program code and transmitting the program code to the processor;
and the processor is used for executing the energy trading method based on the deep reinforcement learning and the alliance chain according to the instructions in the program codes.
It should be noted that the processor is configured to execute, according to the instructions in the program code, the steps of the above-described embodiments of the energy trading method based on deep reinforcement learning and an alliance chain. Alternatively, when executing the computer program, the processor implements the functions of each module/unit in the system/apparatus embodiments described above.
Illustratively, a computer program may be partitioned into one or more modules/units, which are stored in a memory and executed by a processor to accomplish the present application. One or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of a computer program in a terminal device.
The terminal device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device, and may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that this description of the terminal device is not limiting: it may include more or fewer components than those described, combine certain components, or use different components; for example, the terminal device may also include input/output devices, network access devices, buses, etc.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may be an internal storage unit of the terminal device, such as a hard disk or memory of the terminal device. The memory may also be an external storage device of the terminal device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the terminal device. Further, the memory may include both an internal storage unit and an external storage device of the terminal device. The memory is used to store the computer program and other programs and data required by the terminal device, and may also be used to temporarily store data that has been or will be output.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An energy trading method based on deep reinforcement learning and alliance chain, applied to electricity trading of electric vehicles, characterized by comprising the following steps:
s10, collecting transaction characteristics of an energy transaction field, forming the transaction characteristics into a state vector, and forming a first state matrix by N state vectors in the energy transaction field at the moment t; the transaction characteristics comprise the time of the electric automobile remaining stopped in an energy transaction field, a trading label, transaction energy and a transaction price;
s20, inputting the first state matrix into a deep reinforcement learning neural network model, and outputting an action matrix;
s30, calculating the action matrix and the first state matrix through a state transfer function and a reward function to obtain a second state matrix and a reward matrix at the moment of t + 1; the first state matrix, the action matrix, the second state matrix and the reward matrix form a training matrix, and the training matrix is stored in a playback pool of the neural network model;
s40, acquiring m pieces of data of the training matrix from a playback pool of the neural network model every delta t moment to train the neural network model until a loss function of the neural network model converges or iterates to the maximum number of times, and obtaining a trained neural network training model;
and S50, in the energy trading venue, inputting a state matrix formed from the transaction characteristics of the buyers and sellers into the trained neural network model to obtain the energy traded between the buyers and sellers.
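Read together, S30 and S40 form a standard experience-replay training loop. The sketch below shows the Δt-periodic sampling of m stored training matrices described in S40; the names are assumed, and model.fit_batch stands in for whatever routine updates the networks and reports the current loss.

    def train_with_replay(pool, model, m, delta_t, max_iters, tol=1e-4):
        """Sketch of S40: every delta_t steps, draw m training matrices from
        the replay pool and train until the loss converges or the iteration
        cap is reached. `pool` follows the ReplayPool sketch shown earlier."""
        prev_loss = float("inf")
        for it in range(max_iters):
            if it % delta_t == 0 and len(pool.data) >= m:
                loss = model.fit_batch(pool.sample(m))
                if abs(prev_loss - loss) < tol:      # convergence criterion
                    return model
                prev_loss = loss
        return model                                  # hit the iteration cap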
2. The energy trading method based on deep reinforcement learning and alliance chain of claim 1, further comprising, before storing the training matrix into the replay pool of the neural network model: performing abnormal-value processing on the first state matrix, the action matrix, and the second state matrix;
deleting values of the first state matrix and the second state matrix that fall outside a preset value range and replacing them with 0;
and deleting elements of the action matrix for which the buyer's price is less than the seller's price and replacing them with 0.
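As a concrete reading of claim 2, the sketch below zeroes out-of-range state entries and infeasible trades. The sign convention (label +1 for a buyer, -1 for a seller) and the array layout are assumptions made for illustration only.

    import numpy as np

    def clean_states(S, low, high):
        """Replace state entries outside the preset range [low, high] with 0."""
        S = S.copy()
        S[(S < low) | (S > high)] = 0.0
        return S

    def clean_actions(A, prices, labels):
        """Zero action elements where the buyer's price is below the seller's.
        prices[i] is vehicle i's quoted price; labels[i] is +1 (buyer) or -1 (seller)."""
        A = A.copy()
        n = A.shape[0]
        for i in range(n):
            for j in range(n):
                buyer_seller = labels[i] > 0 > labels[j]
                if buyer_seller and prices[i] < prices[j]:
                    A[i, j] = 0.0        # buyer i offers less than seller j asks
        return A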
3. The energy trading method based on deep reinforcement learning and alliance chain of claim 1, wherein outputting the action matrix specifically comprises: cutting the vector output by the neural network model into N vectors of N elements each, the N vectors constituting the N×N action matrix; wherein each vector represents the energy that one electric vehicle trades with the other N−1 vehicles.
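The slicing in claim 3 is a plain reshape of the actor's flat output. A sketch follows; zeroing the diagonal (no self-trading) is an added assumption, consistent with each vehicle trading with the other N−1 vehicles.

    import numpy as np

    def to_action_matrix(flat_output, n):
        """Cut a length-n*n actor output into n vectors of n elements each,
        forming the n-by-n action matrix; row i is vehicle i's trades."""
        A = np.asarray(flat_output).reshape(n, n)
        np.fill_diagonal(A, 0.0)   # assumed: a vehicle does not trade with itself
        return A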
4. The energy trading method based on deep reinforcement learning and alliance chain of claim 1, wherein the state transition function f(S_t, A_t) yields, for each electric vehicle i, the second state vector:

S_i^{t+1} = [T_i^{t+1}, E_i^{t+1}, P_i^{t+1}, L_i^{t+1}]

where T_i^{t+1}, E_i^{t+1}, P_i^{t+1}, and L_i^{t+1} respectively denote the remaining parking time, the transaction electricity, the transaction price, and the buy/sell label of electric vehicle i at time t+1;

the transaction electricity required by electric vehicle i at time t+1 is expressed as:

E_i^{t+1} = E_i^t − L_i^t · Σ_{j=1}^{N} a_{ij}^t

where E_i^t is the transaction electricity required by electric vehicle i at time t, L_i^t is the buy/sell label of electric vehicle i at time t, and a_{ij}^t is the element of the action matrix, i.e., the energy purchased by electric vehicle i from electric vehicle j at time t;

the transaction price quoted by electric vehicle i at time t+1 is expressed as:

P_i^{t+1} = x,  x ~ N(μ_1, σ_1²)

where μ_1 and σ_1² are respectively the mean and variance of the normal distribution satisfied by the variable x;

when E_i^{t+1} ≠ 0, the remaining parking time and buy/sell label of electric vehicle i at time t+1 are expressed as:

T_i^{t+1} = T_i^t − 1
L_i^{t+1} = L_i^t

where T_i^t is the remaining parking time of electric vehicle i at time t;

when E_i^{t+1} = 0, the remaining parking time, transaction electricity, and buy/sell label of electric vehicle i at time t+1 are expressed as:

T_i^{t+1} ~ N(μ_2, σ_2²)
E_i^{t+1} ~ N(μ_3, σ_3²)
L_i^{t+1} = sign(E_i^{t+1})

where μ_2 and σ_2² are respectively the mean and variance of the normal distribution satisfied by the stay time of an electric vehicle in the energy trading venue, and μ_3 and σ_3² are respectively the mean and variance of the normal distribution satisfied by the energy that an electric vehicle needs to trade.
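The reconstructed transition above can be written compactly in vectorized form. In the sketch below, the departure test (required energy reaching zero) and the sign-based label reset mirror the reconstruction and should be read as assumptions where the original formula images are ambiguous.

    import numpy as np

    rng = np.random.default_rng()

    def transition(T, E, P, L, A, mu, sigma):
        """Sketch of the claim-4 transition for one step. T, E, P, L are
        length-N arrays (remaining time, required energy, price, label);
        A is the N-by-N action matrix; mu = (mu1, mu2, mu3) and
        sigma = (sigma1, sigma2, sigma3) parameterize the normal draws."""
        traded = A.sum(axis=1)                     # sum_j a_ij for each vehicle
        E_next = E - L * traded
        P_next = rng.normal(mu[0], sigma[0], size=E.shape)  # fresh price quote
        T_next = T - 1.0
        L_next = L.copy()
        done = np.isclose(E_next, 0.0)             # vehicle finished trading
        n_done = int(done.sum())
        T_next[done] = rng.normal(mu[1], sigma[1], size=n_done)  # new stay time
        E_next[done] = rng.normal(mu[2], sigma[2], size=n_done)  # new demand
        L_next[done] = np.sign(E_next[done])       # assumed label for new arrival
        return T_next, E_next, P_next, L_next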
5. The energy trading method based on deep reinforcement learning and alliance chain of claim 1, wherein the reward matrix R_t is composed of the rewards of the N electric vehicles at time t:

R_t = [r_1^t, r_2^t, …, r_N^t]

where r_i^t, the reward of electric vehicle i at time t, is given by a formula published only as an image in the original and not reproduced here, in which k_1, k_2, k_3, k_4 are constants, s_t is the penalty factor at time t, a_{ij}^t is the element of the action matrix, i.e., the energy purchased by electric vehicle i from electric vehicle j at time t, E_i^t is the energy required by electric vehicle i at time t, P_i^t is the energy price quoted by electric vehicle i at time t, and L_i^t is the value of the buy/sell label of electric vehicle i at time t.
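Because the reward formula survives only as an image, any executable version must guess its shape. The sketch below assumes a common form for peer-to-peer trading rewards, trading revenue minus an unmet-demand term and a penalty term, and maps k_1 through k_4 onto those assumed terms purely for illustration; it is not the patent's formula.

    import numpy as np

    def reward(E, P, L, A, s_t, k1=1.0, k2=1.0, k3=1.0, k4=0.0):
        """Illustrative per-vehicle reward using the symbols of claim 5.
        The combination of terms is an assumption, not the patented formula."""
        traded = A.sum(axis=1)                  # sum_j a_ij per vehicle
        revenue = k1 * L * P * traded           # trading gain (sign set by the label)
        unmet = k2 * np.abs(E - L * traded)     # distance from the required energy
        penalty = k3 * s_t + k4                 # penalty factor and constant offset
        return revenue - unmet - penalty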
6. The energy trading method based on deep reinforcement learning and alliance chain of claim 1, wherein training the neural network model specifically comprises: iteratively updating the parameters of a Critic network and an Actor network until the loss function converges or the maximum number of iterations is reached, wherein the Critic network comprises a Critic evaluation network and a Critic target network, and the Actor network comprises an Actor evaluation network and an Actor target network;

wherein the loss function of the Critic network is:

L_2 = (1/m) · Σ_{k=1}^{m} (r_k + γ·q'_k − q_k)²

and the loss function of the Actor network is:

L_1 = −(1/m) · Σ_{k=1}^{m} q_k

where L_1 is the loss of the Actor network; L_2 is the loss of the Critic network; γ is the discount coefficient; r_k is the reward of sample k; q_k is the output of the Critic evaluation network, representing the Q value corresponding to sample k; q'_k is the output of the Critic target network, representing the Q value corresponding to the next moment of sample k; and k ∈ {1, 2, …, m}. The expressions for q_k and q'_k are:

q_k = ReLU(W_2·S_k + W_3·A_k + b_2)
q'_k = ReLU(W'_2·S_{k+1} + W'_3·μ'(S_{k+1}) + b'_2)

where W_2 and W_3 are both weight matrices of the output layer of the Critic evaluation network, b_2 is the bias vector of the output layer of the Critic evaluation network, W'_2 and W'_3 are both weight matrices of the output layer of the Critic target network, b'_2 is the bias vector of the output layer of the Critic target network, and μ'(S_{k+1}) is the output obtained by feeding S_{k+1} into the Actor target network, representing the optimal action matrix corresponding to state S_{k+1};

the back-propagation expression for iteratively updating the parameters of the Actor evaluation network and the Critic evaluation network is:

W ← W − α·(∂L/∂W)

and the soft-update expression for iteratively updating the parameters of the Actor target network and the Critic target network is:

W' ← τ·W + (1 − τ)·W'

where W denotes the parameters W_2, W_3, b_2 of the Critic evaluation network and the Actor evaluation network, ∂L/∂W denotes the gradient of the two loss functions L_1, L_2 with respect to W, W' denotes the parameters of the Critic target network and the Actor target network, α is the training factor with value range (0, 1), and τ is the coefficient controlling how much the old target-network parameters W' and the evaluation-network parameters W each contribute to the updated target network.
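A compact PyTorch sketch of the claim-6 update follows. It assumes four nn.Module networks whose forward signatures match q_k = critic(S, A) and μ'(S) = actor(S) above, plus two optimizers; these objects, and the batch layout, are assumptions of the sketch rather than details fixed by the claim.

    import torch
    import torch.nn.functional as F

    def ddpg_step(batch, actor, actor_t, critic, critic_t,
                  opt_actor, opt_critic, gamma=0.99, tau=0.005):
        """One update matching the claim-6 losses and soft update."""
        S, A, S1, R = batch                       # states, actions, next states, rewards

        # Critic loss L2: mean squared TD error against the target networks.
        with torch.no_grad():
            q_next = critic_t(S1, actor_t(S1))    # q'_k using mu'(S_{k+1})
            y = R + gamma * q_next                # TD target r_k + gamma * q'_k
        loss_critic = F.mse_loss(critic(S, A), y)
        opt_critic.zero_grad()
        loss_critic.backward()
        opt_critic.step()                         # W <- W - alpha * dL2/dW

        # Actor loss L1: negated mean Q value of the actor's own actions.
        loss_actor = -critic(S, actor(S)).mean()
        opt_actor.zero_grad()
        loss_actor.backward()
        opt_actor.step()                          # W <- W - alpha * dL1/dW

        # Soft update W' <- tau * W + (1 - tau) * W' for both target networks.
        for net, net_t in ((actor, actor_t), (critic, critic_t)):
            for w, w_t in zip(net.parameters(), net_t.parameters()):
                w_t.data.mul_(1.0 - tau).add_(tau * w.data)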
7. The energy trading method based on deep reinforcement learning and alliance chain of claim 1, wherein obtaining a matched transaction between buyers and sellers in the energy trading venue further comprises: at a given moment, inputting a state matrix formed from the transaction characteristics of the buyers and sellers that need to trade into the trained neural network model, which outputs the action matrix of the transaction.
8. An energy trading device based on deep reinforcement learning and alliance chain, characterized by comprising a data acquisition module, a first processing module, a second processing module, a training module, and an output module, wherein:
the data acquisition module is configured to collect transaction characteristics from the energy trading venue and form them into state vectors; at time t, the N state vectors in the energy trading venue form a first state matrix; the transaction characteristics comprise the remaining parking time of the electric vehicle in the energy trading venue, a buy/sell label, the transaction energy, and the transaction price;
the first processing module is configured to input the first state matrix into a deep reinforcement learning neural network model and output an action matrix;
the second processing module is configured to compute, from the action matrix and the first state matrix, a second state matrix and a reward matrix at time t+1 through a state transition function and a reward function; the first state matrix, the action matrix, the second state matrix, and the reward matrix form a training matrix, which is stored in the replay pool of the neural network model;
the training module is configured to draw m training-matrix samples from the replay pool of the neural network model every Δt interval to train the neural network model until its loss function converges or the maximum number of iterations is reached, obtaining a trained neural network model;
and the output module is configured to input a state matrix formed from the transaction characteristics of the buyers and sellers in the energy trading venue into the trained neural network model to obtain the energy traded between the buyers and sellers.
9. A computer-readable storage medium for storing computer instructions that, when executed on a computer, cause the computer to perform the energy trading method based on deep reinforcement learning and alliance chain of any one of claims 1 to 7.
10. A terminal device comprising a processor and a memory;
the memory is configured to store program code and transmit it to the processor;
and the processor is configured to execute the energy trading method based on deep reinforcement learning and alliance chain of any one of claims 1 to 7 according to instructions in the program code.
CN202011420188.7A 2020-12-07 2020-12-07 Energy transaction method, device and equipment based on deep reinforcement learning and alliance chain Active CN112419064B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011420188.7A CN112419064B (en) 2020-12-07 2020-12-07 Energy transaction method, device and equipment based on deep reinforcement learning and alliance chain

Publications (2)

Publication Number Publication Date
CN112419064A 2021-02-26
CN112419064B CN112419064B (en) 2022-02-08

Family

ID=74775865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011420188.7A Active CN112419064B (en) 2020-12-07 2020-12-07 Energy transaction method, device and equipment based on deep reinforcement learning and alliance chain

Country Status (1)

Country Link
CN (1) CN112419064B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423978A (en) * 2017-06-16 2017-12-01 郑州大学 A kind of distributed energy business confirmation method based on alliance's block chain
CN108038545A (en) * 2017-12-06 2018-05-15 湖北工业大学 Fast learning algorithm based on Actor-Critic neutral net continuous controls
CN108985940A (en) * 2018-07-18 2018-12-11 国网能源研究院有限公司 Power exchange management system and method between a kind of user based on block chain technology
CN109003082A (en) * 2018-07-24 2018-12-14 电子科技大学 PHEV power exchange system and its method of commerce based on alliance's block chain
US20200160411A1 (en) * 2018-11-16 2020-05-21 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for Optimal Joint Bidding and Pricing of Load Serving Entity
CN109784926A (en) * 2019-01-22 2019-05-21 华北电力大学(保定) A kind of virtual plant internal market method of commerce and system based on alliance's block chain
CN110378693A (en) * 2019-07-11 2019-10-25 合肥工业大学 Distributed energy weak center trade managing system based on alliance's block chain
CN110349027A (en) * 2019-07-19 2019-10-18 湘潭大学 Pairs trade system based on deeply study
CN110458443A (en) * 2019-08-07 2019-11-15 南京邮电大学 A kind of wisdom home energy management method and system based on deeply study
CN111815369A (en) * 2020-07-31 2020-10-23 上海交通大学 Multi-energy system energy scheduling method based on deep reinforcement learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JIAWEN KANG ET AL.: "Enabling Localized Peer-to-Peer Electricity Trading Among Plug-in Hybrid Electric Vehicles Using Consortium Blockchains", IEEE Transactions on Industrial Informatics *
TIMOTHY P. LILLICRAP ET AL.: "Continuous Control with Deep Reinforcement Learning", arXiv *
YANG LI ET AL.: "Deep Robust Reinforcement Learning for Practical Algorithmic Trading", IEEE Access *
LIU Jianwei et al.: "A Survey of Deep Reinforcement Learning Based on Value Function and Policy Gradient", Chinese Journal of Computers *
QI Yue et al.: "Portfolio Management Based on the Deep Reinforcement Learning DDPG Algorithm", Computer and Modernization *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230063075A1 (en) * 2021-07-27 2023-03-02 Tata Consultancy Services Limited Method and system to generate pricing for charging electric vehicles
CN114202229A (en) * 2021-12-20 2022-03-18 南方电网数字电网研究院有限公司 Method and device for determining energy management strategy, computer equipment and storage medium
CN114202229B (en) * 2021-12-20 2023-06-30 南方电网数字电网研究院有限公司 Determining method of energy management strategy of micro-grid based on deep reinforcement learning
CN117078347A (en) * 2023-08-28 2023-11-17 合肥工业大学 Electric-carbon integrated transaction method based on alliance chain

Also Published As

Publication number Publication date
CN112419064B (en) 2022-02-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230407

Address after: Room 601, Building B1, 136 Kaiyuan Avenue, Huangpu District, Guangzhou City, Guangdong Province, 510000

Patentee after: Guangzhou Huihui Intelligent Technology Co.,Ltd.

Address before: 510275 No. 135 West Xingang Road, Guangdong, Guangzhou

Patentee before: SUN YAT-SEN University