CN112419064A - Energy transaction method, device and equipment based on deep reinforcement learning and alliance chain - Google Patents
- Publication number: CN112419064A (application number CN202011420188.7A)
- Authority: CN (China)
- Prior art keywords: matrix, transaction, energy, network, neural network
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06Q40/04—Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
- G06F21/64—Protecting data integrity, e.g. using checksums, certificates or signatures
- G06N3/045—Combinations of networks
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06Q20/065—Private payment circuits, e.g. involving electronic currency used among participants of a common payment scheme, using e-cash
- G06Q20/10—Payment architectures specially adapted for electronic funds transfer [EFT] systems; specially adapted for home banking systems
- G06Q50/06—Energy or water supply
Abstract
The invention relates to an energy transaction method, device and equipment based on deep reinforcement learning and an alliance (consortium) chain. A first state matrix is formed by collecting the N state vectors that affect buyers and sellers in an energy trading venue; the state matrix is processed and analysed by a neural network model to obtain an action matrix, a second state matrix and a reward matrix; and the neural network model is trained on the first state matrix, the action matrix, the second state matrix and the reward matrix to obtain a neural network training model. Applied to P2P electricity trading between electric vehicles on the basis of this training model and the alliance chain, the energy transaction method maximizes the long-term income of the electric vehicles participating in the transactions, and the introduced alliance chain ensures the privacy and safety of electric vehicle electricity trading, thereby solving the technical problem of how to enable buyers and sellers to obtain the maximum long-term benefit in alliance-chain-based P2P electricity trading.
Description
Technical Field
The invention relates to the technical field of the Internet of Vehicles, and in particular to an energy transaction method, device and equipment based on deep reinforcement learning and an alliance chain.
Background
With the continuing electrification of automobiles, new carmaking forces in China have risen to prominence, traditional automobile enterprises are shifting to new energy, and user acceptance of electric vehicles is gradually increasing. However, the growing number of electric vehicles poses no small challenge for large-scale charging. First, there is a shortage of charging stations for electric vehicles; in addition, electric vehicles are usually charged at night, and too many vehicles charging in the same period easily causes excessive power loss, voltage drops and overload in the supply grid.
To address the above problems, a mechanism for peer-to-peer (P2P) electricity trading between electric vehicles has been proposed. In P2P electricity trading, the participating electric vehicles are regarded as "prosumers": according to its own condition, a vehicle can directly purchase the electricity it needs from other electric vehicles or sell its surplus electricity, and after the buyer and seller negotiate and agree on a price, the electricity is transferred through the smart grid. The P2P electricity trading mechanism can not only relieve the grid load at peak consumption times, but also reduce the costs of the electricity-buying vehicle and increase the income of the electricity-selling vehicle. However, P2P electricity trading also exposes participants to privacy disclosure. Because a blockchain is a distributed shared ledger and database, it has the characteristics of decentralization, openness, independence, security, anonymity and so on. Blockchain-based P2P electricity trading has therefore emerged: it not only resolves the information asymmetry between buyer and seller and creates a trustworthy trading environment for the participants, but also allows participants to trade anonymously, protecting traders' privacy to the greatest extent. However, the security of blockchains is built on heavy computation, while electric vehicles generally lack sufficient computing power, so many researchers instead choose an efficient, scalable alliance chain.
An alliance chain is a special kind of blockchain. Unlike the consensus mechanism of a public blockchain, in which every node is a bookkeeper, only a number of preselected authoritative nodes in an alliance chain serve as bookkeepers, collecting and managing local transaction records at moderate cost; the other access nodes may participate in transactions without taking part in bookkeeping, so no consensus mechanism that consumes significant computing power and extra time is required.
Currently, alliance-chain-based P2P electricity trading mechanisms for charging vehicles only discuss maximizing the current income of the charging vehicles or of a parking lot, without taking the long-term income of the charging vehicles into account. In real life, the number of vehicles in a parking lot changes dynamically: if a seller vehicle sells its surplus electricity at the current moment, it may lose the opportunity to trade with a higher-bidding vehicle at a later moment and thus forgo a higher profit. Therefore, in alliance-chain-based P2P electricity trading, how to provide charging vehicles with the buyer-seller matching strategy that maximizes long-term income is a problem to be solved urgently.
Disclosure of Invention
The embodiment of the invention provides an energy transaction method, device and equipment based on deep reinforcement learning and an alliance chain, used to solve the technical problem of how to enable buyers and sellers to obtain the maximum long-term benefit in alliance-chain-based P2P electricity trading.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
an energy transaction method based on deep reinforcement learning and an alliance chain, applied to electricity trading between electric vehicles, comprises the following steps:
S10, collecting transaction characteristics of an energy trading venue, forming the transaction characteristics into state vectors, and forming a first state matrix from the N state vectors in the venue at time t; the transaction characteristics comprise the remaining parking time of each electric vehicle in the venue, its buy/sell label, its transaction energy and its transaction price;
S20, inputting the first state matrix into a deep reinforcement learning neural network model and outputting an action matrix;
S30, computing a second state matrix and a reward matrix for time t+1 from the action matrix and the first state matrix through a state transfer function and a reward function; the first state matrix, the action matrix, the second state matrix and the reward matrix form a training matrix, which is stored in the replay pool of the neural network model;
S40, every Δt, sampling m training matrices from the replay pool to train the neural network model until its loss function converges or the maximum number of iterations is reached, obtaining a trained neural network training model;
and S50, in the energy trading venue, inputting the state matrix formed from the transaction characteristics of the buyers and sellers into the neural network training model to obtain the energy each buyer and seller trades.
Preferably, before storing the training matrix in the replay pool of the neural network model, the method further includes processing abnormal values of the first state matrix, the action matrix and the second state matrix:
deleting values of the first state matrix and the second state matrix that fall outside the preset value range and replacing them with 0;
and deleting the elements of the action matrix for which the buyer's price is less than the seller's price and replacing them with 0.
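The outlier handling described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `clean_state` and `clean_actions` are hypothetical helper names, and the buyer/seller label convention (buyer = -1, seller = +1) follows the convention stated elsewhere in this document.

```python
import numpy as np

def clean_state(S, lo, hi):
    """Zero out state-matrix entries outside the preset value range [lo, hi]."""
    S = S.copy()
    S[(S < lo) | (S > hi)] = 0.0
    return S

def clean_actions(A, prices, labels):
    """Zero action elements a_ij where a buyer's price is below the seller's.

    A[i, j] is the energy vehicle i buys from vehicle j; labels hold -1
    (buyer) and +1 (seller), matching the patent's buy/sell labels.
    """
    A = A.copy()
    n = A.shape[0]
    for i in range(n):
        for j in range(n):
            if labels[i] == -1 and labels[j] == 1 and prices[i] < prices[j]:
                A[i, j] = 0.0
    return A
```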
Preferably, outputting the action matrix specifically includes: cutting the vector output by the neural network model into N vectors, each comprising N elements, the N vectors constituting the N×N action matrix; each vector represents the energy one electric vehicle trades with each of the other N−1 vehicles.
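As an illustration of this cutting step, the sketch below reshapes a flat network output of N·N values into the N×N action matrix; `vector_to_action_matrix` is a hypothetical helper name, and zeroing the diagonal is an added assumption (a vehicle presumably does not trade with itself), not a detail stated in the source.

```python
import numpy as np

def vector_to_action_matrix(v, n):
    """Cut a flat output of n*n values into n vectors of n elements each,
    stacked as the n-by-n action matrix. Row i holds the energy vehicle i
    trades with each other vehicle; the diagonal is zeroed (assumption)."""
    A = np.asarray(v, dtype=float).reshape(n, n)
    np.fill_diagonal(A, 0.0)  # assumed: no self-trading
    return A
```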
Preferably, the state transfer function f(S_t, A_t) maps the first state matrix and the action matrix to the second state matrix, whose row s_i^{t+1} = (δ_i^{t+1}, e_i^{t+1}, p_i^{t+1}, z_i^{t+1}) is the second state vector of electric vehicle i; its components respectively denote the remaining parking time, the transaction electric quantity, the transaction price and the buy/sell label of electric vehicle i at time t+1;
the expression of the transaction electric quantity required by the electric automobile i at the moment t +1 is as follows:
in the formula (I), the compound is shown in the specification,the transaction power amount required by the electric automobile i at the moment t,for the buying and selling label of the electric automobile i at the time t,energy purchased by the electric automobile i to the electric automobile j at the moment t for the elements in the action matrix;
in the formula (I), the compound is shown in the specification,respectively, the mean and variance of the normal distribution satisfied by the variable x;
When the vehicle remains in the trading venue at time t+1, the remaining parking time and the buy/sell label of electric vehicle i at time t+1 are expressed in terms of δ_i^t, the remaining parking time of electric vehicle i at time t;
when the vehicle's parking time has elapsed and its place is taken by a new arrival, the remaining parking time, the transaction electric quantity and the buy/sell label of electric vehicle i at time t+1 are drawn from normal distributions, where μ₂ and σ₂² are respectively the mean and variance of the normal distribution satisfied by the stay time of an electric vehicle in the energy trading venue, and μ₃ and σ₃² are respectively the mean and variance of the normal distribution satisfied by the energy the electric vehicle needs to trade.
Preferably, the reward matrix R_t collects the per-vehicle rewards, where r_i^t is the reward of electric vehicle i at time t, k1, k2, k3 and k4 are constants, s_t is the penalty factor at time t, a_ij^t, an element of the action matrix, is the energy purchased by electric vehicle i from electric vehicle j at time t, e_i^t is the energy required by electric vehicle i at time t, p_i^t is the energy price of electric vehicle i at time t, and z_i^t is the buy/sell label value of electric vehicle i at time t.
Preferably, training the neural network model specifically includes: iteratively updating the parameters of the Critic network and the Actor network until the loss function converges or the maximum number of iterations is reached, wherein the Critic network comprises a Critic evaluation network and a Critic target network, and the Actor network comprises an Actor evaluation network and an Actor target network;
wherein the loss function of the Critic network is:

L2 = (1/m) Σ_{k=1}^{m} (r_k + γ·q'_k − q_k)²

and the loss function of the Actor network is:

L1 = −(1/m) Σ_{k=1}^{m} q_k

where L1 is the loss of the Actor network; L2 is the loss of the Critic network; γ is the discount coefficient; r_k is the reward of sample k; q_k, the output of the Critic evaluation network, is the Q value corresponding to sample k; q'_k, the output of the Critic target network, is the Q value corresponding to the next moment of sample k; and k ∈ {1, 2, …, m}. The expressions for q_k and q'_k are:

q_k = ReLU(W2·S_k + W3·A_k + b2)
q'_k = ReLU(W'2·S_{k+1} + W'3·μ'(S_{k+1}) + b'2)

where W2 and W3 are the weight matrices of the Critic evaluation network's output layer, b2 is the bias vector of the Critic evaluation network's output layer, W'2 and W'3 are the weight matrices of the Critic target network's output layer, b'2 is the bias vector of the Critic target network's output layer, and μ'(S_{k+1}) is the output obtained by feeding S_{k+1} into the Actor target network, representing the optimal action matrix corresponding to state S_{k+1};
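A minimal numeric sketch of the Critic output-layer expression q = ReLU(W2·S + W3·A + b2) described above, using NumPy; the small weight matrices and flattened state/action vectors here are illustrative, not values from the patent.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(x, 0.0)

def critic_q(S, A, W2, W3, b2):
    """Output layer of the Critic evaluation network:
    q = ReLU(W2 @ S + W3 @ A + b2),
    where S and A are flattened state and action vectors."""
    return relu(W2 @ S + W3 @ A + b2)
```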
the back propagation algorithm expression for iteratively updating the parameters of the Actor valuation network and the criticic valuation network is as follows:
the soft update algorithm expression for iteratively updating the parameters of the Actor valuation network and the criticic valuation network is as follows:
W′←τW+(1-τ)W′
wherein W is the parameter W of Critic evaluation network and Actor evaluation network2,W3,b2,For two loss functions L1,L2The gradient of (a) is that W 'represents parameters of a Critic target network and an Actor target network, alpha is a training factor and has a value range of [0,1 ], and tau is a coefficient for controlling the influence of an old parameter W' of the target network and a parameter W of an evaluation network on the target network.
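The two update rules above can be sketched directly; `gradient_step` and `soft_update` are illustrative names, not from the source.

```python
import numpy as np

def gradient_step(W, grad, alpha=0.001):
    """Back-propagation update for an evaluation-network parameter:
    W <- W - alpha * dL/dW, with training factor alpha in [0, 1)."""
    return W - alpha * grad

def soft_update(W_target, W_eval, tau=0.01):
    """Soft update for a target-network parameter:
    W' <- tau * W + (1 - tau) * W'."""
    return tau * W_eval + (1.0 - tau) * W_target
```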
Preferably, obtaining the matched transactions between buyers and sellers in the energy trading venue further includes: at a given moment, inputting the state matrix formed from the transaction characteristics of the buyers and sellers needing to trade into the neural network training model, which outputs the action matrix of the trades.
The invention also provides an energy transaction device based on deep reinforcement learning and an alliance chain, comprising a data acquisition module, a first processing module, a second processing module, a training module and an output module;
the data acquisition module is used for acquiring transaction characteristics of the energy trading venue and forming them into state vectors, the N state vectors in the venue at time t forming a first state matrix; the transaction characteristics comprise the remaining parking time of each electric vehicle in the venue, its buy/sell label, its transaction energy and its transaction price;
the first processing module is used for inputting the first state matrix into a deep reinforcement learning neural network model and outputting an action matrix;
the second processing module is used for computing a second state matrix and a reward matrix for time t+1 from the action matrix and the first state matrix through a state transfer function and a reward function; the first state matrix, the action matrix, the second state matrix and the reward matrix form a training matrix, which is stored in the replay pool of the neural network model;
the training module is used for sampling m training matrices from the replay pool every Δt to train the neural network model until its loss function converges or the maximum number of iterations is reached, obtaining a trained neural network training model;
and the output module is used for inputting the state matrix formed from the transaction characteristics of the buyers and sellers in the energy trading venue into the neural network training model to obtain the energy each buyer and seller trades.
The present invention also provides a computer-readable storage medium for storing computer instructions that, when executed on a computer, cause the computer to perform the deep reinforcement learning and federation chain based energy trading method described above.
The invention also provides terminal equipment, which comprises a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is used for executing the energy trading method based on the deep reinforcement learning and the alliance chain according to instructions in the program code.
According to the technical scheme, the embodiment of the invention has the following advantages. The energy trading method, device and equipment based on deep reinforcement learning and an alliance chain form a first state matrix by collecting the N state vectors that affect buyers and sellers in an energy trading venue; the state matrix is processed and analysed by the neural network model to obtain an action matrix, a second state matrix and a reward matrix; and the neural network model is trained on the first state matrix, the action matrix, the second state matrix and the reward matrix to obtain a neural network training model. Applied to P2P electricity trading between electric vehicles on the basis of this training model and the alliance chain, the method maximizes the long-term income of the electric vehicles participating in the transactions, and the introduced alliance chain ensures the privacy and safety of electric vehicle electricity trading, thereby solving the technical problem of how to enable buyers and sellers to obtain the maximum long-term benefit in alliance-chain-based P2P electricity trading.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a flowchart illustrating steps of a deep reinforcement learning and federation chain-based energy transaction method according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a neural network model of an energy trading method based on deep reinforcement learning and a federation chain according to an embodiment of the present invention.
Fig. 3 is a framework diagram of an energy trading method federation chain based on deep reinforcement learning and federation chain according to an embodiment of the present invention.
Fig. 4 is a block diagram of an energy transaction apparatus based on deep reinforcement learning and federation chain according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the application provides an energy transaction method, device and equipment based on deep reinforcement learning and an alliance chain, used to solve the technical problem of how to enable buyers and sellers to obtain the maximum long-term benefit in alliance-chain-based P2P electricity trading. In this embodiment, a parking lot serves as the energy trading venue, and the traded energy is the electric quantity exchanged between the electric vehicles in the parking lot.
The first embodiment is as follows:
fig. 1 is a flowchart illustrating steps of an energy transaction method based on deep reinforcement learning and a federation chain according to an embodiment of the present invention, and fig. 2 is a schematic structural diagram illustrating a neural network model of the energy transaction method based on deep reinforcement learning and a federation chain according to an embodiment of the present invention.
As shown in fig. 1 and fig. 2, an embodiment of the present invention provides an energy transaction method based on deep reinforcement learning and a federation chain, including the following steps:
S10, collecting transaction characteristics of the energy trading venue, forming them into state vectors, and forming a first state matrix from the N state vectors in the venue at time t; the transaction characteristics comprise the remaining parking time of each electric vehicle in the venue, its buy/sell label, its transaction energy and its transaction price;
S20, inputting the first state matrix into a deep reinforcement learning neural network model and outputting an action matrix;
S30, computing a second state matrix and a reward matrix for time t+1 from the action matrix and the first state matrix through a state transfer function and a reward function; the first state matrix, the action matrix, the second state matrix and the reward matrix form a training matrix, which is stored in the replay pool of the neural network model;
S40, every Δt, sampling m training matrices from the replay pool to train the neural network model until its loss function converges or the maximum number of iterations is reached, obtaining a trained neural network training model;
and S50, in the energy trading venue, inputting the state matrix formed from the transaction characteristics of the buyers and sellers into the neural network training model to obtain the energy each buyer and seller trades.
In step S10 of the embodiment of the present invention, the characteristics of the electric vehicles in the parking lot are collected, including the remaining parking time δ of each electric vehicle, its buy/sell label z (the z values corresponding to a buyer and a seller are −1 and +1, respectively), the electric quantity e to be traded, and the transaction price p. These characteristics are combined into the state vector of one electric vehicle, and at time t the state vectors of all N electric vehicles in the parking lot form the first state matrix S_t.
It should be noted that step S10 mainly acquires the factors affecting the long-term profit of both trading parties in the energy trading venue.
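A minimal sketch of assembling the first state matrix S_t from the per-vehicle features listed above; the dictionary keys and the helper name `build_state_matrix` are hypothetical, and the column order (δ, z, e, p) is one plausible layout rather than the patent's exact convention.

```python
import numpy as np

def build_state_matrix(vehicles):
    """Stack per-vehicle state vectors (remaining parking time delta,
    buy/sell label z in {-1, +1}, trade quantity e, trade price p) into
    the first state matrix S_t, one row per electric vehicle."""
    return np.array(
        [[v["delta"], v["z"], v["e"], v["p"]] for v in vehicles],
        dtype=float,
    )
```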
In steps S20 to S40 of the embodiment of the present invention, the features in the state matrix are processed and learned by the neural network model to obtain the action matrix; the action matrix and the first state matrix S_t are then processed by the state transfer function and the reward function to obtain the state matrix at time t+1 (namely the second state matrix S_{t+1}) and the reward matrix R_t. The first state matrix S_t, the action matrix A_t, the second state matrix S_{t+1} and the reward matrix R_t form a training matrix (S_t, A_t, S_{t+1}, R_t), which serves as training data for the neural network model of the energy trading venue. Training yields a neural network training model that can learn the dynamic change of the parking lot (energy trading venue) state and the optimal action matrix adapted to that change, so that the long-term income of the trading electric vehicles under dynamic change reaches its maximum.
It should be noted that, as shown in fig. 2, the neural network model includes a Critic evaluation network, a Critic target network, an Actor evaluation network, and an Actor target network. The Critic evaluation network and the Critic target network have the same structure and are mainly used to calculate the Q values of states and actions; the Actor evaluation network and the Actor target network have the same structure and are mainly responsible for selecting the action with the highest Q value according to the state and outputting the action matrix. The state data obtained by the neural network model is input into the Actor evaluation network, which outputs an action; the action output by the Actor evaluation network interacts with the environment to obtain the next state data and the reward data, and these data are stored. Once enough data has accumulated, the data are randomly sampled, and the network parameters of the Critic evaluation network, the Critic target network, the Actor evaluation network and the Actor target network are updated. The Critic evaluation network and the Actor evaluation network are updated with a gradient descent algorithm; the Critic target network and the Actor target network are updated with a soft-update algorithm.
In step S40 of the embodiment of the present invention, steps S10 to S30 are repeated to collect a sufficient amount of training data for training the neural network model, so as to improve the accuracy of the results output by the neural network training model.
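The replay pool of steps S30 and S40 — store the tuples (S_t, A_t, S_{t+1}, R_t), then periodically draw a mini-batch of m tuples at random — can be sketched as follows. The capacity and batch size are illustrative choices, not values from the patent:

```python
import random
from collections import deque

# Hedged sketch of the replay pool: a bounded buffer of training tuples with
# uniform random mini-batch sampling, as used in steps S30-S40.

class ReplayPool:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest tuples fall out automatically

    def store(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, m):
        # draw m tuples without replacement (fewer if the pool is still small)
        return random.sample(self.buffer, min(m, len(self.buffer)))

pool = ReplayPool()
for step in range(100):
    pool.store([step], [step], [step + 1], [1.0])
batch = pool.sample(32)
```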
In step S50 of the embodiment of the present invention, the obtained neural network training model is mainly combined with the alliance chain: the state matrix of the buyers and sellers that need to trade is input into the neural network training model, which outputs the energy to be traded between them; the buyer and the seller carry out the energy transaction according to this energy and the price, and then settle the currency transaction on the alliance chain, which ensures the privacy of their transaction.
According to the transaction energy output by the neural network training model, the seller electric vehicle transfers the corresponding electric quantity to the charging/discharging pile of the parking lot, completing the electric-quantity transfer of the transaction; the buyer electric vehicle then makes the currency payment to the seller vehicle on the alliance chain, and the seller vehicle signs and confirms after receiving the proceeds of the traded electric quantity, completing the transaction. At regular intervals, the different parking lots pack the transaction records of that interval into blocks, take turns acting as bookkeeper, and append each block to the end of the blockchain after it is signed and confirmed by the other parking lots. If the state of a charging vehicle in the parking lot changes at time t+1, the neural network training model acquires the new state data of the state matrix, and a new round of electric-vehicle electric-quantity trading begins.
The invention provides an energy trading method based on deep reinforcement learning and an alliance chain. N state vectors that influence buyers and sellers in the energy trading floor are collected to form a first state matrix; the state matrix is processed and analyzed in the neural network model to obtain an action matrix, a second state matrix and a reward matrix; and the neural network model is trained with the first state matrix, the action matrix, the second state matrix and the reward matrix to obtain a neural network training model. When the energy trading method based on the neural network training model and the alliance chain is applied to P2P electric-quantity trading of electric vehicles, the long-term profit of the electric vehicles participating in the trading is maximized, and the introduction of the alliance chain ensures the privacy and security of the electric-quantity trading. This solves the technical problem of how to let buyers and sellers obtain the maximum long-term benefit in alliance-chain-based P2P electric-quantity trading.
In an embodiment of the present invention, before the training matrix is stored into the replay pool of the neural network model, the method further includes: performing abnormal-value processing on the first state matrix S_t, the action matrix A_t and the second state matrix S_{t+1};
wherein values of the first state matrix S_t and the second state matrix S_{t+1} that are not within the preset value range are deleted and replaced with 0;
and element values of the action matrix A_t for which the buyer's price is less than the seller's price are deleted and replaced with 0.
Note that before the training matrix (S_t, A_t, S_{t+1}, R_t) is stored into the replay pool of the neural network model, abnormal-value processing is performed. This includes deleting values of the state matrices S_t and S_{t+1} that are not within the preset range and replacing them with 0: for example, if a value z in the state matrix is neither -1 nor 1, or if a price p in the state matrix is not within the preset price range required by the buyer, that value is deleted and then replaced with 0. It further includes deleting the element values of the action matrix A_t that do not satisfy the price condition (one quantity being the price at which buyer electric vehicle i purchases energy at time t, the other being the element of the action matrix, i.e. the amount of power purchased by electric vehicle i from electric vehicle j at time t), that is, elements for which the buyer's price is less than the seller's price, and replacing them with 0.
In one embodiment of the invention, outputting the action matrix A_t specifically includes: cutting the vector a output by the neural network model into N vectors, each containing N elements; the N vectors form the N×N action matrix A_t, wherein each element of a vector is the energy that one electric vehicle i trades with an electric vehicle j among the other N-1 vehicles.
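Cutting the flat Actor output into the N×N action matrix is a simple reshape, sketched here in pure Python (the sample values are illustrative):

```python
# Sketch: the Actor outputs a flat vector a of length N*N, which is cut into
# N vectors of N elements each; element A_t[i][j] is the energy that
# vehicle i trades with vehicle j.

def to_action_matrix(a, n):
    assert len(a) == n * n
    return [a[i * n:(i + 1) * n] for i in range(n)]

A_t = to_action_matrix([0.0, 0.4, 0.6,
                        0.0, 0.0, 0.0,
                        0.0, 0.0, 0.0], 3)
```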
The action matrix A_t is output by the Actor network of the neural network model, whose hidden layers compute α_{h+1} = Act_1(W_{h+1}α_h + b_{h+1}),
where α_h is the output of the h-th hidden layer of the Actor network and at the same time the input of the (h+1)-th hidden layer, W_{h+1} and b_{h+1} are respectively the weight matrix and bias of the (h+1)-th hidden layer, and Act_1(·) is the activation function of the Actor network's hidden layers. The hidden-layer activation function Act_1(·) uses the ReLU function, whose mathematical expression is Act_1(x) = max(0, x),
where x is the function variable. The ReLU function is the hidden-layer activation function of the neural network model; compared with other activation functions such as the sigmoid, the ReLU activates only a few neurons at a time, which guarantees the sparsity of the neural network, makes computation efficient, accelerates the convergence of the network weights, and mitigates the vanishing-gradient problem. If the neural network model suffers from dying neurons, the ReLU function can be replaced with the similar LeakyReLU function, whose mathematical expression is Act_1(x) = x for x > 0 and Act_1(x) = Cx for x ≤ 0, where C is a constant.
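The two hidden-layer activations discussed above can be written in a few lines; the LeakyReLU slope 0.01 is an illustrative choice for the constant C:

```python
# Scalar sketches of the hidden-layer activations: ReLU and LeakyReLU.

def relu(x):
    return x if x > 0 else 0.0

def leaky_relu(x, c=0.01):
    # c is the constant C above: a small slope for negative inputs,
    # which keeps "dead" neurons trainable
    return x if x > 0 else c * x
```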
The output layer of the Actor network computes a = Act_2(W_{H+1}α_H + b_{H+1}), where H is the total number of hidden layers, α_H is the output of the last hidden layer and at the same time the input of the output layer, W_{H+1} and b_{H+1} are respectively the weight matrix and bias of the output layer, and Act_2(·) is the activation function of the Actor network's output layer. The activation function Act_2(·) uses the Softmax function, whose mathematical expression is Act_2(x_i) = e^{x_i} / Σ_j e^{x_j},
where x_i is the input of the i-th node of the Actor network's output layer. Compared with other activation functions of a neural network model, the output values of the Softmax function sum to 1, so in the energy trading method based on deep reinforcement learning and the alliance chain they can be read directly as the percentage of the total required electric quantity that a buyer electric vehicle purchases from each of the other seller electric vehicles; no additional calculation is needed, which is efficient.
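A minimal Softmax over the Actor output layer, showing the sum-to-one property that lets each entry be read as a purchase fraction (the max-subtraction is a standard numerical-stability trick, not part of the patent's formula):

```python
import math

# Softmax: outputs are positive and sum to 1, so each entry can be read as
# the fraction of the buyer's total demand purchased from one seller.

def softmax(xs):
    m = max(xs)                               # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

fractions = softmax([2.0, 1.0, 0.1])
```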
In one embodiment of the invention, the expression of the state transfer function f(S_t, A_t) is:
where s_i^{t+1} is the second state vector of electric vehicle i, and δ_i^{t+1}, e_i^{t+1}, p_i^{t+1} and z_i^{t+1} respectively denote the remaining parking time, the transaction electric quantity, the transaction price and the buying/selling label of electric vehicle i at time t+1;
the expression of the transaction electric quantity required by the electric automobile i at the moment t +1 is as follows:
where e_i^t is the transaction electric quantity required by electric vehicle i at time t, z_i^t is the buying/selling label of electric vehicle i at time t, and a_{ij}^t is the element of the action matrix giving the energy purchased by electric vehicle i from electric vehicle j at time t;
where μ_1 and σ_1^2 are respectively the mean and variance of the normal distribution satisfied by the variable x;
wherein, when the transaction of electric vehicle i is not yet complete, the expressions for the remaining parking time and the buying/selling label of electric vehicle i at time t+1 are:
where δ_i^t is the remaining parking time of electric vehicle i at time t;
when the transaction of electric vehicle i has been completed, the expressions for the remaining parking time, the transaction electric quantity and the buying/selling label of electric vehicle i at time t+1 are:
where μ_2 and σ_2^2 are respectively the mean and variance of the normal distribution satisfied by the stay time of an electric vehicle in the energy trading floor, and μ_3 and σ_3^2 are respectively the mean and variance of the normal distribution satisfied by the energy that an electric vehicle needs to trade.
In addition, when the state matrix S_t transitions to the state matrix S_{t+1}, the remaining parking time δ and the transaction electric quantity e among the features of the state matrix are determined by the action matrix A_t taken at the transition; the buying/selling label z is related to the remaining electric quantity of the vehicle; and the transaction price p of the electric quantity may fluctuate randomly. Specifically, the relation between the remaining parking time δ and the transaction electric quantity e at time t+1 and their values at time t is embodied in the state transfer function f(S_t, A_t), while the transaction price p is not determined by it. The state transfer function accounts, in different ways, for whether the buying/selling label z changes, for the fluctuation of the transaction price p over time, and for the departure of electric vehicles after completing their trades and the arrival of new electric vehicles. For electric vehicles with incomplete transactions and for electric vehicles that have completed a transaction, the variable x is a random variable whose N state vectors satisfy a normal distribution; U(0, 1) denotes that the variable x satisfies a uniform distribution over the interval (0, 1), i.e., x is a random number in (0, 1). For example, at time t+1 the probability that an electric vehicle becomes an electricity buyer or a seller is 0.5.
In one embodiment of the invention, the expression of the reward matrix R_t is:
where r_i^t is the reward of electric vehicle i at time t; k_1, k_2, k_3 and k_4 are constants; s_t is the penalty factor at time t; a_{ij}^t is the element of the action matrix giving the energy purchased by electric vehicle i from electric vehicle j at time t; e_i^t is the energy required by electric vehicle i at time t; p_i^t is the price of the energy of electric vehicle i at time t; and z_i^t is the value of the buying/selling label of electric vehicle i at time t.
In an embodiment of the present invention, training the neural network model specifically includes: iteratively updating the parameters of the Critic network and the Actor network until the loss functions converge or the maximum number of iterations is reached, wherein the Critic network includes a Critic evaluation network and a Critic target network, and the Actor network includes an Actor evaluation network and an Actor target network;
wherein the loss function of the Critic network is:
the loss function for an Actor network is:
where L_1 is the loss of the Actor network; L_2 is the loss of the Critic network; γ is the discount coefficient; q_k is the output of the Critic evaluation network, representing the Q value corresponding to sample k; q'_k is the output of the Critic target network, representing the Q value at the next moment of sample k; k ∈ {1, 2, …, m}; and the mathematical expressions of q_k and q'_k are:
q_k = ReLU(W_2 S_k + W_3 A_k + b_2)
q'_k = ReLU(W'_2 S_{k+1} + W'_3 μ'(S_{k+1}) + b'_2)
where W_2 and W_3 are both weight matrices of the output layer of the Critic evaluation network, b_2 is the bias vector of the output layer of the Critic evaluation network, W'_2 and W'_3 are weight matrices of the output layer of the Critic target network, b'_2 is the bias vector of the output layer of the Critic target network, and μ'(S_{k+1}) is the output obtained by feeding S_{k+1} into the Actor target network, representing the optimal action matrix corresponding to state S_{k+1};
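The loss expressions themselves appear as images in the original publication; the definitions above (q_k, q'_k, discount coefficient γ, batch size m) are consistent with the standard DDPG-style losses — a mean-squared TD error for the Critic and the mean negative Q value for the Actor — so the sketch below assumes that standard form:

```python
# Hedged sketch, assuming the standard actor-critic losses that the
# surrounding definitions match; not a verbatim transcription of the patent.

def critic_loss(rewards, q_next, q, gamma=0.95):
    # L2 = (1/m) * sum over the batch of (r_k + gamma * q'_k - q_k)^2
    m = len(q)
    return sum((r + gamma * qn - qv) ** 2
               for r, qn, qv in zip(rewards, q_next, q)) / m

def actor_loss(q):
    # L1 = -(1/m) * sum of q_k: raising the Q value lowers the Actor loss
    return -sum(q) / len(q)
```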
the back propagation algorithm expression for iteratively updating the parameters of the Actor valuation network and the criticic valuation network is as follows:
the soft update algorithm expression for iteratively updating the parameters of the Actor valuation network and the criticic valuation network is as follows:
W′←τW+(1-τ)W′
where W denotes the parameters of the Critic evaluation network and the Actor evaluation network (e.g. W_2, W_3, b_2), ∇ denotes the gradient of the two loss functions L_1 and L_2, W' denotes the parameters of the Critic target network and the Actor target network, α is the training factor with value range [0, 1), and τ is a coefficient controlling the influence of the old target-network parameters W' and the evaluation-network parameters W on the target network.
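The soft update W' ← τW + (1 - τ)W' above can be sketched element-wise over flat parameter lists; the value τ = 0.5 here is purely illustrative:

```python
# Soft update: blend a fraction tau of the evaluation-network parameters
# into the target-network parameters, instead of copying them outright.

def soft_update(target, source, tau=0.01):
    return [tau * w + (1.0 - tau) * w_old
            for w, w_old in zip(source, target)]

target = [0.0, 0.0]          # old target-network parameters W'
source = [1.0, 2.0]          # current evaluation-network parameters W
target = soft_update(target, source, tau=0.5)
```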
Here m is a natural number. In the loss function L_2 of the Critic network, the value of the discount coefficient γ may be 0.95, 0.98, 0.999, etc.; the discount coefficient mainly controls the influence of the Q value at time t+1 on the action at the current time t. In this embodiment, γ is 0.95.
In the embodiment of the invention, the parameters of the Actor evaluation network and the Critic evaluation network are updated using the back-propagation algorithm, where the value of α may be 0.5, 0.1, 0.01, etc. The training factor mainly controls the training speed and effect of the neural network model; the training factor α is preferably 0.01.
In the embodiment of the invention, after the parameters of the Critic evaluation network and the Actor evaluation network have been updated k times, the soft-update algorithm is used to update the parameters of the Critic target network and the Actor target network. Unlike other DRL algorithms, which directly copy the evaluation-network parameters to the target network, the soft-update algorithm combines the old target-network parameters with the new evaluation-network parameters, so a more accurate neural network training model is obtained through training.
Fig. 3 is a framework diagram of the alliance chain of the energy trading method based on deep reinforcement learning and alliance chain according to an embodiment of the present invention.
In an embodiment of the present invention, obtaining the matching of buyers and sellers in the energy trading floor specifically further includes: at a given moment, a state matrix formed from the transaction characteristics of the buyers and sellers that need to trade is input into the neural network training model, and the neural network training model outputs the action matrix of the transaction.
It should be noted that, in the energy trading method based on deep reinforcement learning and the alliance chain, the final transaction is realized through the alliance chain in order to ensure the privacy and security of the transaction and to make the transaction result public, transparent and tamper-proof. As shown in fig. 3, the alliance chain model is composed of the electric vehicles and the energy trading floor (i.e., the parking lot). After entering the parking lot, an electric vehicle initiates a registration application to the energy trading floor and joins the alliance chain. At each moment, each electric vehicle uploads its state to the energy trading floor; the energy trading floor collects the states of all charging vehicles and inputs them into the neural network training model to obtain the optimal action, namely the pairing of buyers and sellers and the specific transaction electric quantity, and then returns the result and the corresponding electronic-wallet addresses to the two electric vehicles of each transaction. The trading electric vehicle transfers the electric quantity to the charging/discharging pile in the parking lot according to the obtained matching result. At the next moment, the buyer electric vehicle transfers money to the electronic wallet of the seller electric vehicle, and the seller electric vehicle obtains the payment for the electric quantity actually sold from its electronic wallet. Finally, at regular intervals, the energy trading floors of the different parking lots pack the transaction records of that interval into a block of the alliance chain, take turns acting as bookkeeper, and append the block to the end of the alliance blockchain after it is signed and confirmed by the other energy trading floors.
Wherein the energy trading floor is also referred to as an intermediary.
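The bookkeeping step — pack an interval's transaction records into a block, link it to the previous block, and append after confirmation — can be sketched with a simple hash chain. The record fields and the bare SHA-256 linkage are illustrative assumptions, not the patent's concrete consensus or signature scheme:

```python
import hashlib
import json

# Hedged sketch of interval bookkeeping on the alliance chain: each block
# carries its bookkeeper (a parking lot), the records of the interval, and
# the hash of the previous block, making the chain tamper-evident.

def make_block(records, prev_hash, bookkeeper):
    body = {"bookkeeper": bookkeeper, "prev": prev_hash, "records": records}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": digest}

chain = [make_block([], "0" * 64, "lot-A")]          # genesis block
records = [{"buyer": "EV1", "seller": "EV2", "kwh": 12.5, "price": 0.8}]
chain.append(make_block(records, chain[-1]["hash"], "lot-B"))  # lot-B's turn
```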
In the embodiment of the invention, the energy trading method based on deep reinforcement learning and the alliance chain handles the changes of the electric-vehicle P2P electric-quantity trading scene with the state, action, state transfer function and reward function of a deep reinforcement learning neural network model, so that the state matrix S_t, the action matrix A_t and the reward matrix R_t in the obtained neural network training model conform to the P2P electric-quantity trading of charging electric vehicles in the parking lot.
Example two:
fig. 4 is a block diagram of an energy transaction apparatus based on deep reinforcement learning and federation chain according to an embodiment of the present invention.
As shown in fig. 4, an embodiment of the present invention further provides an energy trading device based on deep reinforcement learning and alliance chain, which includes a data acquisition module 10, a first processing module 20, a second processing module 30, a training module 40, and an output module 50:
the data acquisition module 10 is used for acquiring transaction characteristics of the energy transaction field, forming the transaction characteristics into a state vector, and forming a first state matrix by N state vectors in the energy transaction field at the moment t; the transaction characteristics comprise the time of the electric automobile remaining in the energy transaction field, a trading label, transaction energy and a transaction price;
the first processing module 20 is used for inputting the first state matrix into the deep reinforcement learning neural network model and outputting an action matrix;
the second processing module 30 is configured to calculate the action matrix and the first state matrix through a state transfer function and a reward function to obtain a second state matrix and a reward matrix at a time t + 1; the first state matrix, the action matrix, the second state matrix and the reward matrix form a training matrix, and the training matrix is stored in a playback pool of the neural network model;
the training module 40 is configured to acquire m pieces of data of the training matrix from the playback pool of the neural network model every Δ t time to train the neural network model until a loss function of the neural network model converges or iterates to the maximum number of times, so as to obtain a trained neural network training model;
and the output module 50 is used for inputting the state matrix formed by the transaction characteristics required by the buyer and the seller in the energy trading field into the neural network training model to obtain the energy traded by the buyer and the seller.
It should be noted that, the modules in the apparatus according to the second embodiment correspond to the steps in the method according to the first embodiment, the steps of the method have been described in detail in the first embodiment, and the contents of the modules are not described in detail in the second embodiment.
Example three:
embodiments of the present invention provide a computer-readable storage medium for storing computer instructions that, when executed on a computer, cause the computer to perform the above-described deep reinforcement learning and federation chain-based energy trading method.
Example four:
the embodiment of the invention provides terminal equipment, which comprises a processor and a memory;
a memory for storing the program code and transmitting the program code to the processor;
and the processor is used for executing the energy trading method based on the deep reinforcement learning and the alliance chain according to the instructions in the program codes.
It should be noted that the processor is configured to execute, according to the instructions in the program code, the steps of the above-described embodiments of the energy trading method based on deep reinforcement learning and alliance chain. Alternatively, when executing the computer program, the processor implements the functions of each module/unit in the system/apparatus embodiments described above.
Illustratively, a computer program may be partitioned into one or more modules/units, which are stored in a memory and executed by a processor to accomplish the present application. One or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of a computer program in a terminal device.
The terminal device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device may include, but is not limited to, a processor, a memory. Those skilled in the art will appreciate that the terminal device is not limited and may include more or fewer components than those shown, or some components may be combined, or different components, e.g., the terminal device may also include input output devices, network access devices, buses, etc.
The Processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The storage may be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. The memory may also be an external storage device of the terminal device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal device. Further, the memory may also include both an internal storage unit of the terminal device and an external storage device. The memory is used for storing computer programs and other programs and data required by the terminal device. The memory may also be used to temporarily store data that has been output or is to be output.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. An energy transaction method based on deep reinforcement learning and alliance chain is applied to electric quantity transaction of an electric vehicle, and is characterized by comprising the following steps:
s10, collecting transaction characteristics of an energy transaction field, forming the transaction characteristics into a state vector, and forming a first state matrix by N state vectors in the energy transaction field at the moment t; the transaction characteristics comprise the time of the electric automobile remaining stopped in an energy transaction field, a trading label, transaction energy and a transaction price;
s20, inputting the first state matrix into a deep reinforcement learning neural network model, and outputting an action matrix;
s30, calculating the action matrix and the first state matrix through a state transfer function and a reward function to obtain a second state matrix and a reward matrix at the moment of t + 1; the first state matrix, the action matrix, the second state matrix and the reward matrix form a training matrix, and the training matrix is stored in a playback pool of the neural network model;
S40, acquiring data of m training matrices from the replay pool of the neural network model at every interval Δt to train the neural network model until the loss function of the neural network model converges or the maximum number of iterations is reached, so as to obtain a trained neural network training model;
and S50, in the energy trading place, inputting a state matrix formed by the trading features required by the buyer and the seller into the neural network training model to obtain the trading energy of the buyer and the seller.
2. The energy trading method based on deep reinforcement learning and alliance chain of claim 1, wherein before storing the training matrix into a replay pool of the neural network model, the method further comprises: processing abnormal values of the first state matrix, the action matrix and the second state matrix;
deleting the numerical values of the first state matrix and the second state matrix which are not in the preset value range, and supplementing 0;
and deleting the element values in the action matrix, which meet the condition that the price of the buyer is less than that of the seller, and supplementing 0.
3. The energy trading method based on deep reinforcement learning and alliance chain of claim 1, wherein outputting the action matrix specifically comprises: cutting a vector output by the neural network model into N vectors, each vector comprising N elements, the N vectors constituting the N×N action matrix; wherein each element of a vector is the energy that one electric vehicle trades with one of the other N-1 electric vehicles.
4. The energy trading method based on deep reinforcement learning and alliance chain of claim 1, wherein the expression of the state transfer function f(S_t, A_t) is:
where s_i^{t+1} is the second state vector of electric vehicle i, and δ_i^{t+1}, e_i^{t+1}, p_i^{t+1} and z_i^{t+1} respectively denote the remaining parking time, the transaction electric quantity, the transaction price and the buying/selling label of electric vehicle i at time t+1;
the expression of the transaction electric quantity required by the electric automobile i at the moment t +1 is as follows:
where e_i^t is the transaction electric quantity required by electric vehicle i at time t, z_i^t is the buying/selling label of electric vehicle i at time t, and a_{ij}^t is the element of the action matrix giving the energy purchased by electric vehicle i from electric vehicle j at time t;
where μ_1 and σ_1^2 are respectively the mean and variance of the normal distribution satisfied by the variable x;
wherein whenThen, the remaining parking time and the buying and selling label expression of the electric automobile i at the moment t +1 are as follows:
in the formula (I), the compound is shown in the specification,the rest parking time of the electric automobile i at the moment t is obtained;
when in useThen, the expressions of the remaining parking time, the transaction electric quantity and the buying and selling label of the electric automobile i at the moment t +1 are as follows:
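The formulas of claim 4 appear only as images in the specification. Purely as one plausible, non-authoritative reading of the prose (the sign convention for the buy/sell label and both function names are assumptions), the trade-quantity and parking-time updates might be sketched as:

```python
import numpy as np

def next_trade_quantity(e_t, label_t, actions_row):
    """Assumed update: the energy still to be traded by vehicle i
    decreases by the energy exchanged this step; label_t is +1 for a
    buyer and -1 for a seller (assumed convention, not from the claim)."""
    return e_t - label_t * np.sum(actions_row)

def next_parking_time(t_remaining):
    """Remaining parking time decrements by one step while positive."""
    return max(t_remaining - 1, 0)
```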
5. The deep reinforcement learning and alliance chain based energy trading method of claim 1, wherein the reward matrix R_t has the expression:
where the terms denote, respectively: the reward of electric vehicle i at time t; the constants k1, k2, k3 and k4; the penalty factor s_t at time t; the element of the action matrix giving the energy purchased by electric vehicle i from electric vehicle j at time t; the energy required by electric vehicle i at time t; the energy price set by electric vehicle i at time t; and the value of the buy/sell label of electric vehicle i at time t.
6. The deep reinforcement learning and alliance chain based energy trading method of claim 1, wherein training the neural network model specifically comprises: iteratively updating the parameters of a Critic network and an Actor network until the loss function converges or the maximum number of iterations is reached, wherein the Critic network comprises a Critic evaluation network and a Critic target network, and the Actor network comprises an Actor evaluation network and an Actor target network;
wherein the loss function of the Critic network is:
the loss function of the Actor network is as follows:
where L1 is the loss of the Actor network; L2 is the loss of the Critic network; γ is the discount coefficient; q_k is the output of the Critic evaluation network, representing the Q value corresponding to sample k; q'_k is the output of the Critic target network, representing the Q value at the next moment of sample k; k ∈ {1, 2, …, m}; and q_k and q'_k are given by:
q_k = ReLU(W2·S_k + W3·A_k + b2)
q'_k = ReLU(W'2·S_{k+1} + W'3·μ'(S_{k+1}) + b'2)
where W2 and W3 are both weight matrices of the output layer of the Critic evaluation network, b2 is the bias vector of the output layer of the Critic evaluation network, W'2 and W'3 are weight matrices of the output layer of the Critic target network, b'2 is the bias vector of the output layer of the Critic target network, and μ'(S_{k+1}) is the output obtained by feeding S_{k+1} into the Actor target network, representing the optimal action matrix corresponding to state S_{k+1};
the back-propagation algorithm expression for iteratively updating the parameters of the Actor evaluation network and the Critic evaluation network is:
the soft update algorithm expression for iteratively updating the parameters of the Actor target network and the Critic target network is:
W' ← τW + (1 − τ)W'
where W denotes the parameters W2, W3, b2, … of the Critic evaluation network and the Actor evaluation network, the gradients are those of the two loss functions L1 and L2 with respect to W, W' denotes the parameters of the Critic target network and the Actor target network, α is a training factor with value range (0, 1), and τ is a coefficient controlling the influence that the old target-network parameters W' and the evaluation-network parameters W have on the target network.
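As an illustrative sketch only (shapes, parameter handling and function names are assumptions; the claim does not fix an implementation), the Q-value computation q_k = ReLU(W2·S_k + W3·A_k + b2) and the soft update W' ← τW + (1−τ)W' of claim 6 can be written in NumPy as:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def critic_q(W2, W3, b2, s_k, a_k):
    """Output-layer Q value of the Critic evaluation network:
    q_k = ReLU(W2 @ s_k + W3 @ a_k + b2)."""
    return relu(W2 @ s_k + W3 @ a_k + b2)

def soft_update(target_params, eval_params, tau):
    """W' <- tau * W + (1 - tau) * W' for each parameter pair; tau
    controls how strongly the evaluation-network parameters W pull
    the target-network parameters W' toward them."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(eval_params, target_params)]
```

A small τ keeps the target network slowly tracking the evaluation network, which is the standard stabilization device in actor-critic training of this kind.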
7. The deep reinforcement learning and alliance chain based energy trading method of claim 1, wherein obtaining matched buyer and seller trades in the energy trading floor further comprises: at a given moment, inputting a state matrix formed from the transaction characteristics of the buyers and sellers needing to trade into the trained neural network model, which outputs the action matrix of the trades.
8. An energy transaction device, comprising a data acquisition module, a first processing module, a second processing module, a training module and an output module, wherein:
the data acquisition module is used for acquiring the transaction characteristics of the energy trading floor and forming them into state vectors; at time t, the N state vectors in the energy trading floor form a first state matrix; the transaction characteristics comprise the remaining parking time of the electric vehicle in the energy trading floor, the buy/sell label, the transaction energy and the transaction price;
the first processing module is used for inputting the first state matrix into a deep reinforcement learning neural network model and outputting an action matrix;
the second processing module is used for computing, from the action matrix and the first state matrix via the state transfer function and the reward function, the second state matrix and the reward matrix at time t+1; the first state matrix, the action matrix, the second state matrix and the reward matrix form a training matrix, which is stored in a replay pool of the neural network model;
the training module is used for acquiring m pieces of training-matrix data from the replay pool of the neural network model at every interval Δt to train the neural network model until its loss function converges or the maximum number of iterations is reached, obtaining a trained neural network model;
and the output module is used for inputting a state matrix formed from the transaction characteristics of the buyers and sellers in the energy trading floor into the trained neural network model, obtaining the energy traded between the buyers and sellers.
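The replay pool that the training module samples from can be sketched minimally as follows (class and method names are illustrative assumptions; the claim only requires storing training tuples and drawing m of them per interval):

```python
import random
from collections import deque

class ReplayPool:
    """Minimal replay pool: stores (S_t, A_t, S_{t+1}, R_t) training
    tuples and yields random minibatches of m samples for training."""

    def __init__(self, capacity=10000):
        # deque with maxlen discards the oldest tuples once full
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, m):
        # draw without replacement; cap m at the pool's current size
        return random.sample(self.buffer, min(m, len(self.buffer)))
```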
9. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the deep reinforcement learning and alliance chain based energy trading method of any one of claims 1 to 7.
10. A terminal device comprising a processor and a memory;
the memory is used for storing program code and transmitting the program code to the processor;
and the processor is configured to execute the deep reinforcement learning and alliance chain based energy trading method of any one of claims 1 to 7 according to the instructions in the program code.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011420188.7A CN112419064B (en) | 2020-12-07 | 2020-12-07 | Energy transaction method, device and equipment based on deep reinforcement learning and alliance chain |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112419064A true CN112419064A (en) | 2021-02-26 |
CN112419064B CN112419064B (en) | 2022-02-08 |
Family
ID=74775865
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011420188.7A Active CN112419064B (en) | 2020-12-07 | 2020-12-07 | Energy transaction method, device and equipment based on deep reinforcement learning and alliance chain |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112419064B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114202229A (en) * | 2021-12-20 | 2022-03-18 | 南方电网数字电网研究院有限公司 | Method and device for determining energy management strategy, computer equipment and storage medium |
US20230063075A1 (en) * | 2021-07-27 | 2023-03-02 | Tata Consultancy Services Limited | Method and system to generate pricing for charging electric vehicles |
CN117078347A (en) * | 2023-08-28 | 2023-11-17 | 合肥工业大学 | Electric-carbon integrated transaction method based on alliance chain |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107423978A (en) * | 2017-06-16 | 2017-12-01 | 郑州大学 | A kind of distributed energy business confirmation method based on alliance's block chain |
CN108038545A (en) * | 2017-12-06 | 2018-05-15 | 湖北工业大学 | Fast learning algorithm based on Actor-Critic neutral net continuous controls |
CN108985940A (en) * | 2018-07-18 | 2018-12-11 | 国网能源研究院有限公司 | Power exchange management system and method between a kind of user based on block chain technology |
CN109003082A (en) * | 2018-07-24 | 2018-12-14 | 电子科技大学 | PHEV power exchange system and its method of commerce based on alliance's block chain |
CN109784926A (en) * | 2019-01-22 | 2019-05-21 | 华北电力大学(保定) | A kind of virtual plant internal market method of commerce and system based on alliance's block chain |
CN110349027A (en) * | 2019-07-19 | 2019-10-18 | 湘潭大学 | Pairs trade system based on deeply study |
CN110378693A (en) * | 2019-07-11 | 2019-10-25 | 合肥工业大学 | Distributed energy weak center trade managing system based on alliance's block chain |
CN110458443A (en) * | 2019-08-07 | 2019-11-15 | 南京邮电大学 | A kind of wisdom home energy management method and system based on deeply study |
US20200160411A1 (en) * | 2018-11-16 | 2020-05-21 | Mitsubishi Electric Research Laboratories, Inc. | Methods and Systems for Optimal Joint Bidding and Pricing of Load Serving Entity |
CN111815369A (en) * | 2020-07-31 | 2020-10-23 | 上海交通大学 | Multi-energy system energy scheduling method based on deep reinforcement learning |
Non-Patent Citations (5)
Title |
---|
JIAWEN KANG ET AL.: "Enabling Localized Peer-to-Peer Electricity Trading Among Plug-in Hybrid Electric Vehicles Using Consortium Blockchains", IEEE Transactions on Industrial Informatics *
TIMOTHY P. LILLICRAP ET AL.: "Continuous Control with Deep Reinforcement Learning", arXiv *
YANG LI ET AL.: "Deep Robust Reinforcement Learning for Practical Algorithmic Trading", IEEE Access *
LIU JIANWEI ET AL.: "A Survey of Deep Reinforcement Learning Based on Value Function and Policy Gradient", Chinese Journal of Computers *
QI YUE ET AL.: "Portfolio Management Based on the Deep Reinforcement Learning DDPG Algorithm", Computer and Modernization *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230063075A1 (en) * | 2021-07-27 | 2023-03-02 | Tata Consultancy Services Limited | Method and system to generate pricing for charging electric vehicles |
CN114202229A (en) * | 2021-12-20 | 2022-03-18 | 南方电网数字电网研究院有限公司 | Method and device for determining energy management strategy, computer equipment and storage medium |
CN114202229B (en) * | 2021-12-20 | 2023-06-30 | 南方电网数字电网研究院有限公司 | Determining method of energy management strategy of micro-grid based on deep reinforcement learning |
CN117078347A (en) * | 2023-08-28 | 2023-11-17 | 合肥工业大学 | Electric-carbon integrated transaction method based on alliance chain |
Also Published As
Publication number | Publication date |
---|---|
CN112419064B (en) | 2022-02-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112419064B (en) | Energy transaction method, device and equipment based on deep reinforcement learning and alliance chain | |
CN109034915B (en) | Artificial intelligent electronic commerce system capable of using digital assets or points as transaction media | |
Bergemann et al. | Sequential information disclosure in auctions | |
CA3177410A1 (en) | Market orchestration system for facilitating electronic marketplace transactions | |
Qiu et al. | Multi-Agent Reinforcement Learning for Automated Peer-to-Peer Energy Trading in Double-Side Auction Market. | |
Yassine et al. | Double auction mechanisms for dynamic autonomous electric vehicles energy trading | |
Zhang et al. | EV charging bidding by multi-DQN reinforcement learning in electricity auction market | |
Backus et al. | Dynamic demand estimation in auction markets | |
Gong et al. | Split-award contracts with investment | |
Keniston | Bargaining and welfare: A dynamic structural analysis | |
Ray et al. | Supplier behavior modeling and winner determination using parallel MDP | |
Alsenani | The participation of electric vehicles in a peer-to-peer energy-backed token market | |
Carvalho | On a participation structure that ensures representative prices in prediction markets | |
KR20140100632A (en) | Resale system for repeating sale goods and method of the same | |
Clempner | A dynamic mechanism design for controllable and ergodic markov games | |
Fostel et al. | Endogenous leverage: VaR and beyond | |
CN110782338A (en) | Loan transaction risk prediction method and device, computer equipment and storage medium | |
Cheng et al. | Recent studies of agent incentives in internet resource allocation and pricing | |
Withanawasam et al. | Characterising trader manipulation in a limit-order driven market | |
Özer et al. | Multi-unit differential auction–barter model for electronic marketplaces | |
Kim | Maximizing sellers’ welfare in online auction by simulating bidders’ proxy bidding agents | |
Dong et al. | Unilateral counterparty risk valuation of CDS using a regime-switching intensity model | |
Uhryn et al. | Modelling a System for Intelligent Forecasting of Trading on Stock Exchanges | |
Zhang et al. | A deep reinforcement learning-based bidding strategy for participants in a peer-to-peer energy trading scenario | |
Laskey et al. | Combinatorial prediction markets for fusing information from distributed experts and models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 20230407
Address after: Room 601, Building B1, 136 Kaiyuan Avenue, Huangpu District, Guangzhou City, Guangdong Province, 510000
Patentee after: Guangzhou Huihui Intelligent Technology Co.,Ltd.
Address before: 510275 No. 135 West Xingang Road, Guangzhou, Guangdong
Patentee before: SUN YAT-SEN University