TWI767868B - Method and apparatus for planning energy usage of charging station based on reinforcement learning


Info

Publication number
TWI767868B
TWI767868B TW110141537A
Authority
TW
Taiwan
Prior art keywords
reinforcement learning
charging station
energy
reward
learning table
Prior art date
Application number
TW110141537A
Other languages
Chinese (zh)
Other versions
TW202320002A (en)
Inventor
江坤諺
邱偉育
Original Assignee
國立清華大學
Priority date
Filing date
Publication date
Application filed by 國立清華大學
Priority to TW110141537A
Application granted
Publication of TWI767868B
Publication of TW202320002A

Abstract

A method and an apparatus for planning the energy usage of a charging station based on reinforcement learning are provided. In the method, multiple system states are defined using the power demand and remaining battery energy of the charging station itself, together with the global power demand and internal power price of an energy-sharing area, and the expected returns of arranging energy usage actions under each system state are estimated to construct a reinforcement learning table. According to the reinforcement learning table, an energy usage action suited to the current system state is selected and uploaded to a coordinator device, and the trading electricity arranged by the coordinator device, together with the reward for adopting the energy usage action calculated by the coordinator device, is used to update the reinforcement learning table. The current system state, the energy usage action, the reward, and the number of times the system state has been visited are recorded and used to generate a simulation environment, so as to calculate the overall benefit of arranging the energy usage actions under each system state and update the reinforcement learning table accordingly.

Description

Method and device for energy usage planning of a charging station based on reinforcement learning

The present invention relates to a reinforcement learning method and device, and in particular to a reinforcement learning-based method and device for planning the energy usage of a charging station.

In recent years, growing environmental awareness has led many people to adopt electric vehicles, and with the substantial increase in electric vehicle users, the demand for electric vehicle charging stations has risen accordingly. However, because electric vehicle users have different habits, their demands on charging stations differ; when a large number of electric vehicles charge at once, charging across stations becomes uncoordinated and negatively affects the overall power grid.

Earlier approaches to energy usage planning among multiple electric vehicle charging stations employed nonlinear programming algorithms, which require real-time forecasts of prices, electric vehicle demand, and renewable energy data, making performance difficult to improve. To address this problem, some literature has proposed model-free reinforcement learning algorithms, but these converge slowly, resulting in higher costs and wasted energy, and therefore cannot maximize the overall benefit of the charging stations.

The present invention provides a reinforcement learning-based method and device for planning the energy usage of charging stations, which can properly arrange the battery charging/discharging and energy supply of each charging station so as to maximize the overall benefit of the charging stations.

The present invention provides a reinforcement learning-based method for planning the energy usage of a charging station, suitable for planning energy usage by a designated charging station among multiple charging stations in an energy-sharing area. The method includes: defining multiple system states using the station's own power demand, remaining battery energy, and the global power demand and internal electricity price of the energy-sharing area, and estimating the expected return of arranging energy usage actions under each system state to construct a reinforcement learning table, where the global power demand is obtained by a coordinator device integrating the power demands uploaded by the charging stations; selecting, according to the reinforcement learning table, an energy usage action suited to be arranged under the current system state and uploading it to the coordinator device, and updating the reinforcement learning table according to the trading electricity arranged by the coordinator device and the calculated reward for adopting this energy usage action; and recording the current system state, the energy usage action, the reward, and the number of times the current system state has been visited to generate a simulation environment, in which the reward obtained by arranging the energy usage actions under each system state is computed and the reinforcement learning table is updated accordingly.

The present invention provides a reinforcement learning-based charging station energy usage planning device, which is deployed at a designated charging station. The charging station energy usage planning device includes a connection device, a storage device, and a processor. The connection device is used to connect to a coordinator device, which manages multiple charging stations in an energy-sharing area, including the designated charging station. The storage device is used to store a computer program. The processor is coupled to the connection device and the storage device and is configured to load and execute the computer program to: define multiple system states using the power demand of the designated charging station, the remaining battery energy, and the global power demand and internal electricity price of the energy-sharing area, and estimate the expected return of arranging energy usage actions under each system state to construct a reinforcement learning table, where the global power demand is obtained by the coordinator device integrating the power demands uploaded by the charging stations; select, according to the reinforcement learning table, an energy usage action suited to be arranged under the current system state and upload it to the coordinator device, and update the reinforcement learning table according to the trading electricity arranged by the coordinator device and the calculated reward for adopting this energy usage action; and record the current system state, the energy usage action, the reward, and the number of times the current system state has been visited to generate a simulation environment, in which the reward obtained by arranging the energy usage actions under each system state is computed and the reinforcement learning table is updated accordingly.

To make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.

Embodiments of the present invention apply reinforcement learning to charging stations. Based on power demand information from the outside world, a model-based multi-agent reinforcement learning algorithm iteratively updates and, over fixed time periods, plans the energy usage of the electric vehicle charging stations, arranging each station's battery charging/discharging and the energy it supplies to electric vehicles so as to maximize the overall benefit of the charging stations.

FIG. 1 is a schematic diagram of an energy sharing system according to an embodiment of the present invention. Referring to FIG. 1, the energy sharing system 10 of this embodiment is suitable for a cooperation area of electric vehicle charging stations, which includes multiple electric vehicle charging stations EVCS 1 to EVCS I (where I is a positive integer) and at least one coordinator device 12 responsible for relaying information. Each charging station EVCS 1 to EVCS I in the area is equipped with an energy storage system (ESS) and can sell surplus electricity to, or purchase the electricity it lacks from, the other electric vehicle charging stations. The charging stations EVCS 1 to EVCS I decide the electricity supplied to electric vehicles and adjust their charging/discharging strategies appropriately, according to the real-time price provided by the power plant 14, the charging demand of their own electric vehicle users, the remaining energy of their storage systems, and the charging demand of all electric vehicle users.

FIG. 2 is a block diagram of a reinforcement learning-based charging station energy usage planning device according to an embodiment of the present invention. Referring to FIG. 1 and FIG. 2 together, the charging station energy usage planning device 20 of this embodiment is, for example, deployed in the charging station EVCS 1 of FIG. 1, but in other embodiments it may also be deployed in the other charging stations of FIG. 1. The charging station energy usage planning device 20 is, for example, a computing device with computational capability, such as a file server, a database server, an application server, a workstation, or a personal computer, and includes components such as a connection device 22, a storage device 24, and a processor 26, whose functions are described below:

The connection device 22 is, for example, any wired or wireless interface device connectable to the coordinator device 12, and can be used to upload the power demand of the charging station EVCS 1 itself to the coordinator device 12 and to receive the global power demand returned by the coordinator device 12. For wired connections, the connection device 22 may be an interface such as universal serial bus (USB), RS232, universal asynchronous receiver/transmitter (UART), inter-integrated circuit (I2C), serial peripheral interface (SPI), DisplayPort, or Thunderbolt, but is not limited thereto. For wireless connections, the connection device 22 may be a device supporting communication protocols such as wireless fidelity (Wi-Fi), RFID, Bluetooth, infrared, near-field communication (NFC), or device-to-device (D2D), and is likewise not limited thereto. In some embodiments, the connection device 22 may also include a network card supporting Ethernet or wireless network standards such as 802.11g, 802.11n, and 802.11ac, so that the charging station energy usage planning device 20 can connect to the coordinator device 12 via a network to upload or receive data such as the power demand, the global power demand, and the trading electricity.

The storage device 24 is, for example, any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk, a similar component, or a combination thereof, and is used to store the computer program executable by the processor 26. In some embodiments, the storage device 24 may, for example, also store the reinforcement learning table established by the processor 26 and the global power demand received by the connection device 22 from the coordinator device 12.

The processor 26 is, for example, a central processing unit (CPU) or another programmable general-purpose or special-purpose microprocessor, microcontroller, digital signal processor (DSP), programmable controller, application-specific integrated circuit (ASIC), programmable logic device (PLD), or other similar device or a combination of these devices; the present invention is not limited in this respect. In this embodiment, the processor 26 can load the computer program from the storage device 24 to execute the reinforcement learning-based charging station energy usage planning method of the embodiments of the present invention.

The reinforcement learning-based charging station energy usage planning method of the embodiments of the present invention, for example, models the operation of multiple electric vehicle charging stations as a Markov decision process (MDP), treats each electric vehicle charging station as an agent for learning, and discretizes operation time into time slots; to improve planning efficiency, an episodic setting is adopted, for example.

In detail, FIG. 3 is a flowchart of a reinforcement learning-based charging station energy usage planning method according to an embodiment of the present invention. Referring to FIG. 1, FIG. 2, and FIG. 3 together, the method of this embodiment is applicable to the charging station energy usage planning device 20 described above; the detailed steps of the charging station energy usage planning method of this embodiment are described below in conjunction with the components of the charging station energy usage planning device 20.

In step S302, the processor 26 of the charging station energy usage planning device 20 defines multiple system states using the power demand of the charging station EVCS 1, the remaining battery energy, and the global power demand and internal electricity price of the energy-sharing area, and estimates the expected return of arranging energy usage actions under each system state to construct a reinforcement learning table. The processor 26, for example, uploads its own power demand to the coordinator device 12 of the energy-sharing area via the connection device 22, and receives the global power demand obtained by the coordinator device 12 integrating the power demands uploaded by the charging stations EVCS 1 to EVCS I (i.e., the overall power demand of the charging stations); the coordinator device 12 notifies each charging station of the current global power demand according to the area in which the station is located.
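As a sketch of this aggregation step, the coordinator could simply sum the demands uploaded by the stations and broadcast the total. This is a minimal illustration only, under the assumption that integration means summation; the patent does not specify the coordinator's implementation, and the function name and data layout below are hypothetical.

```python
def aggregate_global_demand(uploaded_demands: dict[int, float]) -> float:
    """Coordinator side: integrate per-station demands {station_id: demand}
    into the global power demand D(t) announced back to every station."""
    return sum(uploaded_demands.values())
```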

In detail, the processor 26 is, for example, given a state space $S$ and an action space $A$. The state in time slot $t$ is denoted $s(t)$, where $s(t) \in S$, and the action selected in state $s(t)$ during time slot $t$ is denoted $a(t)$, where $a(t) \in A$. After action $a(t)$ is selected in state $s(t)$, the environment transitions to the next state $s(t+1)$ and produces an overall benefit $P(t)$. The probability function of selecting action $a(t)$ in state $s(t)$ is denoted as the policy $\pi(s(t), a(t))$, and the action-value function (i.e., the Q-function) $Q^{\pi}(s, a)$, which evaluates the expected value of the cumulative benefit of following policy $\pi$ from time slot $t$, can be defined as:

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} P(t+k) \,\middle|\, s(t) = s,\ a(t) = a\right], \quad \forall s \in S,\ a \in A,$$

where $\gamma$ is the discount factor.

In this embodiment, the processor 26, for example, defines the state $s_i(t)$ of the $i$-th charging station in time slot $t$ as:

$$s_i(t) = \left(D(t),\ B_i(t),\ d_i(t),\ p(t)\right),$$

where $D(t)$ is the global power demand of the energy-sharing area in time slot $t$, $B_i(t)$ is the battery energy of the $i$-th charging station, $d_i(t)$ is the power demand of the $i$-th charging station, and $p(t)$ is the internal electricity price of the energy-sharing area, such as the real-time price provided by the power plant. Here, $D(t)$ serves as an observational indicator that helps a charging station learn the effects of the other charging stations' actions and improves learning efficiency.

In other embodiments, the processor 26, for example, defines the state $s_i(t)$ of the $i$-th charging station in time slot $t$ as:

$$s_i(t) = \left(D(t),\ B_i(t),\ d_i^{u}(t),\ d_i^{r}(t),\ e_i(t),\ p(t)\right),$$

where $D(t)$ is the global power demand of the energy-sharing area in time slot $t$, $B_i(t)$ is the battery energy of the $i$-th charging station, $d_i^{u}(t)$ is the urgent demand of the $i$-th charging station, $d_i^{r}(t)$ is the regular demand of the $i$-th charging station, $e_i(t)$ is the renewable energy of the $i$-th charging station, and $p(t)$ is the internal electricity price of the energy-sharing area. The urgent demand is, for example, a power demand that satisfies at least one urgency condition, where the urgency condition is, for example, a charging-related constraint such as a charging-time limit (e.g., one hour) or a charging-amount limit, without limitation here.

The action of each charging station can be defined as:

$$a_i(t) = \left(g_i(t),\ b_i(t)\right),$$

where $g_i(t)$ is the charging/discharging (trading) demand and $b_i(t)$ is the battery charging/discharging amount. When $g_i(t)$ is positive, the charging station needs to purchase electricity; when $g_i(t)$ is negative, the charging station can sell electricity.
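To make the state and action encoding concrete, the following is a minimal Python sketch. The class and field names are hypothetical and chosen only for readability; discretized integer fields and frozen (hashable) dataclasses are assumptions made so that states and actions can serve as keys of a tabular Q-table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StationState:
    """System state s_i(t) of the i-th charging station (first variant above)."""
    global_demand: int  # D(t): global power demand of the energy-sharing area
    battery: int        # B_i(t): battery energy of the station
    demand: int         # d_i(t): power demand of the station
    price: int          # p(t): internal electricity price (discretized)

@dataclass(frozen=True)
class StationAction:
    """Energy usage action a_i(t)."""
    trade: int          # g_i(t): > 0 means buy electricity, < 0 means sell
    battery_delta: int  # b_i(t): battery charging (+) / discharging (-) amount
```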

In step S304, the processor 26 selects, according to the reinforcement learning table, an energy usage action suited to be arranged under the current system state and uploads it to the coordinator device 12, and updates the reinforcement learning table according to the trading electricity arranged by the coordinator device 12 and the calculated reward for adopting this energy usage action.

In some embodiments, the processor 26, for example, selects the optimal action among the multiple energy usage actions recorded in the reinforcement learning table for the current system state, and transmits the current system state together with the selected energy usage action to the coordinator device 12 via the connection device 22. The coordinator device 12 then computes the trading electricity arranged for the charging station, including the amount of energy shared with other charging stations and the electricity purchased from or sold to the power plant 14. The processor 26 also, for example, transmits the benefit information obtained by the charging station to the coordinator device 12, which computes the overall benefit of all charging stations; this overall benefit serves as the reward for the processor 26 adopting the energy usage action.

In detail, the optimization problem of each charging station is to find, from the current system state, the optimal policy $\pi^{*}$ that maximizes the expected value of the overall benefit, and the optimal action-value function can be denoted $Q^{*}(s, a)$. The optimal policy $\pi^{*}$ is selected based on:

$$\pi^{*}(s) = \arg\max_{a \in A_i} Q^{*}(s, a), \quad \forall s \in S_i,\ a \in A_i,$$

where $S_i$ is the state range of the charging station and $A_i$ is the action range of the charging station, which is, for example, the range that satisfies the electricity supply to electric vehicles and the charging/discharging amount in time slot $t$, and depends on the charging station's own power demand and the remaining energy of its storage system.

According to the trading electricity computed by the coordinator device 12 and the overall benefit of all charging stations, the processor 26 can update the learning value $Q(s_i(t), a_i(t))$ in the reinforcement learning table according to:

$$Q(s_i(t), a_i(t)) \leftarrow (1 - \alpha)\, Q(s_i(t), a_i(t)) + \alpha \left[r(t) + \gamma \max_{a'} Q(s_i(t+1), a')\right],$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $\max_{a'} Q(s_i(t+1), a')$ is the learning value obtained by arranging the trading electricity $a'$ under system state $s_i(t+1)$. The learning rate $\alpha$ is, for example, any value between 0.1 and 0.5, and determines the proportion by which the new system state $s_i(t+1)$ influences the learning value of the original system state $s_i(t)$. The discount factor $\gamma$ is, for example, any value between 0.9 and 0.99, and determines the ratio of the learning value of the new system state $s_i(t+1)$ relative to the returned reward $r(t)$.
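A minimal sketch of this tabular update follows, assuming the Q-table is a dictionary keyed by (state, action) pairs as in the sketch above. The `actions_in` callback is a hypothetical helper returning the admissible action range $A_i$ of a state, and the particular $\alpha$ and $\gamma$ values merely fall inside the ranges quoted in the text.

```python
from collections import defaultdict

Q = defaultdict(float)  # learning values Q(s, a), defaulting to 0

def q_update(s, a, reward, s_next, actions_in, alpha=0.3, gamma=0.95):
    """One update: Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions_in(s_next))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (reward + gamma * best_next)
```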

In step S306, the processor 26 records the current system state, the energy usage action, the reward, and the number of times the current system state has been visited to generate a simulation environment, and, in the simulation environment, computes the reward obtained by arranging the energy usage actions under each system state, updating the reinforcement learning table accordingly.

In detail, when a charging station actually operates, at every time slot it records its system state, the action performed, the reward obtained by performing the action, and the number of times each system state has been visited, and the recorded data can be used to generate a simulation environment for learning. A higher visit count for a system state indicates a higher probability that the state will occur in the future; the visit count therefore determines the priority of the system state during the planning process.
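One way to realize this record keeping is a small experience model that stores observed rewards and counts state visits, later serving as the simulation environment. The sketch below is an assumed rendering of the embodiment's bookkeeping, with all names hypothetical.

```python
from collections import defaultdict

class ExperienceModel:
    """Recorded experience: (state, action) -> reward, plus state visit counts."""
    def __init__(self):
        self.reward = {}                 # (s, a) -> last reward observed for the pair
        self.visits = defaultdict(int)   # s -> number of times state s was visited
        self.actions = defaultdict(set)  # s -> set of actions already tried in s

    def record(self, s, a, r):
        self.reward[(s, a)] = r
        self.visits[s] += 1
        self.actions[s].add(a)

    def states_by_priority(self):
        """States sorted by visit count: frequently visited states are more
        likely to recur, so they are planned first."""
        return sorted(self.visits, key=self.visits.get, reverse=True)
```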

After the reinforcement learning table is established, the generated simulation environment can be used for learning locally. In some embodiments, to have enough data for local learning, planning is executed in units of episodes (i.e., planning is performed at fixed intervals). And to avoid wasting system resources on unnecessary planning, whether to enter planning can further be decided according to the change in the overall benefit.

In detail, FIG. 4 is a flowchart of a reinforcement learning-based charging station energy usage planning method according to an embodiment of the present invention. Referring to FIG. 1, FIG. 2, FIG. 3, and FIG. 4 together, this embodiment describes the detailed steps of step S306 of the embodiment of FIG. 3.

In step S402, the processor 26 records the current system state, the energy usage action, the reward, and the number of times the current system state has been visited to generate a simulation environment, and, in the simulation environment, computes the reward obtained by arranging the energy usage actions under each system state. Here, the processor 26, for example, transmits to the coordinator device 12 the benefit information the charging station would obtain by selecting the energy usage action under the current system state, and takes the overall benefit of all charging stations computed by the coordinator device 12 as the reward for adopting this energy usage action.

In step S404, the processor 26 computes the change rate of the overall benefit and determines whether this change rate exceeds a preset threshold. The change rate $\Delta(t)$ is given by:

$$\Delta(t) = \frac{\left|\bar{P}(t) - \bar{P}(t-1)\right|}{\bar{P}(t-1)},$$

where $\bar{P}(t)$ denotes the overall benefit (average benefit) computed at time $t$ by the coordinator device 12 from the benefit information of the charging stations, and $\Delta(t)$ denotes the change rate of the overall benefit from time $t-1$ to time $t$. In step S404, the processor 26 determines from this change rate $\Delta(t)$ whether planning should be entered.

If the change rate $\Delta(t)$ is greater than the preset threshold $\delta$, then in step S406 the processor 26 plans the energy usage actions in the reinforcement learning table. Conversely, if the change rate $\Delta(t)$ is not greater than the preset threshold $\delta$, or once the planning of the reinforcement learning table is complete, then in step S408 the processor 26 waits until the charging station enters the next system state, and then, according to the updated reinforcement learning table, selects the energy usage action suited to be arranged under the next system state and performs the update of the reinforcement learning table.

FIG. 5 is a flowchart of a reinforcement learning-based charging station energy usage planning method according to an embodiment of the present invention. Referring to FIG. 1, FIG. 2, FIG. 4, and FIG. 5 together, this embodiment describes the detailed steps of step S406 of the embodiment of FIG. 4.

In step S502, the processor 26 selects one system state at a time according to the visit counts of the system states recorded in the reinforcement learning table. For example, the processor 26 sorts the system states by their recorded visit counts and selects the system state with the highest visit count for local learning.

In step S504, the processor 26 randomly selects one of the energy usage actions recorded in the reinforcement learning table for that system state, and uses it to compute the reward obtained by adopting the selected energy usage action under the selected system state. During this simulation, the processor 26, for example, uploads the currently selected system state and energy usage action to the coordinator device 12 using the method described in the foregoing embodiments, and the coordinator device 12 computes the overall benefit of all charging stations and provides it to the processor 26 as the reward, which is used to determine whether to update the reinforcement learning table.

In step S506, the processor 26 determines whether the reward obtained by adopting the selected energy usage action under the currently selected system state is greater than the previously recorded reward. If the current reward is not greater than the previous reward, the flow returns to step S504 to reselect an energy usage action and recompute the reward.

If the current reward is greater than the previous reward, then in step S508 the processor 26 updates the reinforcement learning table with the currently selected energy usage action, for example by updating the learning value in the reinforcement learning table for selecting this energy usage action under the selected system state.

In step S510, the processor 26 determines whether the number of system states already selected during the planning process for computing rewards to update the reinforcement learning table exceeds a predetermined proportion. The proportion is, for example, one quarter or another value, without limitation here.

If the number of selected system states does not exceed the predetermined proportion, the flow returns to step S502, and the processor 26 selects the next system state to update the reinforcement learning table. Conversely, if the number of selected system states exceeds the predetermined proportion, then in step S512 the processor 26 ends the planning process. Limiting the number of system states selected during planning greatly accelerates learning. After the planning process ends, the updated reinforcement learning table has accumulated a certain amount of experience and can therefore, in actual operation, provide the charging station with a charging/discharging strategy suited to the current system state, maximizing the overall benefit of the charging stations.
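Putting steps S502 to S512 together, the following is a minimal sketch of the local planning loop, reusing the hypothetical `ExperienceModel` and Q-table sketched earlier. The `simulate_reward` callback stands in for the coordinator-side overall-benefit computation replayed in the simulation environment, and the simple proportional Q adjustment is a simplification; only the visit-count ordering, the random action choice, the improvement test, and the budgeted fraction of states follow the description above.

```python
import random

def plan(model, Q, simulate_reward, state_fraction=0.25, alpha=0.3):
    """Local planning over the simulated environment (steps S502 to S512)."""
    states = model.states_by_priority()                 # S502: most-visited states first
    budget = max(1, int(len(states) * state_fraction))  # S510: predetermined proportion, e.g. 1/4
    for s in states[:budget]:
        best_prev = max(model.reward[(s, a)] for a in model.actions[s])
        candidates = list(model.actions[s])
        random.shuffle(candidates)                      # S504: try actions in random order
        for a in candidates:
            r = simulate_reward(s, a)                   # reward from the simulated environment
            if r > best_prev:                           # S506: better than the recorded reward?
                model.reward[(s, a)] = r
                Q[(s, a)] += alpha * (r - Q[(s, a)])    # S508: update the learning value
                break
    # S512: planning ends once the budgeted fraction of states has been processed
```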

In summary, in the reinforcement learning-based charging station energy usage planning method and device of the embodiments of the present invention, the electricity supplied to electric vehicles is decided according to the information within the charging stations' energy-sharing area and the environmental data at the time, and the reinforcement learning table is updated through iterative updates and by planning the charging stations' energy usage over fixed time periods. This accelerates the learning speed of the multi-agent learning model so that it adapts to the environment quickly, and maximizes the overall benefit of the charging stations.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone with ordinary knowledge in the relevant technical field may make some changes and modifications without departing from the spirit and scope of the present invention; therefore, the protection scope of the present invention shall be determined by the appended claims.

10: energy sharing system; 12: coordinator device; 14: power plant; 20: charging station energy usage planning device; 22: connection device; 24: storage device; 26: processor; EVCS 1–EVCS I: charging stations; S302–S306, S402–S408, S502–S512: steps

FIG. 1 is a schematic diagram of an energy sharing system according to an embodiment of the present invention. FIG. 2 is a block diagram of a reinforcement learning-based charging station energy usage planning device according to an embodiment of the present invention. FIG. 3 is a flowchart of a reinforcement learning-based charging station energy usage planning method according to an embodiment of the present invention. FIG. 4 is a flowchart of a reinforcement learning-based charging station energy usage planning method according to an embodiment of the present invention. FIG. 5 is a flowchart of a reinforcement learning-based charging station energy usage planning method according to an embodiment of the present invention.

S302–S306: steps

Claims (20)

1. A reinforcement learning-based method for planning the energy usage of a charging station, suitable for planning energy usage by a designated charging station among a plurality of charging stations in an energy-sharing area, the method comprising the following steps: defining a plurality of system states using the station's own power demand, remaining battery energy, and the global power demand and internal electricity price of the energy-sharing area, and estimating an expected return of arranging energy usage actions under each of the system states to construct a reinforcement learning table, wherein the global power demand is obtained by a coordinator device integrating the power demands uploaded by each of the charging stations; selecting, according to the reinforcement learning table, an energy usage action suited to be arranged under a current system state and uploading it to the coordinator device, and updating the reinforcement learning table according to the trading electricity arranged by the coordinator device and the calculated reward for adopting the energy usage action; and recording the current system state, the energy usage action, the reward, and the number of times the current system state has been visited to generate a simulation environment, and, in the simulation environment, computing the reward obtained by arranging the energy usage actions under each of the system states and updating the reinforcement learning table accordingly.

2. The method of claim 1, wherein the step of computing the reward obtained by arranging the energy usage actions under each of the system states comprises: in the simulation environment, sequentially selecting one of the system states in the reinforcement learning table according to the visit counts, and randomly selecting one of the energy usage actions under the selected system state, so as to compute the reward obtained by adopting the selected energy usage action under the selected system state.

3. The method of claim 2, wherein the step of computing the reward obtained by arranging the energy usage actions under each of the system states and updating the reinforcement learning table accordingly comprises: comparing the currently computed reward with a previously recorded reward, and, when the currently computed reward is greater than the previously recorded reward, updating the reinforcement learning table with the currently selected energy usage action.

4. The method of claim 2, wherein the step of computing the reward obtained by arranging the energy usage actions under each of the system states and updating the reinforcement learning table accordingly further comprises: determining whether the number of system states sequentially selected for computing the reward to update the reinforcement learning table exceeds a predetermined proportion, and ending the update of the reinforcement learning table when the number exceeds the predetermined proportion.

5. The method of claim 1, further comprising: determining whether the computed change rate of the reward exceeds a preset threshold, and generating the simulation environment to update the reinforcement learning table when the change rate exceeds the preset threshold.

6. The method of claim 1, wherein the step of selecting, according to the reinforcement learning table, an energy usage action suited to be arranged under the current system state comprises: selecting the optimal action among the plurality of energy usage actions recorded in the reinforcement learning table for the current system state.

7. The method of claim 1, wherein the power demand in the system states comprises a regular demand and an urgent demand, wherein the urgent demand is a power demand that satisfies at least one urgency condition.

8. The method of claim 1, wherein the system states further comprise the renewable energy of the charging station.

9. The method of claim 1, wherein the energy usage action comprises the charging/discharging demand of the charging station and the battery charging/discharging amount.

10. The method of claim 1, wherein the trading electricity arranged by the coordinator device comprises electricity traded with the other charging stations, electricity purchased from a power plant, and electricity sold back to the power plant.

11. A reinforcement learning-based charging station energy usage planning device, deployed at a designated charging station, comprising: a connection device connecting a coordinator device, the coordinator device being configured to manage a plurality of charging stations in an energy-sharing area including the designated charging station; a storage device storing a computer program; and a processor, coupled to the connection device and the storage device, configured to load and execute the computer program to: define a plurality of system states using the power demand of the designated charging station, the remaining battery energy, and the global power demand and internal electricity price of the energy-sharing area, and estimate an expected return of arranging energy usage actions under each of the system states to construct a reinforcement learning table, wherein the global power demand is obtained by the coordinator device integrating the power demands uploaded by each of the charging stations; select, according to the reinforcement learning table, an energy usage action suited to be arranged under a current system state and upload it to the coordinator device, and update the reinforcement learning table according to the trading electricity arranged by the coordinator device and the calculated reward for adopting the energy usage action; and record the current system state, the energy usage action, the reward, and the number of times the current system state has been visited to generate a simulation environment, and, in the simulation environment, compute the reward obtained by arranging the energy usage actions under each of the system states and update the reinforcement learning table accordingly.

12. The charging station energy usage planning device of claim 11, wherein the processor, in the simulation environment, sequentially selects one of the system states in the reinforcement learning table according to the visit counts, and randomly selects one of the energy usage actions under the selected system state, so as to compute the reward obtained by adopting the selected energy usage action under the selected system state.

13. The charging station energy usage planning device of claim 12, wherein the processor compares the currently computed reward with a previously recorded reward, and, when the currently computed reward is greater than the previously recorded reward, updates the reinforcement learning table with the currently selected energy usage action.

14. The charging station energy usage planning device of claim 12, wherein the processor further determines whether the number of system states sequentially selected for computing the reward to update the reinforcement learning table exceeds a predetermined proportion, and ends the update of the reinforcement learning table when the number exceeds the predetermined proportion.

15. The charging station energy usage planning device of claim 11, wherein the processor further determines whether the computed change rate of the reward exceeds a preset threshold, and generates the simulation environment to update the reinforcement learning table when the change rate exceeds the preset threshold.

16. The charging station energy usage planning device of claim 11, wherein the power demand in the system states comprises a regular demand and an urgent demand, wherein the urgent demand is a power demand that satisfies at least one urgency condition.

17. The charging station energy usage planning device of claim 11, wherein the processor selects the optimal action among the plurality of energy usage actions recorded in the reinforcement learning table for the current system state.

18. The charging station energy usage planning device of claim 11, wherein the system states further comprise the renewable energy of the charging station.

19. The charging station energy usage planning device of claim 11, wherein the energy usage action comprises the charging/discharging demand of the charging station and the battery charging/discharging amount.

20. The charging station energy usage planning device of claim 11, wherein the trading electricity arranged by the coordinator device comprises electricity traded with the other charging stations, electricity purchased from a power plant, and electricity sold back to the power plant.

