TWI767868B - Method and apparatus for planning energy usage of charging station based on reinforcement learning


Info

Publication number
TWI767868B
TWI767868B TW110141537A
Authority
TW
Taiwan
Prior art keywords
reinforcement learning
charging station
energy
reward
learning table
Prior art date
Application number
TW110141537A
Other languages
Chinese (zh)
Other versions
TW202320002A (en)
Inventor
江坤諺
邱偉育
Original Assignee
國立清華大學
Priority date
Filing date
Publication date
Application filed by 國立清華大學
Priority to TW110141537A
Application granted
Publication of TWI767868B
Publication of TW202320002A

Abstract

A method and an apparatus for planning the energy usage of a charging station based on reinforcement learning are provided. In the method, multiple system states are defined using the power demand and remaining battery energy of the charging station itself, together with the global power demand and internal power price of an energy-sharing area, and the expected returns of arranging energy usage actions under each system state are estimated to construct a reinforcement learning table. According to the reinforcement learning table, an energy usage action suited to the current system state is selected and uploaded to a coordinator device, and the trading electricity arranged by the coordinator device, together with the reward for adopting the energy usage action calculated by the coordinator device, is used to update the reinforcement learning table. The current system state, the energy usage action, the reward, and the number of times the system state has been visited are recorded and used to generate a simulation environment, so as to calculate the overall benefit of arranging the energy usage actions under each system state and update the reinforcement learning table accordingly.

Description

Method and device for energy usage planning of a charging station based on reinforcement learning

The present invention relates to a reinforcement learning method and device, and in particular to a reinforcement learning-based method and device for planning the energy usage of a charging station.

In recent years, growing environmental awareness has led many people to adopt electric vehicles, and with the substantial increase in electric vehicle users, the demand for electric vehicle charging stations has risen accordingly. However, because electric vehicle users have different habits, their demands on charging stations differ; when a large number of electric vehicles charge at once, charging across stations becomes uncoordinated and negatively affects the overall power grid.

Earlier approaches to energy usage planning among multiple electric vehicle charging stations employed nonlinear programming algorithms, which require real-time forecasts of prices, electric vehicle demand, and renewable energy data, making performance difficult to improve. To address this problem, some literature has proposed model-free reinforcement learning algorithms, but these converge slowly, resulting in higher costs and wasted energy, and therefore cannot maximize the overall benefit of the charging stations.

The present invention provides a reinforcement learning-based method and device for planning the energy usage of charging stations, which can properly arrange the battery charging/discharging and energy supply of each charging station so as to maximize the overall benefit of the charging stations.

The present invention provides a reinforcement learning-based method for planning the energy usage of a charging station, suitable for planning energy usage by a designated charging station among multiple charging stations in an energy-sharing area. The method includes: defining multiple system states using the station's own power demand, remaining battery energy, and the global power demand and internal electricity price of the energy-sharing area, and estimating the expected return of arranging energy usage actions under each system state to construct a reinforcement learning table, where the global power demand is obtained by a coordinator device integrating the power demands uploaded by the charging stations; selecting, according to the reinforcement learning table, an energy usage action suited to be arranged under the current system state and uploading it to the coordinator device, and updating the reinforcement learning table according to the trading electricity arranged by the coordinator device and the calculated reward for adopting this energy usage action; and recording the current system state, the energy usage action, the reward, and the number of times the current system state has been visited to generate a simulation environment, in which the reward obtained by arranging the energy usage actions under each system state is computed and the reinforcement learning table is updated accordingly.

The present invention provides a reinforcement learning-based charging station energy usage planning device, which is deployed at a designated charging station. The charging station energy usage planning device includes a connection device, a storage device, and a processor. The connection device is used to connect to a coordinator device, which manages multiple charging stations in an energy-sharing area, including the designated charging station. The storage device is used to store a computer program. The processor is coupled to the connection device and the storage device and is configured to load and execute the computer program to: define multiple system states using the power demand of the designated charging station, the remaining battery energy, and the global power demand and internal electricity price of the energy-sharing area, and estimate the expected return of arranging energy usage actions under each system state to construct a reinforcement learning table, where the global power demand is obtained by the coordinator device integrating the power demands uploaded by the charging stations; select, according to the reinforcement learning table, an energy usage action suited to be arranged under the current system state and upload it to the coordinator device, and update the reinforcement learning table according to the trading electricity arranged by the coordinator device and the calculated reward for adopting this energy usage action; and record the current system state, the energy usage action, the reward, and the number of times the current system state has been visited to generate a simulation environment, in which the reward obtained by arranging the energy usage actions under each system state is computed and the reinforcement learning table is updated accordingly.

To make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.

Embodiments of the present invention apply reinforcement learning to charging stations. Based on power demand information from the outside world, a model-based multi-agent reinforcement learning algorithm iteratively updates and, over fixed time periods, plans the energy usage of the electric vehicle charging stations, arranging each station's battery charging/discharging and the energy it supplies to electric vehicles so as to maximize the overall benefit of the charging stations.

FIG. 1 is a schematic diagram of an energy sharing system according to an embodiment of the present invention. Referring to FIG. 1, the energy sharing system 10 of this embodiment is suitable for a cooperation area of electric vehicle charging stations, which includes multiple electric vehicle charging stations EVCS 1 to EVCS I (where I is a positive integer) and at least one coordinator device 12 responsible for relaying information. Each charging station EVCS 1 to EVCS I in the area is equipped with an energy storage system (ESS) and can sell surplus electricity to, or purchase the electricity it lacks from, the other electric vehicle charging stations. The charging stations EVCS 1 to EVCS I decide the electricity supplied to electric vehicles and adjust their charging/discharging strategies appropriately, according to the real-time price provided by the power plant 14, the charging demand of their own electric vehicle users, the remaining energy of their storage systems, and the charging demand of all electric vehicle users.

FIG. 2 is a block diagram of a reinforcement learning-based charging station energy usage planning device according to an embodiment of the present invention. Referring to FIG. 1 and FIG. 2 together, the charging station energy usage planning device 20 of this embodiment is, for example, deployed in the charging station EVCS 1 of FIG. 1, but in other embodiments it may also be deployed in the other charging stations of FIG. 1. The charging station energy usage planning device 20 is, for example, a computing device with computational capability, such as a file server, a database server, an application server, a workstation, or a personal computer, and includes components such as a connection device 22, a storage device 24, and a processor 26, whose functions are described below:

The connection device 22 is, for example, any wired or wireless interface device connectable to the coordinator device 12, and can be used to upload the power demand of the charging station EVCS 1 itself to the coordinator device 12 and to receive the global power demand returned by the coordinator device 12. For wired connections, the connection device 22 may be an interface such as universal serial bus (USB), RS232, universal asynchronous receiver/transmitter (UART), inter-integrated circuit (I2C), serial peripheral interface (SPI), DisplayPort, or Thunderbolt, but is not limited thereto. For wireless connections, the connection device 22 may be a device supporting communication protocols such as wireless fidelity (Wi-Fi), RFID, Bluetooth, infrared, near-field communication (NFC), or device-to-device (D2D), and is likewise not limited thereto. In some embodiments, the connection device 22 may also include a network card supporting Ethernet or wireless network standards such as 802.11g, 802.11n, and 802.11ac, so that the charging station energy usage planning device 20 can connect to the coordinator device 12 via a network to upload or receive data such as the power demand, the global power demand, and the trading electricity.

The storage device 24 is, for example, any type of fixed or removable random access memory (RAM), read-only memory (ROM), flash memory, hard disk, a similar component, or a combination thereof, and is used to store the computer program executable by the processor 26. In some embodiments, the storage device 24 may, for example, also store the reinforcement learning table established by the processor 26 and the global power demand received by the connection device 22 from the coordinator device 12.

The processor 26 is, for example, a central processing unit (CPU) or another programmable general-purpose or special-purpose microprocessor, microcontroller, digital signal processor (DSP), programmable controller, application-specific integrated circuit (ASIC), programmable logic device (PLD), or other similar device or a combination of these devices; the present invention is not limited in this respect. In this embodiment, the processor 26 can load the computer program from the storage device 24 to execute the reinforcement learning-based charging station energy usage planning method of the embodiments of the present invention.

The reinforcement learning-based charging station energy usage planning method of the embodiments of the present invention, for example, models the operation of multiple electric vehicle charging stations as a Markov decision process (MDP), treats each electric vehicle charging station as an agent for learning, and discretizes operation time into time slots; to improve planning efficiency, an episodic setting is adopted, for example.

In detail, FIG. 3 is a flowchart of a reinforcement learning-based charging station energy usage planning method according to an embodiment of the present invention. Referring to FIG. 1, FIG. 2, and FIG. 3 together, the method of this embodiment is applicable to the charging station energy usage planning device 20 described above; the detailed steps of the charging station energy usage planning method of this embodiment are described below in conjunction with the components of the charging station energy usage planning device 20.

In step S302, the processor 26 of the charging station energy usage planning device 20 defines multiple system states using the power demand of the charging station EVCS 1, the remaining battery energy, and the global power demand and internal electricity price of the energy-sharing area, and estimates the expected return of arranging energy usage actions under each system state to construct a reinforcement learning table. The processor 26, for example, uploads its own power demand to the coordinator device 12 of the energy-sharing area via the connection device 22, and receives the global power demand obtained by the coordinator device 12 integrating the power demands uploaded by the charging stations EVCS 1 to EVCS I (i.e., the overall power demand of the charging stations); the coordinator device 12 notifies each charging station of the current global power demand according to the area in which the station is located.
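As a sketch of this aggregation step, the coordinator could simply sum the demands uploaded by the stations and broadcast the total. This is a minimal illustration only, under the assumption that integration means summation; the patent does not specify the coordinator's implementation, and the function name and data layout below are hypothetical.

```python
def aggregate_global_demand(uploaded_demands: dict[int, float]) -> float:
    """Coordinator side: integrate per-station demands {station_id: demand}
    into the global power demand D(t) announced back to every station."""
    return sum(uploaded_demands.values())
```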

In detail, the processor 26 is, for example, given a state space $S$ and an action space $A$. The state in time slot $t$ is denoted $s(t)$, where $s(t) \in S$, and the action selected in state $s(t)$ during time slot $t$ is denoted $a(t)$, where $a(t) \in A$. After action $a(t)$ is selected in state $s(t)$, the environment transitions to the next state $s(t+1)$ and produces an overall benefit $P(t)$. The probability function of selecting action $a(t)$ in state $s(t)$ is denoted as the policy $\pi(s(t), a(t))$, and the action-value function (i.e., the Q-function) $Q^{\pi}(s, a)$, which evaluates the expected value of the cumulative benefit of following policy $\pi$ from time slot $t$, can be defined as:

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} P(t+k) \,\middle|\, s(t) = s,\ a(t) = a\right], \quad \forall s \in S,\ a \in A,$$

where $\gamma$ is the discount factor.

In this embodiment, the processor 26, for example, defines the state $s_i(t)$ of the $i$-th charging station in time slot $t$ as:

$$s_i(t) = \left(D(t),\ B_i(t),\ d_i(t),\ p(t)\right),$$

where $D(t)$ is the global power demand of the energy-sharing area in time slot $t$, $B_i(t)$ is the battery energy of the $i$-th charging station, $d_i(t)$ is the power demand of the $i$-th charging station, and $p(t)$ is the internal electricity price of the energy-sharing area, such as the real-time price provided by the power plant. Here, $D(t)$ serves as an observational indicator that helps a charging station learn the effects of the other charging stations' actions and improves learning efficiency.

In other embodiments, the processor 26, for example, defines the state $s_i(t)$ of the $i$-th charging station in time slot $t$ as:

$$s_i(t) = \left(D(t),\ B_i(t),\ d_i^{u}(t),\ d_i^{r}(t),\ e_i(t),\ p(t)\right),$$

where $D(t)$ is the global power demand of the energy-sharing area in time slot $t$, $B_i(t)$ is the battery energy of the $i$-th charging station, $d_i^{u}(t)$ is the urgent demand of the $i$-th charging station, $d_i^{r}(t)$ is the regular demand of the $i$-th charging station, $e_i(t)$ is the renewable energy of the $i$-th charging station, and $p(t)$ is the internal electricity price of the energy-sharing area. The urgent demand is, for example, a power demand that satisfies at least one urgency condition, where the urgency condition is, for example, a charging-related constraint such as a charging-time limit (e.g., one hour) or a charging-amount limit, without limitation here.

The action of each charging station can be defined as:

$$a_i(t) = \left(g_i(t),\ b_i(t)\right),$$

where $g_i(t)$ is the charging/discharging (trading) demand and $b_i(t)$ is the battery charging/discharging amount. When $g_i(t)$ is positive, the charging station needs to purchase electricity; when $g_i(t)$ is negative, the charging station can sell electricity.
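To make the state and action encoding concrete, the following is a minimal Python sketch. The class and field names are hypothetical and chosen only for readability; discretized integer fields and frozen (hashable) dataclasses are assumptions made so that states and actions can serve as keys of a tabular Q-table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StationState:
    """System state s_i(t) of the i-th charging station (first variant above)."""
    global_demand: int  # D(t): global power demand of the energy-sharing area
    battery: int        # B_i(t): battery energy of the station
    demand: int         # d_i(t): power demand of the station
    price: int          # p(t): internal electricity price (discretized)

@dataclass(frozen=True)
class StationAction:
    """Energy usage action a_i(t)."""
    trade: int          # g_i(t): > 0 means buy electricity, < 0 means sell
    battery_delta: int  # b_i(t): battery charging (+) / discharging (-) amount
```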

In step S304, the processor 26 selects, according to the reinforcement learning table, an energy usage action suited to be arranged under the current system state and uploads it to the coordinator device 12, and updates the reinforcement learning table according to the trading electricity arranged by the coordinator device 12 and the calculated reward for adopting this energy usage action.

In some embodiments, the processor 26, for example, selects the optimal action among the multiple energy usage actions recorded in the reinforcement learning table for the current system state, and transmits the current system state together with the selected energy usage action to the coordinator device 12 via the connection device 22. The coordinator device 12 then computes the trading electricity arranged for the charging station, including the amount of energy shared with other charging stations and the electricity purchased from or sold to the power plant 14. The processor 26 also, for example, transmits the benefit information obtained by the charging station to the coordinator device 12, which computes the overall benefit of all charging stations; this overall benefit serves as the reward for the processor 26 adopting the energy usage action.

In detail, the optimization problem of each charging station is to find, from the current system state, the optimal policy $\pi^{*}$ that maximizes the expected value of the overall benefit, and the optimal action-value function can be denoted $Q^{*}(s, a)$. The optimal policy $\pi^{*}$ is selected based on:

$$\pi^{*}(s) = \arg\max_{a \in A_i} Q^{*}(s, a), \quad \forall s \in S_i,\ a \in A_i,$$

where $S_i$ is the state range of the charging station and $A_i$ is the action range of the charging station, which is, for example, the range that satisfies the electricity supply to electric vehicles and the charging/discharging amount in time slot $t$, and depends on the charging station's own power demand and the remaining energy of its storage system.

According to the trading electricity computed by the coordinator device 12 and the overall benefit of all charging stations, the processor 26 can update the learning value $Q(s_i(t), a_i(t))$ in the reinforcement learning table according to:

$$Q(s_i(t), a_i(t)) \leftarrow (1 - \alpha)\, Q(s_i(t), a_i(t)) + \alpha \left[r(t) + \gamma \max_{a'} Q(s_i(t+1), a')\right],$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $\max_{a'} Q(s_i(t+1), a')$ is the learning value obtained by arranging the trading electricity $a'$ under system state $s_i(t+1)$. The learning rate $\alpha$ is, for example, any value between 0.1 and 0.5, and determines the proportion by which the new system state $s_i(t+1)$ influences the learning value of the original system state $s_i(t)$. The discount factor $\gamma$ is, for example, any value between 0.9 and 0.99, and determines the ratio of the learning value of the new system state $s_i(t+1)$ relative to the returned reward $r(t)$.
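A minimal sketch of this tabular update follows, assuming the Q-table is a dictionary keyed by (state, action) pairs as in the sketch above. The `actions_in` callback is a hypothetical helper returning the admissible action range $A_i$ of a state, and the particular $\alpha$ and $\gamma$ values merely fall inside the ranges quoted in the text.

```python
from collections import defaultdict

Q = defaultdict(float)  # learning values Q(s, a), defaulting to 0

def q_update(s, a, reward, s_next, actions_in, alpha=0.3, gamma=0.95):
    """One update: Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions_in(s_next))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (reward + gamma * best_next)
```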

In step S306, the processor 26 records the current system state, the energy usage action, the reward, and the number of times the current system state has been visited to generate a simulation environment, and, in the simulation environment, computes the reward obtained by arranging the energy usage actions under each system state, updating the reinforcement learning table accordingly.

In detail, when a charging station actually operates, at every time slot it records its system state, the action performed, the reward obtained by performing the action, and the number of times each system state has been visited, and the recorded data can be used to generate a simulation environment for learning. A higher visit count for a system state indicates a higher probability that the state will occur in the future; the visit count therefore determines the priority of the system state during the planning process.
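One way to realize this record keeping is a small experience model that stores observed rewards and counts state visits, later serving as the simulation environment. The sketch below is an assumed rendering of the embodiment's bookkeeping, with all names hypothetical.

```python
from collections import defaultdict

class ExperienceModel:
    """Recorded experience: (state, action) -> reward, plus state visit counts."""
    def __init__(self):
        self.reward = {}                 # (s, a) -> last reward observed for the pair
        self.visits = defaultdict(int)   # s -> number of times state s was visited
        self.actions = defaultdict(set)  # s -> set of actions already tried in s

    def record(self, s, a, r):
        self.reward[(s, a)] = r
        self.visits[s] += 1
        self.actions[s].add(a)

    def states_by_priority(self):
        """States sorted by visit count: frequently visited states are more
        likely to recur, so they are planned first."""
        return sorted(self.visits, key=self.visits.get, reverse=True)
```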

After the reinforcement learning table is established, the generated simulation environment can be used for learning locally. In some embodiments, to have enough data for local learning, planning is executed in units of episodes (i.e., planning is performed at fixed intervals). And to avoid wasting system resources on unnecessary planning, whether to enter planning can further be decided according to the change in the overall benefit.

In detail, FIG. 4 is a flowchart of a reinforcement learning-based charging station energy usage planning method according to an embodiment of the present invention. Referring to FIG. 1, FIG. 2, FIG. 3, and FIG. 4 together, this embodiment describes the detailed steps of step S306 of the embodiment of FIG. 3.

In step S402, the processor 26 records the current system state, the energy usage action, the reward, and the number of times the current system state has been visited to generate a simulation environment, and, in the simulation environment, computes the reward obtained by arranging the energy usage actions under each system state. Here, the processor 26, for example, transmits to the coordinator device 12 the benefit information the charging station would obtain by selecting the energy usage action under the current system state, and takes the overall benefit of all charging stations computed by the coordinator device 12 as the reward for adopting this energy usage action.

In step S404, the processor 26 computes the change rate of the overall benefit and determines whether this change rate exceeds a preset threshold. The change rate $\Delta(t)$ is given by:

$$\Delta(t) = \frac{\left|\bar{P}(t) - \bar{P}(t-1)\right|}{\bar{P}(t-1)},$$

where $\bar{P}(t)$ denotes the overall benefit (average benefit) computed at time $t$ by the coordinator device 12 from the benefit information of the charging stations, and $\Delta(t)$ denotes the change rate of the overall benefit from time $t-1$ to time $t$. In step S404, the processor 26 determines from this change rate $\Delta(t)$ whether planning should be entered.

If the change rate $\Delta(t)$ is greater than the preset threshold $\delta$, then in step S406 the processor 26 plans the energy usage actions in the reinforcement learning table. Conversely, if the change rate $\Delta(t)$ is not greater than the preset threshold $\delta$, or once the planning of the reinforcement learning table is complete, then in step S408 the processor 26 waits until the charging station enters the next system state, and then, according to the updated reinforcement learning table, selects the energy usage action suited to be arranged under the next system state and performs the update of the reinforcement learning table.

FIG. 5 is a flowchart of a reinforcement learning-based charging station energy usage planning method according to an embodiment of the present invention. Referring to FIG. 1, FIG. 2, FIG. 4, and FIG. 5 together, this embodiment describes the detailed steps of step S406 of the embodiment of FIG. 4.

In step S502, the processor 26 selects one system state at a time according to the visit counts of the system states recorded in the reinforcement learning table. For example, the processor 26 sorts the system states by their recorded visit counts and selects the system state with the highest visit count for local learning.

In step S504, the processor 26 randomly selects one of the energy usage actions recorded in the reinforcement learning table for that system state, and uses it to compute the reward obtained by adopting the selected energy usage action under the selected system state. During this simulation, the processor 26, for example, uploads the currently selected system state and energy usage action to the coordinator device 12 using the method described in the foregoing embodiments, and the coordinator device 12 computes the overall benefit of all charging stations and provides it to the processor 26 as the reward, which is used to determine whether to update the reinforcement learning table.

In step S506, the processor 26 determines whether the reward obtained by adopting the selected energy usage action under the currently selected system state is greater than the previously recorded reward. If the current reward is not greater than the previous reward, the flow returns to step S504 to reselect an energy usage action and recompute the reward.

If the current reward is greater than the previous reward, then in step S508 the processor 26 updates the reinforcement learning table with the currently selected energy usage action, for example by updating the learning value in the reinforcement learning table for selecting this energy usage action under the selected system state.

In step S510, the processor 26 determines whether the number of system states already selected during the planning process for computing rewards to update the reinforcement learning table exceeds a predetermined proportion. The proportion is, for example, one quarter or another value, without limitation here.

If the number of selected system states does not exceed the predetermined proportion, the flow returns to step S502, and the processor 26 selects the next system state to update the reinforcement learning table. Conversely, if the number of selected system states exceeds the predetermined proportion, then in step S512 the processor 26 ends the planning process. Limiting the number of system states selected during planning greatly accelerates learning. After the planning process ends, the updated reinforcement learning table has accumulated a certain amount of experience and can therefore, in actual operation, provide the charging station with a charging/discharging strategy suited to the current system state, maximizing the overall benefit of the charging stations.
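Putting steps S502 to S512 together, the following is a minimal sketch of the local planning loop, reusing the hypothetical `ExperienceModel` and Q-table sketched earlier. The `simulate_reward` callback stands in for the coordinator-side overall-benefit computation replayed in the simulation environment, and the simple proportional Q adjustment is a simplification; only the visit-count ordering, the random action choice, the improvement test, and the budgeted fraction of states follow the description above.

```python
import random

def plan(model, Q, simulate_reward, state_fraction=0.25, alpha=0.3):
    """Local planning over the simulated environment (steps S502 to S512)."""
    states = model.states_by_priority()                 # S502: most-visited states first
    budget = max(1, int(len(states) * state_fraction))  # S510: predetermined proportion, e.g. 1/4
    for s in states[:budget]:
        best_prev = max(model.reward[(s, a)] for a in model.actions[s])
        candidates = list(model.actions[s])
        random.shuffle(candidates)                      # S504: try actions in random order
        for a in candidates:
            r = simulate_reward(s, a)                   # reward from the simulated environment
            if r > best_prev:                           # S506: better than the recorded reward?
                model.reward[(s, a)] = r
                Q[(s, a)] += alpha * (r - Q[(s, a)])    # S508: update the learning value
                break
    # S512: planning ends once the budgeted fraction of states has been processed
```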

In summary, in the reinforcement learning-based charging station energy usage planning method and device of the embodiments of the present invention, the electricity supplied to electric vehicles is decided according to the information within the charging stations' energy-sharing area and the environmental data at the time, and the reinforcement learning table is updated through iterative updates and by planning the charging stations' energy usage over fixed time periods. This accelerates the learning speed of the multi-agent learning model so that it adapts to the environment quickly, and maximizes the overall benefit of the charging stations.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone with ordinary knowledge in the relevant technical field may make some changes and modifications without departing from the spirit and scope of the present invention; therefore, the protection scope of the present invention shall be determined by the appended claims.

10: energy sharing system; 12: coordinator device; 14: power plant; 20: charging station energy usage planning device; 22: connection device; 24: storage device; 26: processor; EVCS 1–EVCS I: charging stations; S302–S306, S402–S408, S502–S512: steps

FIG. 1 is a schematic diagram of an energy sharing system according to an embodiment of the present invention. FIG. 2 is a block diagram of a reinforcement learning-based charging station energy usage planning device according to an embodiment of the present invention. FIG. 3 is a flowchart of a reinforcement learning-based charging station energy usage planning method according to an embodiment of the present invention. FIG. 4 is a flowchart of a reinforcement learning-based charging station energy usage planning method according to an embodiment of the present invention. FIG. 5 is a flowchart of a reinforcement learning-based charging station energy usage planning method according to an embodiment of the present invention.

S302–S306: steps

Claims (20)

1. A reinforcement learning-based method for planning the energy usage of a charging station, suitable for planning energy usage by a designated charging station among a plurality of charging stations in an energy-sharing area, the method comprising the following steps: defining a plurality of system states using the station's own power demand, remaining battery energy, and the global power demand and internal electricity price of the energy-sharing area, and estimating an expected return of arranging energy usage actions under each of the system states to construct a reinforcement learning table, wherein the global power demand is obtained by a coordinator device integrating the power demands uploaded by each of the charging stations; selecting, according to the reinforcement learning table, an energy usage action suited to be arranged under a current system state and uploading it to the coordinator device, and updating the reinforcement learning table according to the trading electricity arranged by the coordinator device and the calculated reward for adopting the energy usage action; and recording the current system state, the energy usage action, the reward, and the number of times the current system state has been visited to generate a simulation environment, and, in the simulation environment, computing the reward obtained by arranging the energy usage actions under each of the system states and updating the reinforcement learning table accordingly.

2. The method of claim 1, wherein the step of computing the reward obtained by arranging the energy usage actions under each of the system states comprises: in the simulation environment, sequentially selecting one of the system states in the reinforcement learning table according to the visit counts, and randomly selecting one of the energy usage actions under the selected system state, so as to compute the reward obtained by adopting the selected energy usage action under the selected system state.

3. The method of claim 2, wherein the step of computing the reward obtained by arranging the energy usage actions under each of the system states and updating the reinforcement learning table accordingly comprises: comparing the currently computed reward with a previously recorded reward, and, when the currently computed reward is greater than the previously recorded reward, updating the reinforcement learning table with the currently selected energy usage action.

4. The method of claim 2, wherein the step of computing the reward obtained by arranging the energy usage actions under each of the system states and updating the reinforcement learning table accordingly further comprises: determining whether the number of system states sequentially selected for computing the reward to update the reinforcement learning table exceeds a predetermined proportion, and ending the update of the reinforcement learning table when the number exceeds the predetermined proportion.

5. The method of claim 1, further comprising: determining whether the computed change rate of the reward exceeds a preset threshold, and generating the simulation environment to update the reinforcement learning table when the change rate exceeds the preset threshold.

6. The method of claim 1, wherein the step of selecting, according to the reinforcement learning table, an energy usage action suited to be arranged under the current system state comprises: selecting the optimal action among the plurality of energy usage actions recorded in the reinforcement learning table for the current system state.

7. The method of claim 1, wherein the power demand in the system states comprises a regular demand and an urgent demand, wherein the urgent demand is a power demand that satisfies at least one urgency condition.

8. The method of claim 1, wherein the system states further comprise the renewable energy of the charging station.

9. The method of claim 1, wherein the energy usage action comprises the charging/discharging demand of the charging station and the battery charging/discharging amount.

10. The method of claim 1, wherein the trading electricity arranged by the coordinator device comprises electricity traded with the other charging stations, electricity purchased from a power plant, and electricity sold back to the power plant.

11. A reinforcement learning-based charging station energy usage planning device, deployed at a designated charging station, comprising: a connection device connecting a coordinator device, the coordinator device being configured to manage a plurality of charging stations in an energy-sharing area including the designated charging station; a storage device storing a computer program; and a processor, coupled to the connection device and the storage device, configured to load and execute the computer program to: define a plurality of system states using the power demand of the designated charging station, the remaining battery energy, and the global power demand and internal electricity price of the energy-sharing area, and estimate an expected return of arranging energy usage actions under each of the system states to construct a reinforcement learning table, wherein the global power demand is obtained by the coordinator device integrating the power demands uploaded by each of the charging stations; select, according to the reinforcement learning table, an energy usage action suited to be arranged under a current system state and upload it to the coordinator device, and update the reinforcement learning table according to the trading electricity arranged by the coordinator device and the calculated reward for adopting the energy usage action; and record the current system state, the energy usage action, the reward, and the number of times the current system state has been visited to generate a simulation environment, and, in the simulation environment, compute the reward obtained by arranging the energy usage actions under each of the system states and update the reinforcement learning table accordingly.

12. The charging station energy usage planning device of claim 11, wherein the processor, in the simulation environment, sequentially selects one of the system states in the reinforcement learning table according to the visit counts, and randomly selects one of the energy usage actions under the selected system state, so as to compute the reward obtained by adopting the selected energy usage action under the selected system state.

13. The charging station energy usage planning device of claim 12, wherein the processor compares the currently computed reward with a previously recorded reward, and, when the currently computed reward is greater than the previously recorded reward, updates the reinforcement learning table with the currently selected energy usage action.

14. The charging station energy usage planning device of claim 12, wherein the processor further determines whether the number of system states sequentially selected for computing the reward to update the reinforcement learning table exceeds a predetermined proportion, and ends the update of the reinforcement learning table when the number exceeds the predetermined proportion.

15. The charging station energy usage planning device of claim 11, wherein the processor further determines whether the computed change rate of the reward exceeds a preset threshold, and generates the simulation environment to update the reinforcement learning table when the change rate exceeds the preset threshold.

16. The charging station energy usage planning device of claim 11, wherein the power demand in the system states comprises a regular demand and an urgent demand, wherein the urgent demand is a power demand that satisfies at least one urgency condition.

17. The charging station energy usage planning device of claim 11, wherein the processor selects the optimal action among the plurality of energy usage actions recorded in the reinforcement learning table for the current system state.

18. The charging station energy usage planning device of claim 11, wherein the system states further comprise the renewable energy of the charging station.

19. The charging station energy usage planning device of claim 11, wherein the energy usage action comprises the charging/discharging demand of the charging station and the battery charging/discharging amount.

20. The charging station energy usage planning device of claim 11, wherein the trading electricity arranged by the coordinator device comprises electricity traded with the other charging stations, electricity purchased from a power plant, and electricity sold back to the power plant.

