TWI762128B

TWI762128B - Device and method of energy management using multi-object reinforcement learning

Info

Publication number: TWI762128B
Application number: TW109146801A
Authority: TW
Inventors: 邱偉育; 陳頌仁
Original assignee: 國立清華大學
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2022-04-21
Also published as: TW202226700A

Abstract

The present invention discloses a device of energy management system using multi-object reinforcement learning, which is coupled with a renewable energy device, an energy storage device and a plurality of energy using appliances. According to the present invention, the device is provided a first RL unit and a second RL device therein, wherein the first RL unit is configured for confucting a first reinforcement learning based on an electricity tariff provided by a power utility and a demand request provided by a user, so as to generate a first scheduling policy for controlling the energy using appliances. On the other hand, the second RL unit is configured for confucting a second reinforcement learning based on the electricity tariff and at least one state parameter of the energy storage device, so as to generate a second scheduling policy for controlling the charing process and the discharing process of the energy storage device. Consequently, level of the user’s satisfaction therefore rises in case of the electricity cost being significantly lowered.

Description

Energy management device and method using multi-objective reinforcement learning

本發明係關於智慧能源管理(Intelligent energy managment)之技術領域，尤指一種利用多目標強化學習之能源管理裝置及方法。The present invention relates to the technical field of intelligent energy management, and more particularly, to an energy management device and method utilizing multi-objective reinforcement learning.

科技的發展係增進生活便利性且帶來良好的生活品質，同時卻也導致能源的大量使用，致使能耗(Energy consumption)成為家庭和小型企業等用戶之開支的重要部分。故而，如何適當處理能耗開支成為用戶極度重視的課題，於是能源管理系統(Energy management system, EMS)近年來被廣泛研究用以妥善管理電力使用，期望能夠在電力需求和能耗之間取得平衡。The development of science and technology improves the convenience of life and brings good quality of life, but also leads to a large amount of energy consumption, so that energy consumption has become an important part of the expenses of users such as households and small businesses. Therefore, how to properly handle energy consumption has become a topic that users attach great importance to. Therefore, energy management systems (EMS) have been widely studied in recent years to properly manage electricity usage, hoping to achieve a balance between electricity demand and energy consumption. .

應知道，電力公司對用戶所收取之電費一般包含二種類別：電能費用(energy charge)及需量費用(demand charge)。其中，電能費用為用戶於一段期間(如：一個月帳單期間)內所使用之總電量所衍生之費用，其計價單位為「千瓦．小時(kWh)」。更詳細地說明，電力公司通常實施分時電價制度(即，不同時間區間設定不同電價費率)，藉此引導用戶降低其尖峰時間區間之用電量。另一方面，需量費用則是電力公司依據用戶於一段期間內之最大需量(maximum demand)所收取的費用，其計價單位為「千瓦(kW)」。更詳細地說明，電力公司依事先與用戶簽訂的契約容量(contracted capacity)收取固定的需量費用，如果用戶實際用電的最大需量超過該契約容量時，再額外收取超約金額。It should be known that electricity charges charged by power companies to users generally include two categories: energy charges and demand charges. Among them, the electricity cost is the cost derived from the total electricity used by the user during a period (eg, a monthly billing period), and the unit of measurement is "kilowatt-hour (kWh)". In more detail, power companies usually implement a time-of-use electricity price system (ie, different electricity price rates are set in different time intervals), thereby guiding users to reduce their electricity consumption during peak time intervals. Demand charges, on the other hand, are charges charged by power companies based on the maximum demand from users within a period of time, and the unit of measure is "kilowatts (kW)." In more detail, the power company charges a fixed demand fee based on the contracted capacity signed with the user in advance. If the actual maximum demand of the user exceeds the contracted capacity, an additional over-contracted amount will be charged.

由前述說明可知，需量響應(Demand Response, DR)在能源管理系統之中扮演著重要角色。有鑑於此，文獻一提出可應用在能源管理系統之中的一種基於需量響應之優化的功率調度方法。於此，文獻一指的是Zhao et.al, “An optimal power scheduling method for demand response in home energy management system,” IEEE Trans. Smart Grid, vol. 4, no. 3, pp. 1391–1400, Sep. 2013。所述方法結合實時電價(real-time pricing)以及累進電價(inclining block rate, IBR)設計出應用在所述能源管理系統之一主控制器的一功率調度算法。實驗結果證實，通過採用此功率調度算法，所述能源管理系統可以降低用戶的能耗開支。可惜的是，此功率調度算法雖然可以降低用戶的能耗開支，卻也同時降低了用戶的用電滿意度。As can be seen from the above description, Demand Response (DR) plays an important role in the energy management system. In view of this, Literature 1 proposes a demand-response-based optimization power scheduling method that can be applied in energy management systems. Herein, document one refers to Zhao et.al, “An optimal power scheduling method for demand response in home energy management system,” IEEE Trans. Smart Grid, vol. 4, no. 3, pp. 1391–1400, Sep . 2013. The method combines real-time pricing and inclining block rate (IBR) to design a power scheduling algorithm applied to a main controller of the energy management system. The experimental results confirm that by adopting this power scheduling algorithm, the energy management system can reduce the user's energy consumption expenditure. Unfortunately, although this power scheduling algorithm can reduce the user's energy consumption expenditure, it also reduces the user's satisfaction with electricity consumption.

因此，文獻二提出了兼顧用戶的用電滿意度和能耗開支的改善方案。於此，文獻二指的是Althaher et.al, “Automated demand response from home energy management system under dynamic pricing and power and comfort constraints,” IEEE Trans. Smart Grid, vol. 6, no. 4, pp. 1874–1883, Jul. 2015.。其中，由於所述改善方案係依據全天電價、可再生能源裝置的電量以及用戶需求而制定出最佳的用電排程。然而，在某些情況下，可能無法提前知道某一日的全天電價以及可再生能源裝置在該日所能夠產生的電量。換句話說，文獻二的改善方案所需要的數據具有不確定性，以致於限制了其實際應用的可行性。Therefore, Literature 2 proposes an improvement plan that takes into account the user's satisfaction with electricity consumption and energy consumption expenditure. Reference 2 here refers to Althaher et.al, “Automated demand response from home energy management system under dynamic pricing and power and comfort constraints,” IEEE Trans. Smart Grid, vol. 6, no. 4, pp. 1874– 1883, Jul. 2015. Wherein, since the improvement plan is based on the all-day electricity price, the electricity quantity of the renewable energy device and the user's demand, the best electricity consumption schedule is formulated. However, in some cases, it may not be possible to know in advance the price of electricity throughout the day on a given day and the amount of electricity that the renewable energy installation will be able to generate on that day. In other words, the data required by the improvement plan of Reference 2 is uncertain, which limits the feasibility of its practical application.

為了處理前述不確定的環境，機器學習技術於是被應用於能源管理系統。其中，強化學習(Reinforcement Learning, RL)即為運用有限信息解決未知環境的一種決策算法。舉例而言，文獻三提出了一種使用強化學習的電設備需量響應方法。於此，文獻三指的是Wen, et.al, “Optimal demand response using devicebased Reinforcement learning,” IEEE Trans. Smart Grid, vol. 6, no. 5, pp. 2312–2324, Sep. 2015。進一步地，文獻四提出應用於家庭能源管理的一種使用強化學習和類神經網路之需量響應方法。文獻四指的是Lu, et.al, “Demand response for home energy management using reinforcement learning and artificial neural network,” IEEE Trans. Smart Grid, vol. 10, no. 6, pp. 6629–6639, Nov. 2019。To deal with the aforementioned uncertain environment, machine learning techniques are then applied to energy management systems. Among them, Reinforcement Learning (RL) is a decision-making algorithm that uses limited information to solve the unknown environment. For example, Reference 3 proposes a demand response method for electrical equipment using reinforcement learning. Here, document three refers to Wen, et.al, “Optimal demand response using devicebased Reinforcement learning,” IEEE Trans. Smart Grid, vol. 6, no. 5, pp. 2312–2324, Sep. 2015. Further, Reference 4 proposes a demand response method using reinforcement learning and neural-like networks for home energy management. Reference 4 refers to Lu, et.al, “Demand response for home energy management using reinforcement learning and artificial neural network,” IEEE Trans. Smart Grid, vol. 10, no. 6, pp. 6629–6639, Nov. 2019 .

在文獻四的需量響應方法中，係使用權重和轉換(weighted-sum transformation)將需量響應表述為單一目標強化學習(Single Objective Reinforcement Learning, SORL)算法。其中，由於權重變化係基於用戶偏好，故其會隨時間變化。在權重依用戶偏好改變之後，強化學習產生最佳用電排程的速度取決於環境(如:數據)，且其必須歷經一段時間的重新訓練才能夠完成最佳用電排程。更重要的是，文獻四的方法只能夠針對單一目標(即，電設備)的需量響應進行強化學習而後產生對應的最佳用電排程。In the demand response method in Reference 4, the demand response is expressed as a Single Objective Reinforcement Learning (SORL) algorithm using weighted-sum transformation. Among them, since the weight change is based on user preference, it will change over time. After the weights are changed according to the user's preference, the speed at which reinforcement learning generates the optimal electricity schedule depends on the environment (eg, data), and it must be retrained for a period of time to complete the optimal electricity schedule. More importantly, the method in Reference 4 can only perform reinforcement learning on the demand response of a single target (ie, electrical equipment) and then generate the corresponding optimal electricity consumption schedule.

由上述說明可知，習知的應用在能源管理系統之中的基於電設備或用戶用電之需量響應方法仍具有加以改善的空間。有鑑於此，本案之發明人係極力加以研究發明，而終於研發完成一種利用多目標強化學習之能源管理裝置及方法。As can be seen from the above description, there is still room for improvement in the conventional demand response method based on electrical equipment or user power consumption applied in the energy management system. In view of this, the inventor of this case has made great efforts to research and invent, and finally developed an energy management device and method using multi-objective reinforcement learning.

本發明之主要目的在於提供一種利用多目標強化學習之能源管理裝置，其耦接於至少一可再生能源裝置、一能源儲存裝置和複數個耗能裝置之間，且具有一第一強化學習單元以及一第二強化學習單元。其中，所述第一強化學習單元依據用戶給定的裝置運作需求(如裝置運作時間、功率)與電力公司之實時電價而執行強化學習運算以獲得包括複數個裝置控制參數的一耗能裝置運作排程。另一方面，所述第二強化學習單元依據電力公司之實時電價與能源儲存裝置之狀態參數而執行強化學習運算以獲得該能源儲存裝置的一充/放電運作排程。最終，所述能源管理裝置依據耗能裝置運作排程控制複數個所述耗能裝置分時運作，且依據充/放電運作排程控制能源儲存裝置之充/放電，從而在用電成本最小化的情況下使得用戶的用電滿意度能夠最大化。The main purpose of the present invention is to provide an energy management device using multi-objective reinforcement learning, which is coupled between at least one renewable energy device, an energy storage device and a plurality of energy consuming devices, and has a first reinforcement learning unit and a second reinforcement learning unit. Wherein, the first reinforcement learning unit performs reinforcement learning operation according to the device operation requirements (such as device operation time, power) given by the user and the real-time electricity price of the power company to obtain an energy-consuming device operation including a plurality of device control parameters schedule. On the other hand, the second reinforcement learning unit performs reinforcement learning operations according to the real-time electricity price of the power company and the state parameters of the energy storage device to obtain a charging/discharging operation schedule of the energy storage device. Finally, the energy management device controls the time-sharing operation of a plurality of the energy consuming devices according to the energy consuming device operation schedule, and controls the charging/discharging of the energy storage device according to the charging/discharging operation schedule, so as to minimize the cost of electricity consumption Under the circumstance, the user's satisfaction with electricity consumption can be maximized.

為達成上述目的，本發明提出所述利用多目標強化學習之能源管理裝置的一實施例，其耦接於至少一可再生能源裝置、一能源儲存裝置以及複數個耗能裝置之間，且包括：一控制單元；一用戶介面，耦接該控制單元，用以供用戶操作以輸入複數個裝置運作需求；一資訊提供單元，耦接該控制單元，用以提供至少一能源儲存裝置狀態資訊以及一電價資訊予所述控制單元；一第一強化學習單元，用以基於該控制單元之控制而依據複數個所述裝置運作需求和所述電價資訊執行一第一強化學習運算，從而獲得包含複數個裝置控制參數的一耗能裝置運作排程，使該控制單元依據所述耗能裝置運作排程控制複數個所述耗能裝置進行分時動作；以及一第二強化學習單元，用以基於該控制單元之控制而依據所述能源儲存裝置資訊和所述電價資訊執行一第二強化學習運算，從而獲得包含一充電控制參數和一放電控制參數的一充/放電運作排程，使該控制單元依據所述充/放電運作排程控制該能源儲存裝置進行分時動作。 In order to achieve the above object, the present invention proposes an embodiment of the energy management device using multi-objective reinforcement learning, which is coupled between at least one renewable energy device, an energy storage device, and a plurality of energy consumption devices, and includes: : a control unit; a user interface, coupled to the control unit, for the user to operate to input operation requirements of a plurality of devices; an information providing unit, coupled to the control unit, for providing at least one energy storage device status information and an electricity price information to the control unit; a first reinforcement learning unit for performing a first reinforcement learning operation according to a plurality of the device operation requirements and the electricity price information based on the control of the control unit, so as to obtain an energy-consuming device including a plurality of device control parameters an operation schedule, so that the control unit controls a plurality of the energy-consuming devices to perform time-sharing operations according to the energy-consuming device operation schedule; and a second reinforcement learning unit for performing a second reinforcement learning operation according to the energy storage device information and the electricity price information based on the control of the control unit, so as to obtain a charging control parameter and a discharging control parameter The charging/discharging operation schedule enables the control unit to control the energy storage device to perform time-sharing operations according to the charging/discharging operation schedule.

並且，本發明同時提出一種利用多目標強化學習之能源管理方法，其係由耦接於至少一可再生能源裝置、一能源儲存裝置以及複數個耗能裝置之間的一能源管理裝置實現，且包括以下步驟： (1)接收用戶所輸入的複數個裝置運作需求，且取得至少一能源儲存裝置狀態資訊以及一電價資訊； (2)依據複數個所述裝置運作需求和所述電價資訊執行一第一強化學習運算以獲得包含複數個裝置控制參數的一耗能裝置運作排程，且依據所述能源儲存裝置資訊和所述電價資訊執行一第二強化學習運算以獲得包含一充電控制參數和一放電控制參數的一充/放電運作排程；以及 (3)依據所述耗能裝置運作排程控制控制複數個所述耗能裝置進行分時動作，且依據所述充/放電運作排程控制該能源儲存裝置進行分時動作。 Furthermore, the present invention also provides an energy management method using multi-objective reinforcement learning, which is realized by an energy management device coupled between at least one renewable energy device, an energy storage device, and a plurality of energy consuming devices, and Include the following steps: (1) Receive a plurality of device operation requirements input by the user, and obtain at least one energy storage device status information and an electricity price information; (2) Execute a first reinforcement learning operation according to a plurality of the device operation requirements and the electricity price information to obtain an energy consumption device operation schedule including a plurality of device control parameters, and according to the energy storage device information and all performing a second reinforcement learning operation on the electricity price information to obtain a charge/discharge operation schedule including a charge control parameter and a discharge control parameter; and (3) Controlling a plurality of the energy consuming devices to perform time-sharing operations according to the energy-consuming device operation schedule, and controlling the energy storage device to perform time-sharing operations according to the charging/discharging operation schedule.

在一實施例中，所述第一強化學習運算係利用以下數學式(1)、(2)和(3)完成： (1)

； (2)

； (3)

；其中，

、

、

、

、

、和

皆為Q值，其係依據所述耗能裝置之狀態與動作而產生的數值；其中，α為學習率，

為與用電費用對應的獎勵值，

為與用戶滿意度對應的獎勵值，

為折扣率，

為所述耗能裝置於當前時間區間之狀態，

為所述耗能裝置於下一時間區間之狀態，

為所述耗能裝置於當前時間區間的最佳動作，

為所述耗能裝置於當前時間區間之動作，

為所述耗能裝置於下一時間區間之的動作，a為所述耗能裝置所進行的動作， A為所述耗能裝置之一動作集合，

為與用電費用對應的權重值，且

為與用戶滿意度對應的權重值，且

。 In one embodiment, the first reinforcement learning operation is performed using the following equations (1), (2) and (3): (1)

; (2)

; (3)

; in,

,

,and

are all Q values, which are values generated according to the state and action of the energy-consuming device; where α is the learning rate,

is the reward value corresponding to the electricity cost,

is the reward value corresponding to user satisfaction,

is the discount rate,

is the state of the energy-consuming device in the current time interval,

is the state of the energy-consuming device in the next time interval,

is the best action of the energy-consuming device in the current time interval,

is the action of the energy-consuming device in the current time interval,

is the action of the energy-consuming device in the next time interval, a is the action performed by the energy-consuming device, A is an action set of the energy-consuming device,

is the weight value corresponding to the electricity cost, and

is the weight value corresponding to user satisfaction, and

.

在一實施例中，所述第二強化學習運算係利用以下數學式(4)和(5)完成： (4)

； (5)

；其中，

和

為Q值，其係依據所述能源儲存裝置之狀態與動作而產生的數值；其中，α為學習率，

為獎勵值，

為所述能源儲存裝置於當前時間區間之狀態，

為所述能源儲存裝置於下一時間區間之狀態，

為所述能源儲存裝置於當前時間區間的最佳動作，

為所述能源儲存裝置於當前時間區間之動作，a為所述能源儲存裝置所進行的動作，且

為所述能源儲存裝置之一動作集合。 In one embodiment, the second reinforcement learning operation is performed using the following equations (4) and (5): (4)

; (5)

; in,

and

is the Q value, which is a value generated according to the state and action of the energy storage device; wherein, α is the learning rate,

is the reward value,

is the state of the energy storage device in the current time interval,

is the state of the energy storage device in the next time interval,

is the best action of the energy storage device in the current time interval,

is the action of the energy storage device in the current time interval, a is the action performed by the energy storage device, and

An action set for the energy storage device.

在一實施例中，該複數個耗能裝置包括：不可彈性調整運作模式之至少一第一耗能裝置、運作時間可彈性調整之至少一第二耗能裝置、以及運作功率可彈性調整之至少一第三耗能裝置。In one embodiment, the plurality of energy consuming devices include: at least one first energy consuming device whose operating mode cannot be flexibly adjusted, at least one second energy consuming device whose operating time can be flexibly adjusted, and at least one energy consuming device whose operating power can be flexibly adjusted. A third energy consuming device.

在一實施例中，對於所述第二耗能裝置而言，其用於上式(1)和上式(2)之獎勵值係利用下式(1-1)和(1-2)計算獲得： (1-1)

； (1-2)

；其中，

為當前時間區間的實時電價(real time pricing (RTP) tariff)，

為所述第二耗能裝置於當前時間區間的能耗，

為對應於所述第二耗能裝置之運作排程的使用者不滿意度。 In one embodiment, for the second energy consuming device, the reward value for the above formula (1) and the above formula (2) is calculated by using the following formulae (1-1) and (1-2) Gain: (1-1)

; (1-2)

; in,

is the real time pricing (RTP) tariff for the current time interval,

is the energy consumption of the second energy-consuming device in the current time interval,

User dissatisfaction corresponding to the operation schedule of the second energy consuming device.

在一實施例中，對於所述第三耗能裝置而言，其用於上式(1)和上式(2)之獎勵值係利用下式(1-1)和(1-2)計算獲得： (1-1)

； (1-2)

；其中，

為當前時間區間的實時電價(real time pricing (RTP) tariff)，

為所述第三耗能裝置於當前時間區間的能耗，且

為對應於所述第三耗能裝置之運作排程的使用者不滿意度。 In one embodiment, for the third energy consuming device, the reward value used for the above formula (1) and the above formula (2) is calculated by the following formulas (1-1) and (1-2) Gain: (1-1)

; (1-2)

; in,

is the real time pricing (RTP) tariff for the current time interval,

is the energy consumption of the third energy-consuming device in the current time interval, and

User dissatisfaction corresponding to the operation schedule of the third energy consuming device.

在一實施例中，該能源儲存裝置為電池，且所述可再生能源裝置為下列任一者：太陽光能發電裝置、太陽熱能發電裝置、風力發電裝置、火力發電裝置、或水力發電裝置。In one embodiment, the energy storage device is a battery, and the renewable energy device is any one of the following: a solar power generation device, a solar thermal power generation device, a wind power generation device, a thermal power generation device, or a hydroelectric power generation device.

在一實施例中，其中，所述能源管理裝置係透過一電池管理裝置取得所述能源儲存裝置狀態資訊，且該能源儲存裝置狀態資訊包括：電池容量、電池充電效率、電池放電效率、以及漏電參數。In one embodiment, the energy management device obtains the state information of the energy storage device through a battery management device, and the state information of the energy storage device includes: battery capacity, battery charging efficiency, battery discharging efficiency, and leakage current parameter.

在一實施例中，所述能源管理裝置係透過一電池管理裝置控制該能源儲存裝置進行分時動作。In one embodiment, the energy management device controls the energy storage device to perform time-sharing operations through a battery management device.

在一實施例中，該複數個裝置運作需求包括：運作時間區間、持續運作時間、運作功率、最高運作功率、以及最低運作功率。In one embodiment, the plurality of device operation requirements include: an operation time interval, a continuous operation time, an operation power, a maximum operation power, and a minimum operation power.

為使貴審查委員能進一步瞭解本發明之結構、特徵、目的、與其優點，茲附以圖式及較佳具體實施例之詳細說明如後。In order to enable your examiners to further understand the structure, characteristics, purpose, and advantages of the present invention, drawings and detailed descriptions of preferred embodiments are attached as follows.

請參閱圖1，其顯示本發明之一種利用多目標強化學習之能源管理系統的立體圖。並且，請同時參閱圖2，其顯示本發明之利用多目標強化學習之能源管理系統的方塊圖。如圖1與圖2所示，本發明之利用多目標強化學習之能源管理裝置1(下文簡稱“能源管理裝置1”)耦接於至少一可再生能源裝置2、一能源儲存裝置3以及複數個耗能裝置(5a, 5b, 5c)之間。特別說明的是，雖然圖1繪示該可再生能源裝置2為一太陽光能發電裝置(Photovoltaic (PV) device)，然而，在可行的實施例中，所述可再生能源裝置2也可以是太陽熱能發電裝置、風力發電裝置、火力發電裝置、或水力發電裝置。另一方面，圖1繪示該能源儲存裝置3為電池。Please refer to FIG. 1 , which shows a perspective view of an energy management system utilizing multi-objective reinforcement learning of the present invention. Also, please refer to FIG. 2 , which shows a block diagram of an energy management system utilizing multi-objective reinforcement learning of the present invention. As shown in FIG. 1 and FIG. 2 , the energy management device 1 using multi-objective reinforcement learning of the present invention (hereinafter referred to as “energy management device 1 ”) is coupled to at least one renewable energy device 2 , an energy storage device 3 and a plurality of between two energy consumers (5a, 5b, 5c). It is particularly noted that although FIG. 1 shows that the renewable energy device 2 is a photovoltaic (Photovoltaic (PV) device), in a feasible embodiment, the renewable energy device 2 may also be Solar thermal power plant, wind power plant, thermal power plant, or hydroelectric power plant. On the other hand, FIG. 1 shows that the energy storage device 3 is a battery.

如圖1與圖2所示，該能源管理裝置1包括：一控制單元10、一用戶介面11、一資訊提供單元12、一第一強化學習單元13、以及一第二強化學習單元14。在一實施例中，該控制單元10可為一處理器，且該第一強化學習單元13和該第二強化學習單元14係以韌體、函式庫、變數、或運算元的形式被建立於該處理器之中。在另一實施例中，該能源管理裝置1具有一作業系統，且該第一強化學習單元13和該第二強化學習單元14以應用軟體的形式安裝在該作業系統內，用以依據該控制單元10的控制而啟用。如圖1與圖2所示，該用戶介面11，例如一觸控螢幕，其係耦接該控制單元10且用以供用戶操作以輸入複數個裝置運作需求。當然，在用戶的操作下，用戶介面11也可以顯示取該可再生能源裝置2、該能源儲存裝置3及/或該複數個耗能裝置(5a, 5b, 5c)的當前資訊。As shown in FIG. 1 and FIG. 2 , the energy management device 1 includes: a control unit 10 , a user interface 11 , an information providing unit 12 , a first reinforcement learning unit 13 , and a second reinforcement learning unit 14 . In one embodiment, the control unit 10 can be a processor, and the first reinforcement learning unit 13 and the second reinforcement learning unit 14 are built in the form of firmware, library, variables, or operands in the processor. In another embodiment, the energy management device 1 has an operating system, and the first reinforcement learning unit 13 and the second reinforcement learning unit 14 are installed in the operating system in the form of application software for controlling according to the control control of the unit 10 is enabled. As shown in FIG. 1 and FIG. 2 , the user interface 11 , such as a touch screen, is coupled to the control unit 10 and used for the user to operate to input a plurality of device operation requirements. Of course, under the operation of the user, the user interface 11 can also display the current information of the renewable energy device 2, the energy storage device 3 and/or the plurality of energy consuming devices (5a, 5b, 5c).

更詳細地說明，該資訊提供單元12耦接該控制單元10，用以提供至少一能源儲存裝置狀態資訊以及一電價資訊予所述控制單元10。應可理解，所述資訊提供單元12為一資料傳/收介面，其可透過一電池管理裝置4取得該能源儲存裝置3(如電池)之有關能源儲存裝置狀態資訊，包括：電池容量、電池充電效率、電池放電效率、以及漏電參數，從而將前述之能源儲存裝置狀態資訊提供給控制單元10。另一方面，該資訊提供單元12(即，資料傳/收介面)亦可自電力公司處取得實時電價(real time pricing, RTP)表，從而提供一電價資訊予控制單元10。More specifically, the information providing unit 12 is coupled to the control unit 10 for providing at least one energy storage device status information and an electricity price information to the control unit 10 . It should be understood that the information providing unit 12 is a data transmission/reception interface, which can obtain relevant energy storage device status information of the energy storage device 3 (such as a battery) through a battery management device 4, including: battery capacity, battery The charging efficiency, the battery discharging efficiency, and the leakage parameters are used to provide the aforementioned energy storage device status information to the control unit 10 . On the other hand, the information providing unit 12 (ie, the data transmission/reception interface) can also obtain a real time pricing (RTP) table from the power company, so as to provide electricity price information to the control unit 10 .

更詳細地說明，依據該控制單元10之控制，該第一強化學習單元13利用Q-learning法而基於複數個所述裝置運作需求與所述電價資訊執行一第一強化學習運算，從而獲得含有複數個裝置控制參數的一耗能裝置運作排程，使該控制單元10依據該耗能裝置運作排程控制複數個所述耗能裝置(5a, 5b, 5c)進行分時動作。其中，該第一強化學習運算13係利用以下數學式(1)、(2)和(3)完成所述第一強化學習運算： (1)

； (2)

；以及 (3)

。 In more detail, according to the control of the control unit 10, the first reinforcement learning unit 13 uses the Q-learning method to perform a first reinforcement learning operation based on a plurality of the device operation requirements and the electricity price information, so as to obtain a An energy-consuming device operation schedule of a plurality of device control parameters, so that the control unit 10 controls a plurality of the energy-consuming devices (5a, 5b, 5c) to perform time-sharing operations according to the energy-consuming device operation schedule. Wherein, the first reinforcement learning operation 13 uses the following mathematical formulas (1), (2) and (3) to complete the first reinforcement learning operation: (1)

; (2)

; and (3)

.

於上式(1)、(2)和(3)之中，

、

、和

皆為Q值，其係依據所述耗能裝置之狀態與動作而產生的數值。熟悉Q-learning法的演算法工程師必然知道，在強化學習的過程中，會先基於狀態-動作對(state-action pair)產生所謂的Q表。實際上，Q表為包含M×N個Q值的矩陣，包括：

、……、

，而強化學習便是利用獎勵值和學習規則不斷地訓練Q值，直至耗能裝置的最佳動作(即，

)被找出。另一方面，α為學習率，

為與用電費用(即，實時電價)對應的獎勵值，

為與用戶滿意度對應的獎勵值，

為折扣率，

為所述耗能裝置於當前時間區間之狀態，

為所述耗能裝置於下一時間區間之狀態，

為所述耗能裝置於當前時間區間的最佳動作，

為所述耗能裝置於當前時間區間之動作，

為與用電費用對應的權重值，且

為與用戶滿意度對應的權重值，且

。 In the above formulas (1), (2) and (3),

,

,and

All are Q values, which are values generated according to the state and action of the energy consuming device. Algorithm engineers who are familiar with the Q-learning method must know that in the process of reinforcement learning, the so-called Q-table is first generated based on the state-action pair. In fact, the Q table is a matrix containing M × N Q values, including:

,……,

, and reinforcement learning is to use the reward value and learning rules to continuously train the Q value until the optimal action of the energy-consuming device (ie,

) was found. On the other hand, α is the learning rate,

is the reward value corresponding to the electricity cost (ie, the real-time electricity price),

is the reward value corresponding to user satisfaction,

is the discount rate,

is the state of the energy-consuming device in the current time interval,

is the state of the energy-consuming device in the next time interval,

is the best action of the energy-consuming device in the current time interval,

is the action of the energy-consuming device in the current time interval,

is the weight value corresponding to the electricity cost, and

is the weight value corresponding to user satisfaction, and

.

值得加以解釋的是，h指的是當前時間區間(time slot)，若將1天分為24個時間區間，則h∈{1,2,………,23,24}，且h+1為下一個時間區間。另一方面，α∈[0, 1]，且

∈[0, 1]。 It is worth explaining that h refers to the current time slot. If one day is divided into 24 time slots, then h∈{1,2,......,23,24}, and h+1 for the next time interval. On the other hand, α∈[0, 1], and

∈[0, 1].

本發明之目的係在用電成本最小化的情況下使得用戶的電滿意度能夠最大化。因此，必須設立兩個Q表，即

和

。於式(1)中，

為與用電費用(即，實時電價)對應的獎勵值，亦即，式(1)係基於依據用電費用所設計出的獎勵方案從而使用epsilon-greedy算法(ε-greedy algorithm)進行耗能裝置的動作選擇。當然，所述動作係依用戶輸入的裝置運作需求以及裝置類別而有所不同。另一方面，於式(2)中，

為與用戶滿意度對應的獎勵值，亦即，式(2)係基於依據用戶滿意度所設計出的獎勵方案從而使用ε-greedy算法進行耗能裝置的動作選擇。最終，在使用Q-learning法完成

和

(即，兩個Q表)的建立後，式(3)即用於選出耗能裝置(5a, 5b, 5c)於當前時間區間的最佳動作。 The purpose of the present invention is to maximize the user's electrical satisfaction while minimizing the electricity cost. Therefore, two Q tables must be established, namely

and

. In formula (1),

is the reward value corresponding to the electricity cost (ie, the real-time electricity price), that is, the formula (1) is based on the reward scheme designed according to the electricity cost, so that the epsilon-greedy algorithm (ε-greedy algorithm) is used for energy consumption Device action selection. Of course, the actions are different according to the operation requirements of the device input by the user and the type of the device. On the other hand, in formula (2),

is the reward value corresponding to the user's satisfaction, that is, formula (2) is based on the reward scheme designed according to the user's satisfaction, so that the ε-greedy algorithm is used to select the action of the energy-consuming device. Finally, after using the Q-learning method to complete

and

After the establishment of the two Q tables (that is, the two Q tables), the formula (3) is used to select the best action of the energy consuming devices (5a, 5b, 5c) in the current time interval.

依據本發明之設計，該複數個耗能裝置(5a, 5b, 5c)包括：不可彈性調整運作模式之至少一第一耗能裝置(Inflexible appliance)5a、運作時間可彈性調整之至少一第二耗能裝置(Time-flexible appliance)5b、以及運作功率可彈性調整之至少一第三耗能裝置(Power-flexible appliance)5c。舉例而言，對於一般用戶而言，電腦主機和電冰箱屬於不可彈性調整運作模式之第一耗能裝置5a，洗衣機屬於運作時間可彈性調整之第二耗能裝置5b，而空調裝置則屬於運作功率可彈性調整之第三耗能裝置5c。進一步地，可利用下式(a)、式(b)以及式(c)分別表示三種耗能裝置一整天的能耗(energy consumption)： (a)

； (b)

；以及 (c)

。 According to the design of the present invention, the plurality of energy consuming devices (5a, 5b, 5c) include: at least one first energy consuming device (Inflexible appliance) 5a whose operation mode cannot be flexibly adjusted, and at least one second energy consuming device 5a whose operation time can be flexibly adjusted A time-flexible appliance 5b and at least one third power-flexible appliance 5c whose operating power can be flexibly adjusted. For example, for ordinary users, the computer host and the refrigerator belong to the first energy consuming device 5a whose operation mode cannot be flexibly adjusted, the washing machine belongs to the second energy consuming device 5b whose operation time can be flexibly adjusted, and the air conditioner belongs to the operation mode. The third energy consumption device 5c whose power can be flexibly adjusted. Further, the following formula (a), formula (b) and formula (c) can be used to respectively represent the energy consumption of the three energy-consuming devices for a whole day: (a)

; (b)

; and (c)

.

如於上式(1)、(2)和(3)之中，n表示為耗能裝置的編號，且N為最大數量。並且，h表示為時間區間，且H為最大值(如24)。進一步地，可基於裝置能耗設計出對應三種耗能裝置的用戶不滿意度之數學運算式，如下式(b1)、(c1)所示： (b1)

；以及 (c1)

。 As in the above equations (1), (2) and (3), n represents the number of the energy consuming device, and N is the maximum number. And, h is represented as a time interval, and H is the maximum value (eg, 24). Further, based on the energy consumption of the device, a mathematical expression corresponding to the user's dissatisfaction of the three energy-consuming devices can be designed, as shown in the following formulas (b1) and (c1): (b1)

; and (c1)

.

依據不同種類的耗能裝置(5a, 5b, 5c)，用戶在輸入所述裝置運作需求之時，該用戶介面11會引導用戶輸入運作時間區間(Requested time range)、持續運作時間(Operation duration)、運作功率(Operation power)、最高運作功率、以及最低運作功率。於上式(b1)和(c1)之中，

為用戶對於第二耗能裝置5b的用戶不滿意度，且

為用戶對於第三耗能裝置5c的用戶不滿意度。應可注意到，由於第一耗能裝置5a不具運作可調整性，因此本發明之能源管理裝置1無須對其運作模式進行分時控制，自然不會有所謂的用戶不滿意度產生。更詳細地說明，

為與第二耗能裝置5b有關的不滿意度參數，

為與第三耗能裝置5c有關的不滿意度參數，且兩個不滿意度參數的數值皆介於0和1之間。再者，

為本發明之能源管理裝置1控制第二耗能裝置5b啟動之時間，且

收到用戶所輸入的裝置運作需求之時間。更詳細地說明，由於第三耗能裝置5c為運作功率可調之裝置，因此

指的是用戶輸入的最高運作功率(即，能耗)。 According to different types of energy-consuming devices (5a, 5b, 5c), when the user inputs the operation requirements of the device, the user interface 11 will guide the user to input the operation time range (Requested time range), continuous operation time (Operation duration) , Operation power, maximum operating power, and minimum operating power. In the above formulas (b1) and (c1),

is the user's dissatisfaction with the user of the second energy-consuming device 5b, and

It is the user's dissatisfaction with the user of the third energy consuming device 5c. It should be noted that since the first energy consuming device 5a does not have operation adjustability, the energy management device 1 of the present invention does not need to perform time-division control on its operation mode, and naturally there will be no so-called user dissatisfaction. In more detail,

is the dissatisfaction parameter related to the second energy consuming device 5b,

is the dissatisfaction parameter related to the third energy consuming device 5c, and the values of the two dissatisfaction parameters are both between 0 and 1. Furthermore,

is the time when the energy management device 1 of the present invention controls the activation time of the second energy consumption device 5b, and

The time when the device operation request entered by the user is received. In more detail, since the third energy-consuming device 5c is a device with adjustable operating power,

Refers to the maximum operating power (ie, energy consumption) entered by the user.

利用上式(b1)和(c1)計算出用戶不滿意度之後，便可繼續地計算出用於上式(1)和上式(2)之獎勵值。對於所述第二耗能裝置5b而言，其用於上式(1)和上式(2)之獎勵值係利用下式(1-1)和(1-2)計算獲得： (1-1)

；以及 (1-2)

。 After the user's dissatisfaction is calculated using the above equations (b1) and (c1), the reward value for the above equations (1) and (2) can be continuously calculated. For the second energy consuming device 5b, the reward value used in the above formula (1) and the above formula (2) is calculated by the following formulas (1-1) and (1-2): (1- 1)

; and (1-2)

.

於上式(1-1)和(1-2)之中，

為當前時間區間的實時電價(real time pricing (RTP) tariff)，

為所述第二耗能裝置於當前時間區間的能耗，

為對應於所述第二耗能裝置之運作排程的使用者不滿意度。 In the above formulas (1-1) and (1-2),

is the real time pricing (RTP) tariff for the current time interval,

繼續地參閱圖1與圖2。該第二強化學習單元14用以基於該控制單元10之控制而依據所述能源儲存裝置資訊和所述電價資訊執行一第二強化學習運算，從而獲得包含一充電控制參數和一放電控制參數的一充/放電運作排程，使該控制單元10依據所述充/放電運作排程控制該能源儲存裝置3進行分時動作。其中，該第二強化學習單元14係利用以下數學式(4)和(5)完成所述第二強化學習運算： (4)

；以及 (5)

。 Continue to refer to FIGS. 1 and 2 . The second reinforcement learning unit 14 is configured to perform a second reinforcement learning operation according to the energy storage device information and the electricity price information based on the control of the control unit 10, so as to obtain a charging control parameter and a discharging control parameter. A charge/discharge operation schedule enables the control unit 10 to control the energy storage device 3 to perform time-sharing operations according to the charge/discharge operation schedule. Wherein, the second reinforcement learning unit 14 uses the following mathematical formulas (4) and (5) to complete the second reinforcement learning operation: (4)

; and (5)

.

於上式(4)和式(5)中，

和

為Q值，其係依據所述能源儲存裝置3之狀態與動作而產生的數值。另一方面，

為與能源儲存裝置3所採取動作有關的獎勵值，

為所述能源儲存裝置3於當前時間區間之狀態，

為所述能源儲存裝置3於下一時間區間之狀態，

為所述能源儲存裝置3於當前時間區間的最佳動作，

為所述能源儲存裝置3於當前時間區間之動作，a為所述能源儲存裝置3所採取執行的動作，且

為所述能源儲存裝置3之一動作集合。 In the above formula (4) and formula (5),

and

is the Q value, which is a value generated according to the state and action of the energy storage device 3 . on the other hand,

is the reward value associated with the action taken by the energy storage device 3,

is the state of the energy storage device 3 in the current time interval,

is the state of the energy storage device 3 in the next time interval,

is the best action of the energy storage device 3 in the current time interval,

is the action of the energy storage device 3 in the current time interval, a is the action taken by the energy storage device 3, and

It is an action set of the energy storage device 3 .

對於能源儲存裝置3而言，其用於上式(4)之獎勵值係利用下式(4-1)計算獲得：

。其中，

為電池的充/放電功率。應可理解，可再生能源裝置2(如PV裝置)與能源儲存裝置3(如電池)之間通常具有一電池管理裝置4，用以管理該可再生能源裝置2向電池充電，且管理該能源儲存裝置3向各種裝置提供所需電力。因此，在所述充/放電運作排程產生之後，所述能源管理裝置1係透過電池管理裝置4而依據該充/放電運作排程控制該能源儲存裝置3進行分時動作。即，在指定的時間區間(requested time range)內進行充電及/或放電。一般而言，為了使得用電成本能夠最小化，本發明之能源管理裝置1會在實時電價便宜的時間區段控制電池進行充電，且在實時電價相對昂貴的時間區段控制電池向各種裝置提供所需電力。 For the energy storage device 3, the reward value used in the above formula (4) is calculated by the following formula (4-1):

. in,

is the charge/discharge power of the battery. It should be understood that there is usually a battery management device 4 between the renewable energy device 2 (such as a PV device) and the energy storage device 3 (such as a battery) to manage the renewable energy device 2 to charge the battery and manage the energy The storage device 3 supplies required power to various devices. Therefore, after the charge/discharge operation schedule is generated, the energy management device 1 controls the energy storage device 3 to perform time-sharing operations through the battery management device 4 according to the charge/discharge operation schedule. That is, charging and/or discharging is performed within a specified time range (requested time range). Generally speaking, in order to minimize the cost of electricity consumption, the energy management device 1 of the present invention controls the battery to be charged during the time period when the real-time electricity price is cheap, and controls the battery to supply various devices during the time period when the real-time electricity price is relatively expensive. required power.

本發明同時提出一種利用多目標強化學習之能源管理方法，如圖1與圖2所示，本發明之能源管理方法係由耦接於至少一可再生能源裝置2、一能源儲存裝置3以及複數個耗能裝置之間的一能源管理裝置1予以實現。圖3顯示本發明之一種利用多目標強化學習之能源管理方法的流程圖。如圖1、圖2與圖3所示，方法流程首先執行步驟圖S1：接收用戶所輸入的複數個裝置運作需求，且取得至少一能源儲存裝置狀態資訊以及一電價資訊。依據前述說明，資訊提供單元12例如為一資料傳/收介面，其透過電池管理裝置4取得該能源儲存裝置3(如電池)之有關能源儲存裝置狀態資訊，包括：電池容量、電池充電效率、電池放電效率、以及漏電參數，從而將前述之能源儲存裝置狀態資訊提供給控制單元10。並且，該資訊提供單元12(即，資料傳/收介面)亦可自電力公司處取得實時電價(real time pricing, RTP)表，從而提供一電價資訊予控制單元10。The present invention also proposes an energy management method using multi-objective reinforcement learning. As shown in FIG. 1 and FIG. 2 , the energy management method of the present invention is coupled to at least one renewable energy device 2 , an energy storage device 3 and a plurality of An energy management device 1 between each energy consuming device is implemented. FIG. 3 shows a flow chart of an energy management method utilizing multi-objective reinforcement learning of the present invention. As shown in FIG. 1 , FIG. 2 and FIG. 3 , the method flow first executes step S1 : receiving a plurality of device operation requirements input by the user, and obtaining at least one energy storage device status information and electricity price information. According to the above description, the information providing unit 12 is, for example, a data transmission/reception interface, which obtains the relevant energy storage device status information of the energy storage device 3 (eg battery) through the battery management device 4 , including: battery capacity, battery charging efficiency, The battery discharge efficiency and leakage parameters are used to provide the aforementioned energy storage device status information to the control unit 10 . In addition, the information providing unit 12 (ie, the data transmission/reception interface) can also obtain a real time pricing (RTP) table from the power company, so as to provide electricity price information to the control unit 10 .

如圖1、圖2與圖3所示，方法流程接著執行步驟S2：依據複數個所述裝置運作需求和所述電價資訊執行一第一強化學習運算以獲得包含複數個裝置控制參數的一耗能裝置運作排程，且依據所述能源儲存裝置資訊和所述電價資訊執行一第二強化學習運算以獲得包含一充電控制參數和一放電控制參數的一充/放電運作排程。依據前述說明，第一強化學習單元13利用數學式(1)、(2)和(3)完成所述第一強化學習運算，且第二強化學習單元14利用數學式(4)和(5)完成所述第二強化學習運算。As shown in FIG. 1 , FIG. 2 and FIG. 3 , the method flow then executes step S2 : performing a first reinforcement learning operation according to a plurality of the device operation requirements and the electricity price information to obtain a power consumption including a plurality of device control parameters An energy device operation schedule is performed, and a second reinforcement learning operation is performed according to the energy storage device information and the electricity price information to obtain a charge/discharge operation schedule including a charge control parameter and a discharge control parameter. According to the foregoing description, the first reinforcement learning unit 13 uses the mathematical formulas (1), (2) and (3) to complete the first reinforcement learning operation, and the second reinforcement learning unit 14 uses the mathematical formulas (4) and (5) to complete the first reinforcement learning operation. The second reinforcement learning operation is completed.

最終，方法流程接著執行步驟S3：依據所述耗能裝置運作排程控制控制複數個所述耗能裝置(5a, 5b, 5c)進行分時動作，且依據所述充/放電運作排程控制該能源儲存裝置3進行分時動作，從而在用電成本最小化的情況下使得用戶的用電滿意度能夠最大化。Finally, the method flow then executes step S3: controlling a plurality of the energy consuming devices (5a, 5b, 5c) to perform time-sharing operations according to the operation schedule control of the energy consuming devices, and controlling the charging/discharging operation schedule according to the The energy storage device 3 performs time-sharing operations, so that the user's satisfaction with electricity consumption can be maximized while the electricity consumption cost is minimized.

如此，上述已完整且清楚地說明本發明之一種利用多目標強化學習之能源管理裝置。然而，必須加以強調的是，前述本案所揭示者乃為較佳實施例，舉凡局部之變更或修飾而源於本案之技術思想而為熟習該項技藝之人所易於推知者，俱不脫本案之專利權範疇。Thus, the above has completely and clearly described an energy management device utilizing multi-objective reinforcement learning of the present invention. However, it must be emphasized that what is disclosed in the above-mentioned case is a preferred embodiment, and any partial changes or modifications originating from the technical ideas of this case and easy to infer by those who are familiar with the art are all within the scope of this case. the scope of patent rights.

1:能源管理裝置 10:控制單元 11:用戶介面 12:資訊提供單元 13:第一強化學習單元 14:第二強化學習單元 2:可再生能源裝置 3:能源儲存裝置 4:電池管理裝置 5a:第一耗能裝置 5b:第二耗能裝置 5c:第三耗能裝置 S1-S3:步驟 1: Energy management device 10: Control unit 11: User Interface 12: Information provision unit 13: First Reinforcement Learning Unit 14: Second Reinforcement Learning Unit 2: Renewable energy installations 3: Energy Storage Devices 4: Battery management device 5a: The first energy-consuming device 5b: Second energy consumption device 5c: Third energy-consuming device S1-S3: Steps

圖1為本發明之一種利用多目標強化學習之能源管理系統的立體圖；圖2為本發明之利用多目標強化學習之能源管理系統的方塊圖；以及圖3為本發明之一種利用多目標強化學習之能源管理方法的流程圖。 1 is a perspective view of an energy management system utilizing multi-objective reinforcement learning according to the present invention; FIG. 2 is a block diagram of an energy management system utilizing multi-objective reinforcement learning of the present invention; and FIG. 3 is a flow chart of an energy management method utilizing multi-objective reinforcement learning according to the present invention.

1:能源管理裝置 1: Energy management device

2:可再生能源裝置 2: Renewable energy installations

3:能源儲存裝置 3: Energy Storage Devices

5a:第一耗能裝置 5a: The first energy-consuming device

5b:第二耗能裝置 5b: Second energy consumption device

5c:第三耗能裝置 5c: Third energy-consuming device

Claims

An energy management device using multi-objective reinforcement learning, which is coupled between at least one renewable energy device, an energy storage device and a plurality of energy consumption devices, and includes: a control unit; a user interface, coupled to the control a unit for user operation to input operation requirements of a plurality of devices; an information providing unit, coupled to the control unit, for providing at least one energy storage device status information and an electricity price information to the control unit; a first enhancement a learning unit, configured to execute a first reinforcement learning operation according to a plurality of the device operation requirements and the electricity price information based on the control of the control unit, so as to obtain an energy-consuming device operation schedule including a plurality of device control parameters, enabling the control unit to control a plurality of the energy-consuming devices to perform time-sharing operations according to the operation schedule of the energy-consuming devices; and a second reinforcement learning unit configured to rely on the energy storage device information based on the control of the control unit A second reinforcement learning operation is performed with the electricity price information to obtain a charge/discharge operation schedule including a charge control parameter and a discharge control parameter, so that the control unit controls the energy source according to the charge/discharge operation schedule The storage device performs time-sharing operations; wherein, the energy management device is connected to a battery management device with the information providing unit information to obtain the state information of the energy storage device, and the state information of the energy storage device includes: battery capacity , battery charging efficiency, battery discharging efficiency, and leakage parameters.

The energy management device using multi-objective reinforcement learning according to claim 1, wherein the first reinforcement learning unit uses the following mathematical formulas (1), (2) and (3) to complete the first reinforcement learning operation:

Among them, Q ₁ ( _{Sh ,A h} ₎ , Q ₁ ( _Sh+1 ,a), Q ₁ (Sh ,a), Q ₂ ( _Sh ,A _h ), Q ₂ ( _Sh ₊₁ ,a), a), and Q ₂ ( _Sh , a) are all Q values, which are values generated according to the state and action of the energy-consuming device; wherein, α is the learning rate, R _{1, h+1} is the The reward value corresponding to the electricity cost, R _2,h+1 is the reward value corresponding to the user satisfaction, γ is the discount rate, _Sh is the state of the energy-consuming device in the current time interval, and _Sh+1 is the the state of the energy-consuming device in the next time interval,

is the best action of the energy-consuming device in the current time interval, A _h is the action of the energy-consuming device in the current time interval, A _h+1 is the action of the energy-consuming device in the next time interval, a is the action performed by the energy-consuming device, A is an action set of the energy-consuming device, w _1,h is the weight value corresponding to the electricity cost, and w _2,h is the weight corresponding to the user satisfaction value, and w _2,h =1-w _1,h .

The energy management device using multi-objective reinforcement learning according to claim 2, wherein the second reinforcement learning unit uses the following mathematical formulas (4) and (5) to complete the operation of the second reinforcement learning unit:

in,

and

is the reward value,

is the state of the energy storage device in the current time interval,

is the state of the energy storage device in the next time interval,

is the best action of the energy storage device in the current time interval,

is the action of the energy storage device in the current time interval, a is the action performed by the energy storage device, and A _ess is an action set of the energy storage device.

The energy management device using multi-objective reinforcement learning according to claim 1, wherein the plurality of energy consuming devices comprises: at least one first energy consuming device whose operation mode cannot be flexibly adjusted, and at least one first energy consuming device whose operation time can be flexibly adjusted Two energy-consuming devices, and at least one third energy-consuming device whose operating power can be flexibly adjusted.

The energy management device using multi-objective reinforcement learning according to claim 2, wherein, for the second energy-consuming device, the reward value used in the above formula (1) and the above formula (2) is based on the following Formulas (1-1) and (1-2) are calculated to obtain:

in,

is the real-time electricity price in the current time interval,

The energy management device using multi-objective reinforcement learning according to claim 2, wherein, for the third energy-consuming device, the reward value used in the above formula (1) and the above formula (2) is based on the following Formulas (1-1) and (1-2) are calculated to obtain:

in,

is the real-time electricity price in the current time interval,

The energy management device using multi-objective reinforcement learning according to claim 1, wherein the energy storage device is a battery, and the renewable energy device is any one of the following: a solar photovoltaic power generation device, a solar thermal power generation device, Wind power plant, thermal power plant, or hydroelectric power plant.

The energy management device utilizing multi-objective reinforcement learning as claimed in claim 1, wherein the energy management device controls the energy storage device to perform time-sharing actions through a battery management device.

The energy management device utilizing multi-objective reinforcement learning according to claim 1, wherein the plurality of device operation requirements include: operation time interval, continuous operation time, operation power, maximum operation power, and minimum operation power.

An energy management method using multi-objective reinforcement learning is implemented by an energy management device coupled between at least one renewable energy device, an energy storage device and a plurality of energy consuming devices, and includes the following steps: (1) Receive a plurality of device operation requirements input by a user, and obtain at least one energy storage device status information and an electricity price information; wherein, the energy management device is informationally connected to a battery management device, so as to obtain the energy storage device status information, And the state information of the energy storage device includes: battery capacity, battery charging efficiency, battery discharging efficiency, and leakage parameters; (2) performing a first reinforcement learning operation according to a plurality of the device operation requirements and the electricity price information to obtain the An energy consumption device operation schedule of a plurality of device control parameters, and a second reinforcement learning operation is performed according to the energy storage device information and the electricity price information to obtain a charge/discharge control parameter including a charge control parameter and a discharge control parameter and (3) controlling a plurality of the energy consuming devices to perform time-sharing operations according to the energy-consuming device operating schedule, and controlling the energy storage device to perform time-sharing according to the charging/discharging operation schedule action.

The energy management method using multi-objective reinforcement learning according to claim 10, wherein the first reinforcement learning unit operation is completed by using the following mathematical formulas (1), (2) and (3):

The energy management method using multi-objective reinforcement learning according to claim 11, wherein the second reinforcement learning unit operation is completed by using the following mathematical formulas (4) and (5):

in,

and

is the reward value,

is the state of the energy storage device in the current time interval,

is the state of the energy storage device in the next time interval,

is the best action of the energy storage device in the current time interval,

The energy management method using multi-objective reinforcement learning according to claim 10, wherein the plurality of energy-consuming devices comprises: at least one first energy-consuming device whose operation mode cannot be flexibly adjusted, and at least one first energy-consuming device whose operation time can be flexibly adjusted Two energy-consuming devices, and at least one third energy-consuming device whose operating power can be flexibly adjusted.

The energy management method using multi-objective reinforcement learning according to claim 11, wherein, for the second energy-consuming device, the reward value used in the above formula (1) and the above formula (2) is based on the following Formulas (1-1) and (1-2) are calculated to obtain:

in,

is the real-time electricity price in the current time interval,

The energy management method using multi-objective reinforcement learning according to claim 11, wherein, for the third energy-consuming device, the reward value used in the above formula (1) and the above formula (2) is based on the following Formulas (1-1) and (1-2) are calculated to obtain:

in,

is the real-time electricity price in the current time interval,

The energy management method using multi-objective reinforcement learning according to claim 10, wherein the energy storage device is a battery, and the renewable energy device is any one of the following: a solar photovoltaic power generation device, a solar thermal power generation device, Wind power plant, thermal power plant, or hydroelectric power plant.

The energy management method using multi-objective reinforcement learning according to claim 10, wherein the energy management device controls the energy storage device to perform time-sharing actions through a battery management device.

The energy management method using multi-objective reinforcement learning according to claim 10, wherein the plurality of device operation requirements include: operation time interval, continuous operation time, operation power, maximum operation power, and minimum operation power.