TWI705387B - Reinforcement learning system and its servo device and reinforcement learning method - Google Patents

Reinforcement learning system and its servo device and reinforcement learning method

Info

Publication number
TWI705387B
Authority
TW
Taiwan
Prior art keywords
workstation
action
reinforcement learning
model
action set
Prior art date
Application number
TW107146985A
Other languages
Chinese (zh)
Other versions
TW202025006A (en)
Inventor
翁慶昌
李揚漢
陳功瀚
Original Assignee
翁慶昌
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 翁慶昌 filed Critical 翁慶昌
Priority to TW107146985A
Publication of TW202025006A
Application granted
Publication of TWI705387B

Landscapes

  • Feedback Control In General (AREA)

Abstract

A reinforcement learning system includes at least one agent device and a server device. According to a setting condition, the at least one agent device transmits state sets describing the state of its environment, receives action sets to be executed, and returns to the server device, over a network, the feedback messages produced by interacting with the environment. Based on the setting condition, the server device allocates a predetermined proportion of memory space as at least one workstation and loads a selected training model into that workstation. The workstation takes the current state set, action set, and feedback message as parameters, feeds them into the training model, performs reinforcement learning, and produces the next action set, repeating until the goal is achieved. By carrying out training on configured hardware resources located in the cloud, the invention not only improves learning efficiency; its modular design also allows the reinforcement learning architecture to be reused and updated more easily and applied to different agent devices, widening its range of application and practicality.

Description

Reinforcement learning system and its servo device and reinforcement learning method

The invention relates to a reinforcement learning system, and more particularly to a reinforcement learning system that carries out reinforcement learning training, together with its server device and reinforcement learning method.

Reinforcement learning (RL) is arguably the hottest direction in artificial intelligence today; its fame owes much to the DeepMind team's great success in applying it to AlphaGo and AlphaZero.

Reinforcement learning is a style of learning close to human learning, emphasizing how to act based on the environment so as to maximize the expected benefit. Take how a laboratory rat learns to operate a lever to obtain food as an example: the rat is the agent that makes the decisions. In its initial state, having had no chance yet to explore the environment, its behavior is at first random and without goal orientation, until it happens to touch the lever deliberately placed in the environment and, through the action of pulling it, unexpectedly obtains food, which is the reward for that behavior. Driven by the brain's reward mechanism, the rats then adopt a goal-oriented way of learning: to obtain more food rewards, they may gather beside the lever and keep trying until they have learned the correct lever-pulling action.

Because reinforcement learning must act based on the environment, the entire architecture is conventionally placed on the agent device. Not only is learning efficiency limited by the agent's own resources and performance, but, more importantly, every agent must be equipped with all of the hardware architecture and software resources required for reinforcement learning, and once the learning objective changes, the original reinforcement learning architecture cannot be reused or updated. There is therefore still considerable room for improvement in range of application and practicality.

Therefore, an object of the present invention is to provide a reinforcement learning system, together with its server device and reinforcement learning method, that improves learning efficiency as well as the range of application and practicality.

Accordingly, the present invention provides such a reinforcement learning system, its server device, and its reinforcement learning method.

The effect of the present invention is that training is carried out on configured hardware resources located in the cloud. This not only improves learning efficiency; the modular design also allows the reinforcement learning architecture to be reused and updated more easily and applied to different agent devices, widening its range of application and practicality.

Referring to Figure 1 and Figure 2, an embodiment of the reinforcement learning system of the present invention includes a plurality of agent devices 1 and a server device 2.

According to a setting condition 10, each agent device 1 transmits state sets 11 describing the state of its environment, receives action sets 12 for executing actions, and transmits a feedback message 13 produced after interacting with the environment, as well as a stop signal S used to terminate an action. The setting condition 10 is used to achieve a goal. The feedback message 13 includes a reward value R for the preceding action.
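The patent does not specify a message or data format for these exchanges. The following Python sketch merely illustrates, under assumed field names, the pieces of information that travel between an agent device and the server device (setting condition 10, state set 11, action set 12, feedback message 13 with reward value R, and stop signal S).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SettingCondition:          # setting condition 10; all field names are assumptions
    goal: str                    # e.g. "keep the pendulum upright for 200 steps"
    mode: str = "reward"         # "reward" (training) or "estimation" (inference)
    training_model: str = "DQN"  # which model to draw from the training model pool
    memory_ratio: float = 0.10   # share of the memory space 233 to allocate
    num_cores: int = 1           # computing cores 234 requested
    num_workstations: int = 1    # workstations 20 requested

@dataclass
class AgentToServer:             # sent by the agent device 1
    state_set: List[float]                 # state set 11 (observation of the environment)
    reward: Optional[float] = None         # feedback message 13: reward value R
    stop_signal: bool = False              # stop signal S

@dataclass
class ServerToAgent:             # sent by the server device 2
    action_set: List[int]                  # action set 12 to execute
```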

It should be noted that the number of agent devices 1 is not limited to a plurality; in other variations of this embodiment there may be only one.

The server device 2 includes a connection unit 21, a storage unit 22, and a processing unit 23.

In this embodiment, the connection unit 21 communicates with the agent devices 1 by means of communication technology and is used to receive each agent device 1's setting condition 10, state sets 11, and feedback messages 13, and to transmit the action sets 12 to each agent device 1.

In this embodiment, the storage unit 22 is a memory or storage medium and includes a training model pool 221 for storing a number of training models (untrained models), an estimation model pool 222 for storing a number of estimation models (trained models), an action noise model pool 223 for storing an action noise module, and a buffer data model pool 224 for storing a number of buffer data models. In this embodiment the training models include, but are not limited to, a DQN module, a DDPG module, an A3C module, a PPO module, and a Q-learning module. Each estimation model is a training model that has completed training and is able to achieve the goal. The action noise model is used to increase the at least one agent device 1's exploration of the environment and includes, but is not limited to, an ε-greedy module and an Ornstein-Uhlenbeck module. The buffer data models determine how data are accessed and temporarily stored and include, but are not limited to, a replay memory module and a simple buffer module.
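A minimal Python sketch of the four pools in the storage unit 22 is given below. The class and function names are placeholders standing in for the DQN, DDPG, A3C, PPO, Q-learning, ε-greedy, Ornstein-Uhlenbeck, replay-memory, and simple-buffer modules named above; the patent does not tie them to any particular library.

```python
# Placeholder stand-ins for the modules named in the text; no specific RL library is implied.
class DQNModule: ...
class DDPGModule: ...
class A3CModule: ...
class PPOModule: ...
class QLearningModule: ...

training_model_pool = {        # pool 221: training models
    "DQN": DQNModule, "DDPG": DDPGModule, "A3C": A3CModule,
    "PPO": PPOModule, "Q-learning": QLearningModule,
}

estimation_model_pool = {}     # pool 222: later filled with trained models that reached their goal

def epsilon_greedy(action, epsilon=0.1): ...                # pool 223: action noise modules,
def ornstein_uhlenbeck(action, theta=0.15, sigma=0.2): ...  # used to increase exploration

action_noise_model_pool = {"epsilon_greedy": epsilon_greedy,
                           "ornstein_uhlenbeck": ornstein_uhlenbeck}

class ReplayMemory: ...        # pool 224: buffer data models, which decide how samples
class SimpleBuffer: ...        #           are stored and drawn during training

buffer_data_model_pool = {"replay_memory": ReplayMemory, "simple_buffer": SimpleBuffer}
```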

The processing unit 23 includes a compute module 231 and a processing module 232 connected to the connection unit 21, the storage unit 22, and the compute module 231.

In this embodiment, the compute module 231 may consist of one or more central processing units (CPUs), one or more graphics processing units (GPUs), or a combination of one or more CPUs and GPUs. The compute module 231 also has a memory space 233 and a number of computing cores 234.

In this embodiment, the processing module 232 consists of one or more central processing units and, according to each agent device 1's setting condition 10, allocates a predetermined proportion of the memory space 233 and a predetermined number of computing cores 234 as one or more workstations 20 (workers).

Referring to Figure 2 and Figure 3, for ease of explanation the agent devices 1 are numbered 1a, 2a, 3a-1, 3a-2, and 3a-3, corresponding respectively to environments numbered 1a, 2a, 3a-1, 3a-2, and 3a-3 and to workstations 20 numbered 1a, 2a, 3a-1, 3a-2, and 3a-3. Environment 1a is a cart-pole 3: the task is to move the cart-pole 3 left or right so that a pendulum 31 on the cart-pole 3 remains standing without falling over. Its setting condition 10 is shown in Table 1.

Table 1: setting condition 10 for agent device 1a (reproduced as an image, Figure 107146985-A0304-0001, in the original publication; per the text it specifies 10% of the memory space 233, one computing core 234, and one workstation 20).

Referring to Figure 4, Figure 1, and Figure 2, the reinforcement learning method of this embodiment is carried out by the server device 2 through the following steps:

Step 401: Start.

Step 402: A connection with agent device 1a is established through the connection unit 21, so that the server device 2 and agent device 1a can communicate with each other.

Step 403: According to the setting condition 10 of agent device 1a, the processing module 232 allocates a predetermined proportion of the memory space 233 and a predetermined number of computing cores 234 as one or more workstations 20.

It should be noted that the setting condition 10 is not limited to the values of Table 1 (10% of the memory space 233, one computing core 234, and one workstation 20). In other variations of this embodiment, 20%~100% of the memory space 233, P computing cores 234, and K workstations 20 may be used, in which case each workstation 20 uses (20%~100%)/K of the memory space 233, with P ≦ K.
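As an illustration of the allocation rule just described (K workstations sharing the requested memory fraction equally, with P computing cores and P ≦ K), a minimal sketch follows; the function name and the round-robin core assignment are assumptions made only for illustration.

```python
def allocate_workstations(total_memory_bytes: int,
                          memory_fraction: float,  # between 0.2 and 1.0 in the variation above
                          num_cores: int,          # P
                          num_workstations: int):  # K
    """Carve the memory space 233 and computing cores 234 into K workstations 20."""
    if not (0.0 < memory_fraction <= 1.0):
        raise ValueError("memory fraction must lie in (0, 1]")
    if num_cores > num_workstations:
        raise ValueError("the text requires P <= K")
    per_workstation = int(total_memory_bytes * memory_fraction / num_workstations)
    return [{"workstation_id": k,
             "memory_bytes": per_workstation,
             "core_id": k % num_cores}       # simple round-robin sharing of the P cores
            for k in range(num_workstations)]

# Table 1's case for agent device 1a: 10% of memory, one core, one workstation.
print(allocate_workstations(16 * 2**30, 0.10, num_cores=1, num_workstations=1))
```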

Step 404: The processing module 232 generates a unique identification code U for the connected agent device 1 and temporarily stores the identification code U, together with the training model, buffer data model, and action noise model selected in the setting condition 10, in the corresponding workstation 20.

It is worth noting that the identification code U includes a main digit string (a universally unique identifier, UUID) U1 and a secondary digit string (room_id) U2. Different main digit strings U1 represent different agent devices 1 in different environments. Different secondary digit strings U2 represent different agent devices 1 performing different actions in the same environment; if only one agent device 1 performs a single kind of action in an environment, the secondary digit string U2 is omitted.

As shown in Figure 2, environment 1a is the cart-pole 3, and the identification code U of agent device 1a corresponding to environment 1a is /ikSYiGCPfUVSNbs5NmhQHe/. Environment 2a is a brick-breaking game, and the identification code U of agent device 2a in environment 2a is /F6bdVeRR86gsm6rJtLz3af/. Environments 3-1a, 3-2a, and 3-3a are the same obstacle-avoidance space: the identification code U of agent device 3-1a corresponding to environment 3-1a is /CjTYQXbCbdKQ9ssi79Z7C7/01/, that of agent device 3-2a corresponding to environment 3-2a is /CjTYQXbCbdKQ9ssi79Z7C7/02/, and that of agent device 3-3a corresponding to environment 3-3a is /CjTYQXbCbdKQ9ssi79Z7C7/03/.
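The patent shows identification codes of the form /main string/ or /main string/room_id/ but does not state how they are generated. The sketch below derives the main digit string U1 from a UUID, shortened to base-62 purely as an assumption to match the look of the examples, and appends a two-digit secondary string U2 (room_id) only when several agent devices share one environment.

```python
import string
import uuid
from typing import Optional

_ALPHABET = string.digits + string.ascii_letters   # base-62 digits

def _to_base62(n: int) -> str:
    out = ""
    while n:
        n, r = divmod(n, 62)
        out = _ALPHABET[r] + out
    return out or "0"

def make_identification_code(room_id: Optional[int] = None) -> str:
    """Build an identification code U = /U1/ or /U1/U2/."""
    main_string = _to_base62(uuid.uuid4().int)      # main digit string U1
    if room_id is None:                             # only one agent in this environment
        return f"/{main_string}/"
    return f"/{main_string}/{room_id:02d}/"         # secondary digit string U2, e.g. /.../01/

print(make_identification_code())      # e.g. a single cart-pole agent
print(make_identification_code(1))     # one of several agents sharing an environment
```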

Step 405: The processing module 232 transmits the identification code U to agent device 1a through the connection unit 21, so that the state set 11, the action set 12, and the feedback message 13 are passed between the agent device 1 and the corresponding workstation 20 that store the same identification code U, that is, between agent device 1a and workstation 1a.

Step 406: The processing module 232 determines, according to the setting condition 10, whether the reward mode applies. If so, step 500 is performed; if not, step 407 is performed.

Step 407: The processing module 232 determines, according to the setting condition 10, whether the estimation mode applies. If so, step 600 is performed; if not, step 408 is performed.

Step 408: End.

Referring to Figure 5, Figure 2, Figure 6, and Figure 7, step 500 includes:

Step 501: Start.

Step 502: The initial state set 11 is received from agent device 1a through the connection unit 21. The state set 11 is obtained by the agent device 1 observing the environment.

Step 503: Workstation 1a temporarily stores the state set 11 in the memory space 233 and, using the initial state set 11 as a parameter, feeds it through the computing core 234 into the selected training model to produce an initial action set 12.

Step 504: Workstation 1a transmits the action set 12 to agent device 1a through the connection unit 21, so that agent device 1a interacts with environment 1a according to the action set 12 and thereby changes the state of environment 1a. As shown in Figure 3, agent device 1a changes the position and velocity of the cart-pole 3 and the angle and angular velocity of the pendulum 31 by pushing the cart-pole 3 one step to the left or one step to the right.

It is worth noting that the position and velocity of the cart-pole 3 and the angle and angular velocity of the pendulum 31 are obtained by having a sensor (not shown) of the agent device 1 detect and observe the positions of the cart-pole 3 and the pendulum 31, from which their velocities are calculated. The techniques for obtaining these states have been disclosed in the prior art of reinforcement learning and are not technical features claimed in this application; since a person of ordinary skill in the art can infer the details from the above description, they are not described further.

Step 505: Workstation 1a determines whether the stop signal S has been received through the connection unit 21. If so, step 506 is performed; this indicates that the angle of the pendulum 31 has exceeded ±12° and it may have fallen over, that the position of the cart-pole 3 has exceeded ±2.4 m (meters) and left the detectable range, or that the number of steps in this episode has exceeded 200. If not, the determination continues.
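A small sketch of the episode-termination test that triggers the stop signal S in the cart-pole example is shown below. The thresholds are those quoted above (±12°, ±2.4 m, 200 steps), while the function and argument names are assumed.

```python
def should_stop(cart_position_m: float, pole_angle_deg: float, step_count: int) -> bool:
    """True when the agent device should send the stop signal S for this episode."""
    return (abs(pole_angle_deg) > 12.0      # pendulum 31 has probably fallen over
            or abs(cart_position_m) > 2.4   # cart-pole 3 left the detectable range
            or step_count > 200)            # episode length exceeded

assert should_stop(0.0, 13.0, 5)
assert should_stop(2.5, 0.0, 5)
assert should_stop(0.0, 0.0, 201)
assert not should_stop(0.1, 2.0, 50)
```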

Step 506: Workstation 1a receives, through the connection unit 21, the feedback message 13 and the next state set 11 transmitted by agent device 1a, thereby obtaining the reward value R for the preceding action. The next state set 11 is produced by agent device 1a in step 504 after the initial action set 12 has interacted with environment 1a, for example when the position or velocity of the cart-pole 3 changes, or the angle or angular velocity of the pendulum 31 changes.

Step 507: Workstation 1a temporarily stores the current action set 12, the feedback message 13 produced in response to it, and the next state set 11 in the memory space 233, and then, through the computing core 234, feeds the current action set 12, its feedback message 13, and the next state set 11 as parameters into the training model to perform reinforcement learning and produce the next action set 12.

Step 508: Workstation 1a determines, through the computing core 234, whether the cart-pole 3 has exceeded 200 steps in 10 consecutive episodes, thereby reaching the goal. If so, step 509 is performed; if not, the flow returns to step 504.

Step 509: End.
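The reward-mode flow of steps 502 to 509 can be condensed into the following sketch, as run by one workstation. The connection, model, and buffer objects and their method names (recv_state, send_actions, recv_feedback, select_action, store, learn) are assumptions for illustration; the patent does not define these interfaces.

```python
def reward_mode(connection, model, buffer, goal_steps=200, goal_episodes=10):
    """One workstation's reward-mode loop (steps 502-509) for the cart-pole goal."""
    consecutive = 0                                  # episodes exceeding goal_steps in a row
    while consecutive < goal_episodes:               # step 508: goal check
        state = connection.recv_state()              # step 502: (initial) state set 11
        steps, stopped = 0, False
        while not stopped:
            action = model.select_action(state)      # steps 503/507: produce action set 12
            connection.send_actions(action)          # step 504: agent acts on the environment
            reward, next_state, stopped = connection.recv_feedback()  # steps 505-506
            buffer.store(state, action, reward, next_state, stopped)  # step 507: temporary storage
            model.learn(buffer)                      # step 507: reinforcement learning update
            state = next_state
            steps += 1
        consecutive = consecutive + 1 if steps > goal_steps else 0
```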

Referring to Figure 2, Figure 8, and Figure 9, step 600 includes:

Step 601: Start.

Step 602: The state set 11 is received from agent device 1a through the connection unit 21.

Step 603: Workstation 1a temporarily stores the state set 11 in the memory space 233 and, using the state set 11 as a parameter, feeds it through the computing core 234 into the selected estimation model to produce an action set 12.

Step 604: Workstation 1a transmits the action set 12 to agent device 1a through the connection unit 21, so that agent device 1a interacts with environment 1a according to the action set 12 and thereby changes the state of environment 1a.

Step 605: Workstation 1a determines, through the computing core 234, whether the cart-pole 3 has exceeded 200 steps in 10 consecutive episodes, thereby reaching the goal. If so, step 606 is performed; if not, the flow returns to step 602.

Step 606: End.
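The estimation mode (steps 602 to 606) follows the same message flow but loads a trained estimation model and performs no learning update. The sketch below uses the same assumed interface names as the reward-mode sketch above.

```python
def estimation_mode(connection, estimation_model, goal_steps=200, goal_episodes=10):
    """One workstation's estimation-mode loop (steps 602-606)."""
    consecutive = 0
    while consecutive < goal_episodes:                    # step 605: goal check
        state = connection.recv_state()                   # step 602: state set 11
        steps, stopped = 0, False
        while not stopped:
            action = estimation_model.select_action(state)  # step 603: no training update
            connection.send_actions(action)                 # step 604: agent interacts
            _, state, stopped = connection.recv_feedback()
            steps += 1
        consecutive = consecutive + 1 if steps > goal_steps else 0
```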

Referring to Figure 10 and Figure 11, it is also worth noting that in step 404 the training model need not be selected by the setting condition 10. In other variations of this embodiment, it can be determined by the processing module 232 through a sampling algorithm model, with the number of connected agent devices 1 and the number of computing cores 234 as parameters, so that the best training model is found automatically. In this embodiment the sampling algorithm model is the Thompson sampling algorithm, illustrated below with P computing cores 234 and K workstations 20:

Assume K = 4 and P = 1. When automatically searching for the best training model and a suitable workstation 20, the processing module 232 samples with only one workstation 20 and one computing core 234 at a time to obtain an expected value, and then selects the workstation 20 with the highest expected value together with its training model.

Assume K = 4 and P = 2. When automatically searching for the best training model and a suitable workstation 20, the processing module 232 samples with any two workstations 20 and two computing cores 234 at a time, selects the one of the two with the higher expected value, compares it against the remaining workstations 20, and finally selects the workstation 20 with the highest expected value together with its training model.

In this way, because the sampling algorithm model ultimately selects the workstation 20 with the higher expected value, a training model better suited to the environment can be chosen, which in turn improves learning efficiency.

It is worth noting that the formula of the aforementioned Thompson sampling algorithm is widely used at present; since a person of ordinary skill in the art can infer the details from the above description, it is not described further here.
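For concreteness, a minimal Thompson-sampling sketch for choosing among K candidate workstations, each holding a different training model, is given below; it corresponds to the P = 1 case, sampling one workstation at a time. Modelling each trial as a Bernoulli success with a Beta posterior is the textbook form of the algorithm and is an assumption here, since the patent does not fix the reward model.

```python
import random

def thompson_select(successes, failures):
    """Sample one expected value per workstation and return the index to try next."""
    draws = [random.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

def choose_workstation(evaluate, k=4, rounds=200):
    """evaluate(i) -> True/False: did workstation i's training model succeed this trial?"""
    successes, failures = [0] * k, [0] * k
    for _ in range(rounds):
        i = thompson_select(successes, failures)
        if evaluate(i):
            successes[i] += 1
        else:
            failures[i] += 1
    # finally pick the workstation whose posterior mean success rate is highest
    return max(range(k), key=lambda i: (1 + successes[i]) / (2 + successes[i] + failures[i]))

# Example with K = 4 workstations whose models succeed with different (unknown) rates.
best = choose_workstation(lambda i: random.random() < [0.2, 0.5, 0.7, 0.4][i])
print("selected workstation:", best)
```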

From the above description, the advantages of the foregoing embodiment can be summarized as follows:

1. The present invention divides the reinforcement learning system into agent devices 1 that only need to observe the environment and interact with it, and a server device 2 that performs training on configured hardware resources located in the cloud. This not only simplifies the hardware architecture and software resources of each agent device 1, but also greatly improves learning efficiency through the server device 2.

2. Through the special design described above, the present invention modularizes the workstations 20, training models, and action noise models located in the cloud, so that these modularized workstations 20, training models, and action noise models can be repeatedly reused or updated, allowing the server device 2 of the present invention to be applied to different agent devices 1 and effectively improving its usefulness.

3. With the modular design described above, the present invention can directly update a training model that has completed training into an estimation model for final verification, further improving learning efficiency.

4. In addition, in the present invention the processing module 232 can take the number of connected agent devices 1 and the number of computing cores 234 as parameters and, through the sampling algorithm model, automatically find the best training model and a suitable workstation 20, further improving learning efficiency.

However, the above are merely embodiments of the present invention and shall not limit the scope of its implementation; all simple equivalent changes and modifications made according to the claims and the content of the specification of the present invention still fall within the scope covered by the patent of the present invention.

1: agent device
10: setting condition
11: state set
12: action set
13: feedback message
2: server device
20: workstation
21: connection unit
22: storage unit
221: training model pool
222: estimation model pool
223: action noise model pool
224: buffer data model pool
23: processing unit
231: compute module
232: processing module
233: memory space
234: computing core
401~408: steps
501~509: steps
601~606: steps
R: reward value
S: stop signal

Other features and effects of the present invention will be clearly presented in the embodiments described with reference to the drawings, in which:
Figure 1 is a block diagram illustrating an embodiment of the reinforcement learning system of the present invention;
Figure 2 is a block diagram illustrating that, in this embodiment, the state sets, action sets, and feedback messages can be passed between an agent device and the corresponding workstation that store the same identification code;
Figure 3 is a schematic diagram of a cart-pole;
Figure 4 is a flowchart illustrating the steps of the reinforcement learning method of the present invention in combination with this embodiment;
Figure 5 is a flowchart illustrating the steps of a reward mode in this embodiment;
Figure 6 is a block diagram illustrating the relationship between the reward mode and time in this embodiment;
Figure 7 is a schematic diagram illustrating the modularized buffer data models, training models, estimation models, and action noise model in this embodiment;
Figure 8 is a flowchart illustrating the steps of an estimation mode in this embodiment;
Figure 9 is a block diagram illustrating the relationship between the estimation mode and time in this embodiment;
Figure 10 is a schematic diagram illustrating, with one computing core as an example, how the best training model is found automatically in this embodiment; and
Figure 11 is a schematic diagram illustrating, with two computing cores as an example, how the best training model is found automatically in this embodiment.

1: agent device
10: setting condition
11: state set
12: action set
13: feedback message
2: server device
223: action noise model pool
224: buffer data model pool
23: processing unit
231: compute module
232: processing module
233: memory space
20: workstation
21: connection unit
22: storage unit
221: training model pool
234: computing core
R: reward value
S: stop signal

Claims (12)

1. A server device of a reinforcement learning system, connected to at least one agent device, the at least one agent device transmitting, according to a setting condition, state sets related to the state of an environment, receiving action sets for executing actions, and transmitting feedback messages produced by interaction with the environment in the course of executing the actions, the setting condition being used to achieve a goal, the server device comprising: a connection unit communicating with the at least one agent device over a network, for receiving the setting condition, the state sets, and the feedback messages of the at least one agent device and transmitting the action sets to the at least one agent device; a storage unit storing a plurality of training models; and a processing unit including a compute module and a processing module connected to the connection unit, the storage unit, and the compute module, the compute module having a memory space and a plurality of computing cores, wherein the processing module, according to the setting condition, allocates a predetermined proportion of the memory space and a predetermined number of the computing cores as at least one workstation and, after entering a reward mode, selects one of the training models to be temporarily stored in the at least one workstation; the at least one workstation produces an initial action set by taking an initial state set as a parameter and feeding it into the selected training model, and then, taking the current action set, the feedback message produced in response to the current action set, and the next state set as parameters, feeds them into the selected training model to perform reinforcement learning and produce the next action set, until the goal is achieved; and the training model selected by the processing module is determined by the processing module through a sampling algorithm model according to the number of connected agent devices and the number of computing cores.

2. The server device of the reinforcement learning system according to claim 1, connected to a plurality of agent devices, wherein the processing module allocates the predetermined proportion of the memory space as a plurality of workstations; when the connection unit connects to each agent device, the processing module generates a unique identification code for that agent device, transmits each identification code to the respective agent device through the connection unit, and temporarily stores it in the corresponding workstation; the identification code is used to pass the state set, the action set, and the feedback message between each agent device and the corresponding workstation that store the same identification code; the identification code includes a main digit string and a secondary digit string, different main digit strings representing different agent devices in different environments, for which the processing module selects different training models, and different secondary digit strings representing different agent devices in the same environment, for which the processing module selects different training models according to the different agent devices that are in the same environment but execute different actions.

3. The server device of the reinforcement learning system according to claim 2, wherein the storage unit includes a training model pool and an estimation model pool, the training model pool is used to store the training models, and each workstation further, after the corresponding agent device achieves the goal, updates the training model that has achieved the goal into an estimation model and stores it in the estimation model pool.

4. The server device of the reinforcement learning system according to claim 3, wherein the storage unit further stores at least one action noise module used to increase exploration of the environment, the processing module further temporarily stores the at least one action noise model in the at least one workstation, and the at least one workstation further takes each action set as a parameter, feeds it into the action noise model, and optimizes each action set.

5. The server device of the reinforcement learning system according to claim 4, wherein the storage unit further stores a plurality of buffer data models used to determine how data are accessed and temporarily stored, and the processing module further temporarily stores one of the buffer data models in the at least one workstation, so that the selected training model, the selected buffer data model, and the action noise model are temporarily stored in the at least one workstation.

6. The server device of the reinforcement learning system according to claim 3, wherein each workstation further, after the corresponding agent device achieves the goal, stores the identification code and the estimation model in the estimation model pool and establishes a correspondence between them; the processing module further, after entering an estimation mode according to the setting condition, selects, according to the identification code of each agent device, the estimation model corresponding to the same identification code and temporarily stores it in the corresponding workstation; and the at least one workstation further produces an initial action set through the estimation model by taking the current state set as a parameter, and then produces the next action set by taking the next state set as a parameter, until the goal is achieved.

7. The server device of the reinforcement learning system according to claim 1, wherein the sampling algorithm model is the Thompson sampling algorithm.

8. A reinforcement learning method for causing at least one agent device to carry out the following steps: (a) communicating with a server device over a network according to a setting condition, the setting condition being used to achieve a goal; (b) transmitting the setting condition to the server device so that the server device and the at least one agent device enter a reward mode according to the setting condition; (c) transmitting a state set related to the state of an environment to the server device; (d) receiving an action set for executing actions, the action set being produced by at least one workstation of the server device by taking the state set of step (c) as a parameter and feeding it into a training model; (e) interacting with the environment according to the current action set and transmitting a feedback message and the next state set to the at least one workstation; (f) receiving the next action set for executing actions, the next action set being produced by the at least one workstation by taking the action set of step (e), the feedback message of step (e), and the state set of step (e) as parameters and feeding them into the training model to perform reinforcement learning; (g) repeating steps (e) to (f) until the goal is achieved.

9. The reinforcement learning method according to claim 8, further comprising steps (h) to (l) after step (g), wherein step (b) includes: (b-1) transmitting the setting condition to the server device; (b-2) determining whether the reward mode applies; if so, performing step (c), and if not, performing step (h); (h) transmitting the state set to the server device; (i) receiving the action set, the action set being produced by the at least one workstation by taking the state set of step (h) as a parameter and feeding it into an estimation model; (j) interacting with the environment according to the current action set and transmitting the next state set to the at least one workstation; (k) receiving the next action set, the next action set being produced by the at least one workstation by taking the state set of step (j) as a parameter and feeding it into the estimation model; (l) repeating steps (j) to (k) until the goal is achieved.

10. A reinforcement learning method carried out by a server device through the following steps: (a) communicating with at least one agent device over a network; (b) receiving a setting condition from the at least one agent device, the setting condition being used to achieve a goal; (c) allocating, according to the setting condition, a predetermined proportion of memory space as at least one workstation; (d) entering a reward mode according to the setting condition and selecting a training model to be temporarily stored in the at least one workstation; (e) receiving from the at least one agent device an initial state set related to the state of an environment; (f) the at least one workstation producing an initial action set by taking the initial state set as a parameter and feeding it into the training model; (g) transmitting the action set to the at least one agent device so that the at least one agent device interacts with the environment according to the action set; (h) receiving a feedback message and the next state, the feedback message and the next state being produced after the current action set of the at least one agent device interacts with the environment; (i) the at least one workstation taking the current action set, the feedback message produced in response to the current action set, and the next state set as parameters, feeding them into the training model to perform reinforcement learning, and producing the next action set; (j) repeating steps (g) to (i) until the goal is achieved; (k) the at least one workstation updating the training model that has achieved the goal into an estimation model, so that the identification code corresponds to the estimation model.

11. The reinforcement learning method according to claim 10, wherein in step (a) the server device communicates with a plurality of agent devices, and in step (c) the server device allocates, according to the setting condition, a predetermined proportion of memory space as a plurality of workstations, generates a unique identification code for each connected agent device, transmits each identification code to the respective agent device, and temporarily stores it in the corresponding workstation; the identification code is used to pass the state set, the action set, and the feedback message between each agent device and the corresponding workstation that store the same identification code; the identification code includes a main digit string and a secondary digit string, different main digit strings representing different agent devices in different environments, so that in step (d) the server device selects different training models according to the different agent devices, and different secondary digit strings representing different agent devices in the same environment, for which the processing module selects different training models according to the different agent devices that are in the same environment but execute different actions.

12. The reinforcement learning method according to claim 11, further comprising steps (l) to (r) after step (k), wherein step (d) includes: (d-1) determining, according to the setting condition, whether the reward mode applies; if so, performing step (d-2), and if not, performing step (l); (d-2) selecting a training model to be temporarily stored in the at least one workstation and then performing step (e); (l) determining, according to the setting condition, whether an estimation mode applies; if so, performing step (m), and if not, performing step (r); (m) selecting, according to the identification code of the at least one agent device, the training model corresponding to the same identification code and temporarily storing it in the at least one workstation; (n) receiving from the at least one agent device a state set related to the state of the environment; (o) the at least one workstation producing the action set by taking the state set as a parameter and feeding it into the estimation model; (p) transmitting the action set to the corresponding agent device so that the corresponding agent device interacts with the environment according to the action set; (q) repeating steps (n) to (p) until the goal is achieved; (r) ending.
TW107146985A 2018-12-25 2018-12-25 Reinforcement learning system and its servo device and reinforcement learning method TWI705387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW107146985A TWI705387B (en) 2018-12-25 2018-12-25 Reinforcement learning system and its servo device and reinforcement learning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW107146985A TWI705387B (en) 2018-12-25 2018-12-25 Reinforcement learning system and its servo device and reinforcement learning method

Publications (2)

Publication Number Publication Date
TW202025006A TW202025006A (en) 2020-07-01
TWI705387B true TWI705387B (en) 2020-09-21

Family

ID=73004983

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107146985A TWI705387B (en) 2018-12-25 2018-12-25 Reinforcement learning system and its servo device and reinforcement learning method

Country Status (1)

Country Link
TW (1) TWI705387B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW445736B (en) * 1998-12-07 2001-07-11 Lucent Technologies Inc Dynamic configuration of digital subscriber line channels
TWI236924B (en) * 1999-11-03 2005-08-01 Arcade Planet Inc Prize redemption system for games executed over a wide area network
TW201643766A (en) * 2015-06-02 2016-12-16 緯創資通股份有限公司 Method, system and server for self-healing of electronic apparatus

Also Published As

Publication number Publication date
TW202025006A (en) 2020-07-01

Similar Documents

Publication Publication Date Title
WO2021121029A1 (en) Training model updating method and system, and agent, server and computer-readable storage medium
Nguyen et al. Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning
WO2020094060A1 (en) Recommendation method and apparatus
US20210158904A1 (en) Compound property prediction method and apparatus, computer device, and readable storage medium
JP2019506664A5 (en)
CN109194583A (en) Network congestion Diagnosis of Links method and system based on depth enhancing study
JP2017004555A5 (en)
Li et al. Online federated multitask learning
TWI700599B (en) Method and device for embedding relationship network diagram, computer readable storage medium and computing equipment
JP2015531939A5 (en)
CN108665541B (en) A kind of ground drawing generating method and device and robot based on laser sensor
WO2020199690A1 (en) Cloud platform-based sharing learning system and method, sharing platform and method, and medium
CN113408209A (en) Cross-sample federal classification modeling method and device, storage medium and electronic equipment
CN110415521A (en) Prediction technique, device and the computer readable storage medium of traffic data
CN110858973A (en) Method and device for predicting network traffic of cell
CN113392919B (en) Deep belief network DBN detection method of attention mechanism
WO2019154215A1 (en) Robot running path generation method, computing device and storage medium
CN110235149A (en) Neural plot control
TW201933050A (en) Method and device for determining pupil position
TWI705387B (en) Reinforcement learning system and its servo device and reinforcement learning method
Ahmed et al. Deep reinforcement learning for multi-agent interaction
CN111106960A (en) Mapping method and mapping device of virtual network and readable storage medium
CN116862025A (en) Model training method, system, client and server node, electronic device and storage medium
CN113034297A (en) Complex network key node identification method and system based on node attraction
CN116415655A (en) Decentralizing asynchronous federation learning method based on directed acyclic graph