TWI705387B - Reinforcement learning system, server device, and reinforcement learning method
- Publication number: TWI705387B
- Application number: TW107146985A
- Authority: TW (Taiwan)
- Prior art keywords: workstation, action, reinforcement learning, model, action set
- Prior art date: 2018-12-25
Landscapes
- Feedback Control In General (AREA)
Abstract
A reinforcement learning system includes at least one agent device and a server device. According to a setting condition, the at least one agent device transmits, over a network, a number of state sets describing the state of the environment, receives a number of action sets for performing actions, and transmits to the server device a number of feedback messages generated after interacting with the environment. According to the setting condition, the server device allocates a predetermined proportion of memory space as at least one workstation and selects a training model to be temporarily stored in the at least one workstation. Taking the current state set, action set, and feedback message as parameters, the at least one workstation feeds them into the training model, performs reinforcement learning, and generates the next action set, until the goal is achieved. Because training is carried out on configured hardware resources located in the cloud, the invention not only improves learning efficiency; its modular design also allows the reinforcement learning architecture to be reused and updated more easily and applied to different agent devices, broadening its scope of application and practicality.
Description
The present invention relates to a reinforcement learning system, and in particular to a reinforcement learning system that performs reinforcement learning training, together with its server device and reinforcement learning method.

Reinforcement learning (RL) is arguably the hottest direction in artificial intelligence today, and its fame owes much to the DeepMind team's great success in applying it to AlphaGo and AlphaZero.

Reinforcement learning is a learning approach close to human learning, emphasizing how to act based on the environment so as to maximize expected benefit. Take laboratory rats learning to operate a lever to obtain food as an example: a rat is the agent that makes decisions. In its initial state, having had no opportunity to explore the environment, its behavior is at first random and not goal-oriented, until it inadvertently touches the lever placed in the specially arranged environment and, through the action of pulling the lever, unexpectedly obtains food, that is, a reward for the behavior. Driven by the brain's reward mechanism, the rats then adopt a goal-oriented learning style: to obtain more food rewards, they may gather beside the lever and keep trying until they learn the correct lever-pulling action.
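The agent-environment cycle described above (observe a state, take an action, receive a reward, improve the policy) can be written as a short loop. The following Python sketch is purely illustrative; the `env`, `select_action`, and `update` interfaces are assumptions for this example and do not come from the patent.

```python
def run_episode(env, select_action, update, max_steps=200):
    """One episode of the agent-environment loop: observe, act, collect the
    reward, and learn from each transition. All interfaces are assumed."""
    state = env.reset()                              # initial state: nothing explored yet
    total_reward = 0.0
    for _ in range(max_steps):
        action = select_action(state)                # act based on the environment
        next_state, reward, done = env.step(action)  # interact and receive the reward
        update(state, action, reward, next_state)    # goal-oriented learning step
        total_reward += reward
        state = next_state
        if done:                                     # e.g. the task ended or failed
            break
    return total_reward
```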
Because reinforcement learning must act based on the environment, conventional architectures place everything on the agent. Learning efficiency is then limited by the agent's own resources and performance; more importantly, every agent must be equipped with all of the hardware architecture and software resources related to reinforcement learning, and once the learning objective changes, the original reinforcement learning architecture cannot be reused or updated. There is therefore still considerable room for improvement in scope of application and practicality.

The object of the present invention is therefore to provide a reinforcement learning system, together with its server device and reinforcement learning method, that improves learning efficiency and broadens the scope of application and practicality.

Accordingly, the present invention provides such a reinforcement learning system, its server device, and its reinforcement learning method.

The effect of the present invention is as follows: because training is performed on configured hardware resources located in the cloud, learning efficiency is improved; moreover, the modular design allows the reinforcement learning architecture to be reused and updated more easily and applied to different agent devices, broadening the scope of application and practicality.
Referring to FIG. 1 and FIG. 2, an embodiment of the reinforcement learning system of the present invention includes a number of agent devices 1 and a server device 2.

According to a setting condition 10, each agent device 1 transmits a number of state sets 11 describing the state of the environment, receives a number of action sets 12 for performing actions, and transmits a feedback message 13 generated after interacting with the environment, as well as a stop signal S for terminating the action. The setting condition 10 is used to achieve a goal. The feedback message 13 includes a reward value R for the preceding action.

It should be noted that the number of agent devices 1 is not limited to more than one; in other variations of this embodiment, there may be only one.

The server device 2 includes a connection unit 21, a storage unit 22, and a processing unit 23.

In this embodiment, the connection unit 21 communicates with the agent devices 1 through a communication technology; it receives each agent device 1's setting condition 10, state sets 11, and feedback messages 13, and transmits the action sets 12 to each agent device 1.

In this embodiment, the storage unit 22 is a memory or storage medium and includes a training model pool 221 for storing a number of training models, an estimation model pool 222 for storing a number of estimation models, an action noise model pool 223 for storing an action noise model, and a buffer data model pool 224 for storing a number of buffer data models. In this embodiment, the training models include, but are not limited to, a DQN module, a DDPG module, an A3C module, a PPO module, and a Q-learning module. Each estimation model is one of the training models that has completed training and can achieve the goal. The action noise model is used to increase the at least one agent device 1's exploration of the environment and includes, but is not limited to, an ε-greedy module and an Ornstein-Uhlenbeck module. The buffer data model determines how data are accessed and temporarily stored and includes, but is not limited to, a replay memory module and a simple buffer module.
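The patent names these modules but not their internals. As a hedged illustration only, the sketch below shows what a replay memory module, a simple buffer module, and an ε-greedy action noise module conventionally look like; every implementation detail here is an assumption.

```python
import random
from collections import deque

# Buffer data models (pool 224) decide how transitions are stored and taken out.
class ReplayMemory:
    """Replay memory module: keep recent transitions, sample random minibatches."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)
    def push(self, transition):
        self.buffer.append(transition)
    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

class SimpleBuffer:
    """Simple buffer module: consume transitions in arrival order, then clear."""
    def __init__(self):
        self.buffer = []
    def push(self, transition):
        self.buffer.append(transition)
    def drain(self):
        items, self.buffer = self.buffer, []
        return items

# Action noise models (pool 223) increase the agent's exploration.
def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```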
The processing unit 23 includes a computing module 231, and a processing module 232 connected to the connection unit 21, the storage unit 22, and the computing module 231.

In this embodiment, the computing module 231 may be composed of one or more central processing units (CPUs), one or more graphics processing units (GPUs), or a combination of one or more CPUs and GPUs. The computing module 231 also has a memory space 233 and a number of computing cores 234.

In this embodiment, the processing module 232 is composed of one or more central processing units; according to each agent device 1's setting condition 10, it allocates a predetermined proportion of the memory space 233 and a predetermined number of computing cores 234 as one or more workstations 20 (workers).

Referring to FIG. 2 and FIG. 3, for convenience of description the agent devices 1 are numbered 1a, 2a, 3a-1, 3a-2, and 3a-3, corresponding to environments numbered 1a, 2a, 3a-1, 3a-2, and 3a-3 and to workstations 20 numbered 1a, 2a, 3a-1, 3a-2, and 3a-3. Take environment 1a, a cart-pole (a cart 3 carrying a pole 31), as an example: the task is how to move the cart 3 left or right so that the pole 31 on the cart 3 remains standing without falling over. Its setting condition 10 is shown in Table 1.
Table 1:

| Setting condition 10 for environment 1a (cart-pole) | Value |
|---|---|
| Memory space 233 | 10% |
| Computing cores 234 | 1 |
| Workstations 20 | 1 |
| Episode termination (stop signal S) | pole 31 angle beyond ±12°, cart 3 position beyond ±2.4 m, or more than 200 steps |
| Goal | more than 200 steps in each of 10 consecutive episodes |
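In practice the setting condition 10 would travel from the agent device to the server device as a structured message. The following Python dict is a hypothetical rendering of Table 1; every key name is invented for illustration, as the patent does not specify a wire format.

```python
# Hypothetical setting-condition payload for agent device 1a.
# All field names are assumptions; the patent does not define a message format.
setting_condition = {
    "mode": "reward",                  # reward (training) mode or estimation mode
    "memory_fraction": 0.10,           # 10% of the memory space 233
    "num_cores": 1,                    # computing cores 234
    "num_workstations": 1,             # workstations 20
    "training_model": "dqn",           # chosen from the training model pool 221
    "buffer_model": "replay_memory",   # chosen from the buffer data model pool 224
    "action_noise": "epsilon_greedy",  # chosen from the action noise model pool 223
    "goal": {"min_steps": 200, "consecutive_episodes": 10},
}
```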
Referring to FIG. 4, FIG. 1, and FIG. 2, the reinforcement learning method of this embodiment is implemented by the server device 2 through the following steps:

Step 401: Start.

Step 402: Establish a connection with agent device 1a through the connection unit 21, so that the server device 2 and agent device 1a communicate with each other.

Step 403: According to agent device 1a's setting condition 10, the processing module 232 allocates a predetermined proportion of the memory space 233 and a predetermined number of computing cores 234 as one or more workstations 20.

It should be noted that the setting condition 10 is not limited to using 10% of the memory space 233, one computing core 234, and one workstation 20 as in Table 1. In other variations of this embodiment, 20% to 100% of the memory space 233, P computing cores 234, and K workstations 20 may be used, in which case each workstation 20 uses (20%~100%)/K of the memory space 233, with P ≦ K.
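As a quick numeric illustration of that allocation rule, the sketch below splits a memory budget evenly across K workstations under the stated P ≦ K constraint; the function and its checks are this example's assumptions, not code from the patent.

```python
def allocate_workstations(total_fraction, num_cores, num_workstations):
    """Split a total memory budget evenly across K workstations, P <= K cores."""
    assert 0.20 <= total_fraction <= 1.00, "variant described uses 20%-100%"
    assert num_cores <= num_workstations, "P must not exceed K"
    per_workstation = total_fraction / num_workstations
    return [per_workstation] * num_workstations

# Example: a 60% memory budget over 3 workstations (2 cores) -> 20% each.
print(allocate_workstations(0.60, num_cores=2, num_workstations=3))
```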
Step 404: The processing module 232 generates a unique identification code U for the connected agent device 1, and temporarily stores the identification code U, together with the training model, buffer data model, and action noise model selected in the setting condition 10, in the corresponding workstation 20.

It is worth noting that the identification code U includes a primary number string U1 (a Universally Unique Identifier, UUID) and a secondary number string U2 (room_id). Different primary number strings U1 represent different agent devices 1 in different environments. Different secondary number strings U2 represent different agent devices 1 performing different actions in the same environment; if an environment contains only one agent device 1 performing one kind of action, the secondary number string U2 is omitted.

As shown in FIG. 2, environment 1a is the cart-pole: the identification code U of agent device 1a in environment 1a is /ikSYiGCPfUVSNbs5NmhQHe/. Environment 2a is a brick-breaking game: the identification code U of agent device 2a in environment 2a is /F6bdVeRR86gsm6rJtLz3af/. Environments 3a-1, 3a-2, and 3a-3 are the same obstacle-avoidance space: the identification code U of agent device 3a-1 in environment 3a-1 is /CjTYQXbCbdKQ9ssi79Z7C7/01/, that of agent device 3a-2 in environment 3a-2 is /CjTYQXbCbdKQ9ssi79Z7C7/02/, and that of agent device 3a-3 in environment 3a-3 is /CjTYQXbCbdKQ9ssi79Z7C7/03/.
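A minimal sketch of how such identification codes could be composed from U1 and U2 follows; the path-style formatting mirrors the examples above, while the helper name and the use of Python's `uuid` module are assumptions.

```python
import uuid

def make_identification_code(primary=None, room_id=None):
    """Compose U as /U1/ or /U1/U2/: U1 identifies the environment's agent,
    U2 distinguishes agents sharing the same environment (assumed helper)."""
    u1 = primary or uuid.uuid4().hex   # primary number string U1
    if room_id is None:                # single agent: secondary string omitted
        return f"/{u1}/"
    return f"/{u1}/{room_id:02d}/"     # secondary number string U2 (room_id)

# Agent devices 3a-1..3a-3 share one environment: same U1, different U2.
shared_u1 = "CjTYQXbCbdKQ9ssi79Z7C7"
print([make_identification_code(shared_u1, i) for i in (1, 2, 3)])
# ['/CjTYQXbCbdKQ9ssi79Z7C7/01/', '/CjTYQXbCbdKQ9ssi79Z7C7/02/', '/CjTYQXbCbdKQ9ssi79Z7C7/03/']
```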
Step 405: The processing module 232 transmits the identification code U to agent device 1a through the connection unit 21, so that the state sets 11, action sets 12, and feedback messages 13 are exchanged between the agent device 1 and the corresponding workstation 20 that store the same identification code U, i.e., between agent device 1a and workstation 1a.

Step 406: The processing module 232 determines from the setting condition 10 whether the reward mode is selected. If so, proceed to step 500; if not, proceed to step 407.

Step 407: The processing module 232 determines from the setting condition 10 whether the estimation mode is selected. If so, proceed to step 600; if not, proceed to step 408.

Step 408: End.
Referring to FIG. 5, FIG. 2, FIG. 6, and FIG. 7, step 500 includes:

Step 501: Start.

Step 502: Receive the initial state set 11 from agent device 1a through the connection unit 21. The state set 11 is obtained by the agent device 1 observing the environment.

Step 503: Workstation 1a temporarily stores the state set 11 in the memory space 233 and, through the computing core 234, takes the initial state set 11 as a parameter and feeds it into the selected training model to produce the initial action set 12.

Step 504: Workstation 1a transmits the action set 12 to agent device 1a through the connection unit 21, so that agent device 1a interacts with environment 1a according to the action set 12 and thereby changes the state of environment 1a. As shown in FIG. 3, agent device 1a pushes the cart 3 one step to the left or one step to the right, changing states such as the position and velocity of the cart 3 and the angle and angular velocity of the pole 31.

It is worth noting that states such as the position and velocity of the cart 3 and the angle and angular velocity of the pole 31 are obtained by sensors (not shown) of the agent device 1, which detect and observe the positions of the cart 3 and the pole 31 and compute their velocities from those positions. This technique for obtaining states has already been disclosed in the prior art of reinforcement learning and is not a technical feature of this application; since a person of ordinary skill in the art can infer the details from the above description, it is not described further.

Step 505: Workstation 1a determines whether the stop signal S has been received through the connection unit 21. If so, proceed to step 506: the signal indicates that the angle of the pole 31 has exceeded ±12° and the pole may have fallen, that the position of the cart 3 has exceeded ±2.4 m and left the detectable range, or that the number of steps in this episode has exceeded 200. If not, the determination continues.

Step 506: Workstation 1a receives, through the connection unit 21, the feedback message 13 and the next state set 11 transmitted by agent device 1a, thereby obtaining the reward value R for the preceding action. The next state set 11 is produced in step 504 when agent device 1a interacts with environment 1a using the initial action set 12, for example a change in the position or velocity of the cart 3, or in the angle or angular velocity of the pole 31.

Step 507: Workstation 1a temporarily stores the current action set 12, the feedback message 13 produced for the current action set 12, and the next state set 11 in the memory space 233; then, through the computing core 234, it takes these as parameters, feeds them into the training model, performs reinforcement learning, and produces the next action set 12.

Step 508: Workstation 1a determines, through the computing core 234, whether the cart 3 has exceeded 200 steps in each of 10 consecutive episodes, thereby achieving the goal. If so, proceed to step 509; if not, return to step 504.

Step 509: End.
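Taken together, steps 501 to 509 behave like the following server-side loop. This is a minimal sketch: the `conn` object standing in for the connection unit 21 and the `model` object with `act`/`learn` methods are assumed interfaces, not APIs defined by the patent.

```python
def reward_mode(conn, model, goal_episodes=10, goal_steps=200):
    """Sketch of the reward (training) mode, steps 501-509. `conn` and
    `model` are assumed interfaces; the patent defines the data flow only."""
    state = conn.receive_state()                        # step 502: initial state set 11
    action = model.act(state)                           # step 503: initial action set 12
    streak = 0
    while streak < goal_episodes:                       # step 508: goal reached?
        conn.send_action(action)                        # step 504: agent acts on the env
        conn.wait_for_stop_signal()                     # step 505: stop signal S
        reward, next_state, steps = conn.receive_feedback()   # step 506: reward value R
        model.learn(state, action, reward, next_state)  # step 507: reinforcement learning
        state, action = next_state, model.act(next_state)     # step 507: next action set 12
        streak = streak + 1 if steps > goal_steps else 0
    return model                                        # step 509: training complete
```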
Referring to FIG. 2, FIG. 8, and FIG. 9, step 600 includes:

Step 601: Start.

Step 602: Receive the state set 11 from agent device 1a through the connection unit 21.

Step 603: Workstation 1a temporarily stores the state set 11 in the memory space 233 and, through the computing core 234, takes the state set 11 as a parameter and feeds it into the selected estimation model to produce the action set 12.

Step 604: Workstation 1a transmits the action set 12 to agent device 1a through the connection unit 21, so that agent device 1a interacts with environment 1a according to the action set 12 and thereby changes the state of environment 1a.

Step 605: Workstation 1a determines, through the computing core 234, whether the cart 3 has exceeded 200 steps in each of 10 consecutive episodes, thereby achieving the goal. If so, proceed to step 606; if not, return to step 602.

Step 606: End.
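The estimation mode is the same loop with the learning update removed: the trained estimation model only acts. Below is a minimal sketch under the same assumed `conn`/`model` interfaces as above, including the additional assumption that `conn.send_action` reports the resulting episode length.

```python
def estimation_mode(conn, model, goal_episodes=10, goal_steps=200):
    """Sketch of the estimation mode, steps 601-606: act, never learn."""
    streak = 0
    while streak < goal_episodes:                 # step 605: goal reached?
        state = conn.receive_state()              # step 602: state set 11
        action = model.act(state)                 # step 603: estimation model
        episode_steps = conn.send_action(action)  # step 604: agent interacts
        streak = streak + 1 if episode_steps > goal_steps else 0
```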
Referring to FIG. 10 and FIG. 11, it is also worth noting that in step 404 the training model need not be selected by the setting condition 10. In other variations of this embodiment, the processing module 232 may take the number of connected agent devices 1 and the number of computing cores 234 as parameters and determine the training model through a sampling algorithm model, thereby automatically finding the best training model. In this embodiment, the sampling algorithm model is the Thompson Sampling algorithm; taking P computing cores 234 and K workstations 20 as an example, it works as follows:

Assume K = 4 and P = 1. To automatically find the best training model and a suitable workstation 20, the processing module 232 samples with only one workstation 20 and one computing core 234 at a time to obtain an expected value, and then selects the workstation 20 with the highest expected value, together with its training model.

Assume K = 4 and P = 2. To automatically find the best training model and a suitable workstation 20, the processing module 232 samples with any two workstations 20 and two computing cores 234 at a time, keeps the one of the two with the higher expected value, compares it against the remaining workstations 20, and finally selects the workstation 20 with the highest expected value, together with its training model.

In this way, because the sampling algorithm model ultimately selects the workstation 20 with the higher expected value, a training model better suited to the environment can be chosen, which improves learning efficiency.

It is worth noting that the formula of the aforementioned Thompson Sampling algorithm is widely used at present; since a person of ordinary skill in the art can infer the details from the above description, it is not described further.
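For reference, a conventional Beta-Bernoulli formulation of Thompson Sampling is sketched below. Treating each workstation 20 and its training model as one bandit arm, and scoring a pull by whether an episode exceeds 200 steps, are this sketch's assumptions; the patent does not fix the reward model.

```python
import random

def thompson_sampling(num_arms, pull, rounds=1000):
    """Beta-Bernoulli Thompson Sampling over workstation/model pairs ("arms").
    `pull(arm)` must return 1 (success) or 0 (failure), e.g. whether one
    episode exceeded 200 steps; that reward model is an assumption."""
    successes = [1] * num_arms                       # Beta(1, 1) uniform priors
    failures = [1] * num_arms
    for _ in range(rounds):
        # Sample an expected value for each arm from its posterior, then
        # play the arm whose sampled expected value is highest.
        samples = [random.betavariate(successes[a], failures[a])
                   for a in range(num_arms)]
        arm = max(range(num_arms), key=samples.__getitem__)
        if pull(arm):
            successes[arm] += 1
        else:
            failures[arm] += 1
    # The arm with the best posterior mean is the selected workstation/model.
    return max(range(num_arms),
               key=lambda a: successes[a] / (successes[a] + failures[a]))
```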
Based on the above description, the advantages of the foregoing embodiment can be summarized as follows:

1. The present invention divides the reinforcement learning system into agent devices 1, which only need to observe and interact with the environment, and a server device 2, which performs training on configured hardware resources located in the cloud. This not only simplifies the hardware architecture and software resources of each agent device 1, but also substantially improves learning efficiency through the server device 2.

2. Through the special design described above, the present invention modularizes the cloud-based workstations 20, training models, and action noise models, so that these modular components can be reused or updated repeatedly; the server device 2 of the present invention thus serves different agent devices 1, effectively improving its usefulness.

3. With the aforementioned modular design, a training model that has completed training can be directly promoted to an estimation model for final verification, further improving learning efficiency.

4. In addition, the processing module 232 of the present invention may take the number of connected agent devices 1 and the number of computing cores 234 as parameters and, through the sampling algorithm model, automatically find the best training model and a suitable workstation 20, further improving learning efficiency.

However, the above are merely embodiments of the present invention and shall not limit the scope of its implementation; all simple equivalent changes and modifications made in accordance with the claims and the specification of the present invention remain within the scope covered by this patent.
Reference numerals: 1: agent device; 10: setting condition; 11: state set; 12: action set; 13: feedback message; 2: server device; 20: workstation; 21: connection unit; 22: storage unit; 221: training model pool; 222: estimation model pool; 223: action noise model pool; 224: buffer data model pool; 23: processing unit; 231: computing module; 232: processing module; 233: memory space; 234: computing core; 401-408: steps; 501-509: steps; 601-606: steps; R: reward value; S: stop signal.
Other features and effects of the present invention will be presented clearly in the embodiments described with reference to the drawings, in which: FIG. 1 is a block diagram illustrating an embodiment of the reinforcement learning system of the present invention; FIG. 2 is a block diagram illustrating that, in this embodiment, the state sets, action sets, and feedback messages are exchanged between an agent device and the corresponding workstation that store the same identification code; FIG. 3 is a schematic diagram of a cart-pole; FIG. 4 is a flowchart illustrating the steps of the reinforcement learning method of the present invention in this embodiment; FIG. 5 is a flowchart illustrating the steps of a reward mode in this embodiment; FIG. 6 is a block diagram illustrating the relationship between the reward mode and time in this embodiment; FIG. 7 is a schematic diagram illustrating the modularized buffer data models, training models, estimation models, and action noise models in this embodiment; FIG. 8 is a flowchart illustrating the steps of an estimation mode in this embodiment; FIG. 9 is a block diagram illustrating the relationship between the estimation mode and time in this embodiment; FIG. 10 is a schematic diagram illustrating, taking one computing core as an example, automatically finding the best training model in this embodiment; and FIG. 11 is a schematic diagram illustrating, taking two computing cores as an example, automatically finding the best training model in this embodiment.
Claims (12)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW107146985A | 2018-12-25 | 2018-12-25 | Reinforcement learning system and its server device and reinforcement learning method |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| TW202025006A | 2020-07-01 |
| TWI705387B | 2020-09-21 |