RU2458390C1

RU2458390C1 - Modified intelligent controller

Info

Publication number: RU2458390C1
Application number: RU2011113129/08A
Authority: RU
Inventors: Евгений Александрович Шумков (RU); Валерий Александрович Ботин (RU)
Priority date: 2011-04-05
Filing date: 2011-04-05
Publication date: 2012-08-10

Abstract

FIELD: information technology.

SUBSTANCE: the modified intelligent controller includes a control object and a solver, the output of the control object being connected to the first input of the solver, as well as a reinforcement calculation unit, an action unit, a Kalman filter, a Kalman filter memory and an action selection unit.

EFFECT: faster operation of the device.

1 dwg

Description

The invention relates to the class of intelligent controllers that use the principle of reinforcement learning and can be used to build control systems for objects operating in a non-deterministic environment.

US patent 6532454 (IPC G06F 15/18), "Stable adaptive control using critic designs", is known; it implements reinforcement learning using neural networks. The device according to that patent consists of a solver, a modeling neural network, a critic block, a block for calculating the prediction error, and the connections between these blocks.

The device according to US patent 6532454 (IPC G06F 15/18) operates as follows: the solver (a neural network) receives the reinforcement value, calculates the action for the current iteration and passes it to the modeling neural network, which calculates the predicted value of the system's operating parameter. After the action is executed, the system receives the actual value of the operating parameter, the critic calculates a new reinforcement value, and the modeling neural network is adjusted.

Also known is an intelligent controller based on adaptive critic networks, US patent 5448681 (IPC G06F 15/18). This device consists of a control object, a critic block and a solver based on a neural network. The outputs of the control object are connected to the inputs of the critic block and to the inputs of the solver neural network; the output of the solver neural network is connected to the control object and to the critic block; the output of the critic network is connected to the input of the solver neural network.

The device according to patent 5448681 (IPC G06F 15/18) operates as follows: the control object outputs a signal describing its state, the critic block predicts the quality of the chosen action from the data of the current time iteration and the state of the object, and the solver calculates the control action.

The disadvantages of the device according to patent 6532454 (IPC G06F 15/18) are that it does not store the operating history of the system and that the critic works with its initially configured parameters.

The disadvantages of the device according to patent 5448681 (IPC G06F 15/18) are the absence of a block for storing the operating history of the system and poor adaptability, caused by the rigidly fixed operating principle of the critic block.

A common disadvantage of both devices described above is insufficient speed, caused by the use of neural networks.

The technical result of the proposed device is an increase in operating speed.

The objective is to develop a modified intelligent controller with high operating speed.

The technical result is achieved in that a modified intelligent controller, containing a control object and a solver with the output of the control object connected to the first input of the solver, additionally includes a reinforcement calculation block, an action block, a Kalman filter, a Kalman filter memory and an action selection block. The output of the control object is also connected to the input of the reinforcement calculation block and to the first input of the action block; the first output of the action block is connected to the second input of the solver; the second output of the action block is connected to the third input of the Kalman filter and to the first input of the action selection block; the output of the reinforcement calculation block is connected to the second input of the action block and to the second input of the Kalman filter; the output of the solver is connected to the first input of the Kalman filter; the first output of the Kalman filter is connected to the second input of the action selection block; the second output of the Kalman filter is connected to the first input of the Kalman filter memory; the output of the Kalman filter memory is connected to the fourth input of the Kalman filter; the first output of the action selection block is connected to the input of the control object and to the third input of the action block; and the second output of the action selection block is connected to the second input of the Kalman filter memory.

The increase in operating speed is achieved because a Kalman filter, rather than neural networks, is used in the device to predict the reinforcement. A Kalman filter memory is also used: it stores the Kalman filter parameters for the selected action so that the parameters of the filter can later be restored without additional calculations. Separating the action block into its own component also increases speed, because it accumulates the operating history of the control object for subsequent use.

Thus, the set of features set out in the claims makes it possible to achieve the desired technical result.

The drawing shows a diagram of the modified intelligent controller with a Kalman filter.

The system consists of the following structural components: control object 1, solver 2, reinforcement calculation block 3, action block 4, Kalman filter 5, Kalman filter memory 6 and action selection block 7.

The system also contains the following connections: output 8 of the control object is connected to the first input 8.1 of the solver, to the first input 8.2 of the action block and to the input 8.3 of the reinforcement calculation block; output 9 of the reinforcement calculation block is connected via 9.1 to the action block and via 9.2 to the Kalman filter; output 10 of the action block is connected to the solver; the output of the solver is connected to the Kalman filter via connection 11; output 12 of the action block is connected via 12.1 to the action selection block and via 12.2 to the Kalman filter; output 13 of the Kalman filter is connected to the Kalman filter memory; the Kalman filter memory is connected via 14 to the Kalman filter; the Kalman filter is connected via 15 to the action selection block; the action selection block is connected via 16 to the Kalman filter memory; output 17 of the action selection block is connected via 17.1 to the action block and via 17.2 to the control object.
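The connections listed above can be written down as a small lookup table, which makes it easy to check how many inputs each block has. The sketch below is illustrative only: the block names and the dictionary layout are assumptions, and signal 8.2 is taken to feed the action block, as in the claims and in step 1 of the algorithm.

```python
# Hypothetical encoding of the signal links; block names are illustrative.
# Each entry maps a signal number to (source block, destination block).
LINKS = {
    "8.1": ("control_object", "solver"),
    "8.2": ("control_object", "action_block"),  # assumption: action block, per the claims
    "8.3": ("control_object", "reinforcement_calc"),
    "9.1": ("reinforcement_calc", "action_block"),
    "9.2": ("reinforcement_calc", "kalman_filter"),
    "10": ("action_block", "solver"),
    "11": ("solver", "kalman_filter"),
    "12.1": ("action_block", "action_selector"),
    "12.2": ("action_block", "kalman_filter"),
    "13": ("kalman_filter", "kalman_memory"),
    "14": ("kalman_memory", "kalman_filter"),
    "15": ("kalman_filter", "action_selector"),
    "16": ("action_selector", "kalman_memory"),
    "17.1": ("action_selector", "action_block"),
    "17.2": ("action_selector", "control_object"),
}


def inputs_of(block):
    """Return the set of signal numbers that arrive at the given block."""
    return {sig for sig, (_, dst) in LINKS.items() if dst == block}


# The Kalman filter has four inputs, the action block three.
print(len(inputs_of("kalman_filter")), len(inputs_of("action_block")))
```

Counting inputs this way agrees with the claims, which name four inputs for the Kalman filter and three for the action block.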

Solver 2 is a device that implements a mathematical formula (or several formulas) describing those variables of the control object that can be calculated directly.

Reinforcement calculation block 3 implements a mathematical formula that calculates the actual reinforcement value after the action (control) signal 17.2 has been executed by control object 1.

Action block 4 stores a table of the actions possible in specific situations.
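One possible way to picture such a table is a mapping from a situation label to its candidate actions, together with a history log. This is a minimal sketch under stated assumptions: the class name `ActionBlock`, the use of string labels for situations, and the record layout are all hypothetical and are not taken from the patent.

```python
class ActionBlock:
    """Illustrative action table: situation -> candidate actions, plus a history log."""

    def __init__(self, table):
        # situation label -> list of actions allowed in that situation
        self.table = {situation: list(actions) for situation, actions in table.items()}
        # accumulated (situation, action, predicted, actual) records
        self.history = []

    def possible_actions(self, situation):
        # Unknown situations yield no candidates in this sketch.
        return list(self.table.get(situation, []))

    def record(self, situation, action, predicted, actual):
        # Store one step of operating history for later use.
        self.history.append((situation, action, predicted, actual))


block = ActionBlock({"too_cold": ["heat", "idle"], "too_hot": ["cool", "idle"]})
print(block.possible_actions("too_cold"))
```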

Kalman filter 5 is designed to calculate an unobservable quantity. The Kalman filter is of a standard design, for example as in US patent No. 5115391 (IPC G06F 15/20).
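For orientation, a textbook scalar Kalman filter is sketched below. It is not the design of the cited patent: the random-walk state model, the class name `ScalarKalman` and the noise values `q` and `r` are assumptions chosen only to show the predict and update steps.

```python
class ScalarKalman:
    """Textbook one-dimensional Kalman filter with a random-walk state model."""

    def __init__(self, x=0.0, p=1.0, q=1e-3, r=1e-1):
        self.x = x  # current estimate of the hidden quantity
        self.p = p  # variance of that estimate
        self.q = q  # process noise variance (assumed)
        self.r = r  # measurement noise variance (assumed)

    def step(self, z):
        # Predict: the estimate carries over and its uncertainty grows by q.
        p_pred = self.p + self.q
        # Update: blend prediction and measurement z using the Kalman gain.
        k = p_pred / (p_pred + self.r)
        self.x = self.x + k * (z - self.x)
        self.p = (1.0 - k) * p_pred
        return self.x


kf = ScalarKalman(x=0.0, p=1.0, q=0.0, r=1.0)
print(kf.step(2.0))  # gain is 0.5 here, so the estimate moves halfway to 2.0
```

The pair `(x, p)` is the full parameter set of this filter, which is what a parameter memory would need to store.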

Kalman filter memory 6 is intended for temporary storage of the parameters of Kalman filter 5. The block stores as many sets of Kalman filter parameters as there are possible actions selected in action block 4.
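A memory that holds one parameter set per candidate action can be sketched as a dictionary keyed by action. The class name `KalmanMemory`, the method names and the representation of a parameter set as a plain dictionary are assumptions made for illustration.

```python
import copy


class KalmanMemory:
    """Illustrative store holding one Kalman parameter set per candidate action."""

    def __init__(self):
        self._slots = {}

    def save(self, action, params):
        # Deep-copy so later changes inside the filter do not alter the snapshot.
        self._slots[action] = copy.deepcopy(params)

    def restore(self, action):
        # Return a copy of the snapshot; no filter arithmetic is repeated.
        return copy.deepcopy(self._slots[action])

    def __len__(self):
        return len(self._slots)


memory = KalmanMemory()
params = {"x": 1.0, "p": 0.5}
memory.save("heat", params)
params["x"] = 99.0             # the live filter keeps changing
print(memory.restore("heat"))  # the stored snapshot is unaffected
```

Restoring is a lookup and a copy, which is the sense in which parameters come back without additional calculations.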

Action selection block 7 selects an action from those possible in the given situation on the basis of the "greedy rule".

The intelligent controller operates as follows. Control object 1 performs an action and produces state signal 8 at its output. The state signal is then fed to action block 4, reinforcement calculation block 3 and solver 2. Reinforcement calculation block 3 calculates the actual resulting reinforcement 9 and feeds it to action block 4 and Kalman filter 5. Solver 2, in turn, processes the signal from the control object and produces observed signal 11 at its output. Having received the state signal, action block 4 selects the actions possible in the given situation, taking into account the actual reinforcement from the last iteration, and passes them one by one to Kalman filter 5; at the same time it sends synchronizing signal 10 to solver 2 so that the solver supplies the value of the observed quantity 11 to Kalman filter 5 simultaneously with the action block. Kalman filter 5 sequentially receives pairs of values {observed signal 11; possible action 12.2} and calculates the possible reinforcement 15 for each possible action. After calculating the reinforcement, the Kalman filter passes reinforcement value 15 to action selection block 7, which also receives from action block 4 the action 12.1 that was used by Kalman filter 5 when calculating the unobserved signal (the reinforcement). The current parameters of Kalman filter 5 are also saved in Kalman filter memory 6. Action selection block 7 selects an action according to the "greedy" rule, which can be stated as follows: with probability (1-ε), the action with the maximum reinforcement value is selected, where 0 < ε << 1 (Sutton R., Barto A. Reinforcement Learning: An Introduction. - Cambridge: MIT Press, 1998). After the action is selected, signal 17 is sent to control object 1, recorded in action block 4, and also sent to Kalman filter memory 6, which restores the parameters of Kalman filter 5 for the selected action.
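The selection rule can be sketched in a few lines. The text states only the (1-ε) branch; choosing uniformly at random among the candidates in the remaining ε case is the usual form of the rule in the cited book and is an assumption here, as are the function name and the dictionary input.

```python
import random


def epsilon_greedy(predictions, epsilon=0.05, rng=random):
    """Pick an action from a dict mapping action -> predicted reinforcement."""
    if not predictions:
        raise ValueError("no candidate actions")
    # With probability epsilon: explore (assumed to be a uniform random choice).
    if rng.random() < epsilon:
        return rng.choice(list(predictions))
    # With probability (1 - epsilon): take the action with the largest prediction.
    return max(predictions, key=predictions.get)


print(epsilon_greedy({"heat": 0.9, "idle": 0.1}, epsilon=0.0))  # always "heat"
```

Setting `epsilon=0.0` gives purely greedy behaviour, since `rng.random()` never returns a value below zero.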

Thus, the device operates according to the following algorithm (the numbers denote signals only):

1. The control object calculates its state signal 8 (both from information about the external environment and from its own indicators) and feeds it via 8.1 to the solver, via 8.2 to the action block and via 8.3 to the reinforcement calculation block.

2. The solver calculates the observed parameter 11 of the system.

3. The reinforcement calculation block calculates the resulting (actual) reinforcement 9 and feeds its value via 9.1 to the action block and via 9.2 to the Kalman filter.

4. The action block, taking into account the most recently received reinforcement 9.1, selects the actions 12 possible in the given situation.

5. The action block sequentially feeds the selected actions via 12.2 to the Kalman filter and sends synchronizing signal 10 to the solver, upon which the solver synchronously feeds the observed parameter 11 to the Kalman filter.

6. On the first calculation request 12.2, before starting work, the Kalman filter saves its parameters via 13 in the Kalman filter memory.

7. The Kalman filter sequentially receives pairs of values {observed signal 11; possible action 12.2} and calculates the reinforcement prediction (unobserved signal) 15. The Kalman filter operates in the standard way.

8. After calculating the reinforcement prediction for each possible action 12.2, the Kalman filter saves its parameters in the Kalman filter memory via 13 and passes the predicted reinforcement value 15 to the action selection block.

9. The action selection block accumulates pairs of values {possible action 12.1; predicted reinforcement 15}.

10. Once the reinforcements for all possible actions have been calculated, the action block sends signal 12.1 to the action selection block to indicate that prediction is complete. On receiving this signal, the action selection block selects an action according to the "greedy rule" and sends it via 17.2 to the control object, via 17.1 to the action block and via 16 to the Kalman filter memory. The predicted reinforcement for the selected action is also sent to the action block via 17.1.

11. The action block stores the selected action 17.1, the predicted reinforcement 17.1, the state of the control object 8.2 and the actual reinforcement 9.1, thereby accumulating a history for the subsequent selection of actions in possible situations.

12. The Kalman filter memory restores, via 14, the Kalman filter parameters for the selected action.

13. The control object executes the supplied action 17.2. The cycle then returns to step 1.
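The cycle above can be illustrated with a toy simulation. This is a simplified reading, not the patented circuit: the one-dimensional control object, the reinforcement formula, the situation labels, the random-walk scalar filter and the storage of one (estimate, variance) pair per (situation, action) are all assumptions introduced for the example.

```python
import random


def kalman_update(x, p, z, q=1e-3, r=0.1):
    """One scalar Kalman step (random-walk model): returns new (estimate, variance)."""
    p_pred = p + q
    k = p_pred / (p_pred + r)
    return x + k * (z - x), (1.0 - k) * p_pred


def run(steps=50, target=5.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    actions = (-1.0, 0.0, 1.0)  # action block: same candidates in every situation
    memory = {}                 # Kalman filter memory: (situation, action) -> (x, p)
    state = 0.0                 # control object state (signal 8)
    for _ in range(steps):
        error = state - target  # solver: observed parameter (signal 11)
        situation = (error > 0) - (error < 0)  # -1 below, 0 at, +1 above target
        # Predicted reinforcement (signal 15) for each candidate action.
        # The optimistic initial estimate 1.0 makes every action get tried once.
        predicted = {a: memory.get((situation, a), (1.0, 1.0))[0] for a in actions}
        # Action selection block: epsilon-greedy choice (signal 17).
        if rng.random() < epsilon:
            action = rng.choice(actions)
        else:
            action = max(predicted, key=predicted.get)
        state += action         # control object executes the action
        # Reinforcement calculation block: actual reinforcement (signal 9),
        # defined here as the reduction of the distance to the target.
        actual = abs(error) - abs(state - target)
        # Update the stored filter parameters for the chosen action.
        x, p = memory.get((situation, action), (1.0, 1.0))
        memory[(situation, action)] = kalman_update(x, p, actual)
    return state


print(run(steps=50, epsilon=0.0))
```

With exploration switched off (`epsilon=0.0`) the toy state settles at the target after each action has been tried once in each situation; with a small positive `epsilon` it occasionally steps away and returns.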

Claims

A modified intelligent controller containing a control object and a solver, the output of the control object being connected to the first input of the solver, characterized in that it additionally includes a reinforcement calculation block, an action block, a Kalman filter, a Kalman filter memory and an action selection block, wherein the output of the control object is also connected to the input of the reinforcement calculation block and to the first input of the action block, the first output of the action block is connected to the second input of the solver, the second output of the action block is connected to the third input of the Kalman filter and to the first input of the action selection block, the output of the reinforcement calculation block is connected to the second input of the action block and to the second input of the Kalman filter, the output of the solver is connected to the first input of the Kalman filter, the first output of the Kalman filter is connected to the second input of the action selection block, the second output of the Kalman filter is connected to the first input of the Kalman filter memory, the output of the Kalman filter memory is connected to the fourth input of the Kalman filter, the first output of the action selection block is connected to the input of the control object and to the third input of the action block, and the second output of the action selection block is connected to the second input of the Kalman filter memory.