RU2450336C1 - Modified intelligent controller with adaptive critic

Info

Publication number: RU2450336C1
Application number: RU2011100326/08A
Authority: RU
Inventors: Владимир Игнатьевич Ключко (RU); Евгений Александрович Шумков (RU)
Priority date: 2011-01-11
Filing date: 2011-01-11
Publication date: 2012-05-10

Abstract

FIELD: information technology.

SUBSTANCE: the modified intelligent controller with an adaptive critic comprises a control object, a critic unit, a decision neural network, an action unit, a time difference calculating unit, a reinforcement calculating unit and an action selection unit, together with the connections between them.

EFFECT: improved adaptation properties of the device owing to reconfiguration of the critic during operation of the device.

1 dwg

Description

The invention relates to the class of intelligent controllers that use the principle of reinforcement learning and can be used to build control systems for objects operating in a non-deterministic environment.

A known US patent, IPC G06F 15/18, 6532454, "Stable adaptive control using critic designs", implements reinforcement learning using neural networks. The device according to this patent consists of a decision neural network and a modeling neural network, a critic block, a prediction error calculation block, and the connections between the blocks.

The device according to patent IPC G06F 15/18, 6532454 operates as follows: the decision neural network receives the reinforcement value, calculates the action for the current iteration and passes it to the modeling neural network, which calculates the predicted value of the working parameter of the system. After the action is performed, the system receives the actual value of the working parameter, the critic calculates a new reinforcement value, and the operation of the modeling neural network is corrected.

An intelligent controller based on adaptive critic networks is also known: US patent IPC G06F 15/18, 5448681. This device consists of a control object, a critic block and a decision neural network. The outputs of the control object are connected to the inputs of the critic block and to the inputs of the decision neural network; the output of the decision neural network is connected to the control object and to the critic block; the output of the critic network is connected to the input of the decision neural network.

The device according to patent IPC G06F 15/18, 6532454 operates as follows: the control object outputs a signal describing its state, the critic block calculates the quality of the selected action for the current time iteration and object state, and the decision neural network calculates the control action.

The disadvantages of the devices according to patent IPC G06F 15/18, 6532454 are that they do not store the history of the system's operation and that the critic works with its initially configured parameters.

The disadvantages of the devices according to patent IPC G06F 15/18, 4563746 are the absence of a block for storing the history of the system's operation and low adaptive properties due to the rigidly fixed operating principle of the critic block.

The technical result of the proposed device is improved adaptive properties, achieved by reconfiguring the critic during operation of the device.

The objective is to develop a modified intelligent controller with an adaptive critic in which the critic can be reconfigured during operation of the device.

The technical result is achieved in that, in a modified intelligent controller with an adaptive critic comprising a control object, a critic block and a decision neural network, in which the first output of the control object is connected to the first input of the decision neural network, the second output of the control object is connected to the second input of the decision neural network, and the output of the decision neural network is connected to the first input of the critic block, there are additionally introduced an action block, a time difference calculation block, a reinforcement calculation block and an action selection block. The first output of the control object is also connected to the first input of the action block, the first input of the time difference calculation block and the first input of the reinforcement calculation block; the second output of the control object is also connected to the second input of the action block, the second input of the time difference calculation block and the second input of the reinforcement calculation block; the output of the action block is connected to the second input of the critic block; the first and second outputs of the time difference calculation block are connected to the first and second inputs of the critic block, and its third output is connected to the output of the critic block; the output of the reinforcement calculation block is connected to the third input of the time difference calculation block; the output of the critic block is connected to the fourth input of the time difference calculation block and also to the input of the action selection block; the first output of the action selection block is connected to the third input of the action block, and the second output of the action selection block is connected to the input of the control object.

The adaptive properties are improved because a time difference calculation block and a reinforcement calculation block are added to the modified intelligent controller with an adaptive critic. These blocks calculate the time difference and the reinforcement, respectively, and the time difference calculation block also further trains the critic while the system is running. In addition, a block is introduced that selects an action from the possible ones after they have been processed by the critic block. To retain previous results, an action block is added to the device; it stores the history of the system's operation and selects the actions possible in a specific situation.

Thus, the set of features set forth in the claims makes it possible to achieve the desired technical result.

Figure 1 shows a diagram of the modified intelligent controller with an adaptive critic.

The system consists of several structural components: control object 1, action block 2, decision neural network 3, critic block 4, time difference calculation block 5, reinforcement calculation block 6 and action selection block 7.

The system also contains the following connections. The control object has an output 8 carrying the state of the control object, which is connected to the action block via input 8.1, to the decision neural network via input 8.2, to the time difference calculation block via input 8.3 and to the reinforcement calculation block via input 8.4. The control object also outputs a signal 9 carrying the state of the environment, which is connected to the action block via input 9.1, to the decision neural network via input 9.2, to the time difference calculation block via input 9.3 and to the reinforcement calculation block via input 9.4. Connection 10 runs from the action block to the critic block. Connection 11 runs from the decision neural network to the critic block. The output of the critic block is connected to the input of the action selection block by signal 12 and to the input of the time difference calculation block by signal 13. The outputs of the time difference calculation block are connected to the inputs of the critic by signals 14 and 15 and to its output by signal 16. The reinforcement calculation block sends signal 17 to the time difference calculation block. The first output of the action selection block is connected to the input of the action block by signal 18, and the second to the control object by signal 19.

Action block 2 stores a table of the possible actions for all possible situations and selects the actions that are possible in the given situation.
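The role of the action block can be illustrated with a minimal sketch. The class name `ActionBlock`, its method names and the example situations are assumptions made for illustration only; the patent does not specify how the table or the history is stored.

```python
class ActionBlock:
    """Hypothetical sketch of action block 2: an action table plus a history log."""

    def __init__(self, action_table):
        # action_table maps a situation key to the list of actions allowed in it.
        self.action_table = dict(action_table)
        # Each history entry is (object_state, environment_state, chosen_action).
        self.history = []

    def possible_actions(self, situation):
        # Return the actions available in this situation (empty list if unknown).
        return list(self.action_table.get(situation, []))

    def record(self, object_state, environment_state, action):
        # Store what was observed and which action was finally chosen.
        self.history.append((object_state, environment_state, action))


# Illustrative data only: two situations with two allowed actions each.
block = ActionBlock({"low": ["heat", "hold"], "high": ["cool", "hold"]})
block.record("low", "cold", "heat")
```

Here a plain dictionary serves as the table and a list as the history; a real device could use any storage that supports lookup by situation.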

Decision neural network 3 predicts the next value of the working parameter of the system (or of several parameters). The working parameter is the system parameter by which the system can judge how well it is operating, or a parameter that serves as a reference point for the operation of the system (there may be several working parameters).

Critic block 4 calculates the quality V(t) of the situation that follows when a particular action is selected.

Time difference calculation block 5 calculates the time difference using the formula:

δ(t) = r(t) + γ·V(t) - V(t-1),

where γ ∈ (0; 1] is the forgetting coefficient.
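As a worked illustration of the formula, the following sketch evaluates δ(t) for given numbers. The function name `td_error` and the sample values are assumptions chosen for the example.

```python
def td_error(r_t, v_t, v_prev, gamma=0.9):
    """Time difference: delta(t) = r(t) + gamma * V(t) - V(t-1)."""
    # The description restricts the forgetting coefficient to the interval (0, 1].
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return r_t + gamma * v_t - v_prev


# Illustrative values: r(t) = 1.0, V(t) = 2.0, V(t-1) = 2.5, gamma = 0.5.
delta = td_error(r_t=1.0, v_t=2.0, v_prev=2.5, gamma=0.5)
```

With these values, δ(t) = 1.0 + 0.5·2.0 - 2.5 = -0.5, meaning the previous quality estimate was higher than the reinforcement and the new estimate justify.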

Reinforcement calculation block 6 calculates the reinforcement r(t). The formula for calculating the reinforcement is specified by the developer.

Action selection block 7 selects a specific action from all those possible in the given situation. The selection uses the so-called "ε-greedy rule" (Sutton R., Barto A. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998), which can be stated as follows: with probability (1-ε), the action corresponding to the maximum value of the situation quality is selected, where 0 < ε << 1.
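The selection rule can be sketched as follows. The text states only the greedy branch; choosing uniformly at random among all candidates in the remaining cases is an assumption that follows the usual textbook form of the rule. The function name `epsilon_greedy` and the pair layout are also illustrative.

```python
import random


def epsilon_greedy(scored_actions, epsilon=0.05, rng=random):
    """Pick an action from a list of (action, quality) pairs.

    With probability (1 - epsilon) the action with the highest quality is
    returned; otherwise (assumed here) a uniformly random action is returned.
    """
    if not scored_actions:
        raise ValueError("no actions to choose from")
    if rng.random() < epsilon:
        # Exploration branch: any candidate may be chosen.
        return rng.choice(scored_actions)[0]
    # Greedy branch: the candidate with the maximum quality value.
    return max(scored_actions, key=lambda pair: pair[1])[0]


# Illustrative candidates as they might arrive from the critic block.
candidates = [("hold", 0.1), ("heat", 0.9)]
greedy_choice = epsilon_greedy(candidates, epsilon=0.0)
```

Setting `epsilon=0.0` disables exploration, so the call above always returns the highest-quality action.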

The intelligent controller operates as follows. Control object 1 performs an action and produces at its outputs the state signals of the control object 8 and of the environment 9. These signals go to the following blocks: action block 2 (signals 8.1 and 9.1), decision neural network 3 (signals 8.2 and 9.2), time difference calculation block 5 (signals 8.3 and 9.3) and reinforcement calculation block 6 (signals 8.4 and 9.4). Action block 2 stores the state values of the environment and of the control object, as well as the control signal 18 for the current iteration, which comes from action selection block 7.

Decision neural network 3, receiving the state values of the control object 8.2 and of the environment 9.2, predicts the next value of the working parameter 11 and feeds it to the input of critic block 4. All actions that the object can perform in the current situation are also fed to the critic block one after another; this signal travels over connection 10 from action block 2. For each pair {possible action; predicted value of the working parameter} in turn, the critic block outputs a quality value V, which together with the possible action goes to the time difference calculation block 16 and to the action selection block 12. Action selection block 7 stores all the pairs {possible action; quality of action} it receives and, based on the ε-greedy rule, selects the current action 19 and sends it to control object 1. The selected action 18 is also sent to action block 2.

While the action selection block selects the action and the control object executes it, reinforcement calculation block 6 calculates the current reinforcement value 17, which is passed to time difference calculation block 5. The time difference calculation block in turn calculates the time difference value and, if necessary, retrains the neural network of the critic block.
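One iteration of the described cycle can be sketched in simplified form. All callables passed to `control_step` are hypothetical stand-ins for the blocks of the device, and taking V(t) to be the quality of the action actually selected is an assumption; the description does not state this explicitly.

```python
def control_step(state, env, actions, predict, critic, reward, select,
                 v_prev, gamma=0.9):
    """One controller iteration; every callable is a hypothetical stand-in."""
    # Decision neural network 3 predicts the next working parameter value.
    predicted = predict(state, env)
    # Critic block 4 scores each {possible action; predicted value} pair in turn.
    scored = [(a, critic(a, predicted)) for a in actions]
    # Action selection block 7 picks the action sent to the control object.
    action = select(scored)
    # Assumption: V(t) is the quality of the action that was selected.
    v_t = dict(scored)[action]
    # Reinforcement calculation block 6 produces r(t).
    r_t = reward(state, env)
    # Time difference calculation block 5 evaluates delta(t).
    delta = r_t + gamma * v_t - v_prev
    return action, v_t, delta


# Toy stand-ins, chosen only so the arithmetic is easy to follow.
action, v_t, delta = control_step(
    state=1.0, env=1.0, actions=["up", "down"],
    predict=lambda s, e: s + e,
    critic=lambda a, p: p * (1.0 if a == "up" else 0.5),
    reward=lambda s, e: 1.0,
    select=lambda scored: max(scored, key=lambda pair: pair[1])[0],
    v_prev=1.0, gamma=0.5,
)
```

In this toy run the prediction is 2.0, the qualities are 2.0 for "up" and 1.0 for "down", so "up" is selected and δ(t) = 1.0 + 0.5·2.0 - 1.0 = 1.0.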

The decision neural network is a standard multilayer perceptron trained by error backpropagation. The critic block is likewise a standard multilayer perceptron trained by error backpropagation. The parameters of the neural networks are chosen depending on the problem being solved.

During operation, the critic block is trained on the changed and newly received time difference as follows: the time difference calculation block supplies the stored pairs {control signal; predicted value of the working parameter} via signals 14 and 15, and supplies the desired output value via signal 16. Training by error backpropagation continues until the error of the critic's neural network falls below a specified value, while the actual output value of the critic's neural network is passed to the time difference calculation block via signal 15 (Rumelhart D.E., Hinton G.E., Williams R.J., "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, 1986).
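The retraining procedure can be sketched with a very small perceptron. The class `TinyCritic`, the function `retrain_critic`, the network size, the learning rate, the tolerance and the numeric encoding of the control signal are all assumptions for illustration; the description leaves the network parameters to the developer. A cap on the number of epochs is added as a safeguard that the text does not mention.

```python
import math
import random


class TinyCritic:
    """Hypothetical one-hidden-layer perceptron standing in for critic block 4."""

    def __init__(self, n_in=2, n_hidden=4, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        # tanh hidden layer followed by a linear output unit.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        return sum(w * hi for w, hi in zip(self.w2, h)) + self.b2, h

    def backprop(self, x, target, lr):
        # One gradient step on the squared error for a single sample.
        y, h = self.forward(x)
        err = y - target
        for j, hj in enumerate(h):
            grad_h = err * self.w2[j] * (1.0 - hj * hj)  # uses w2 before update
            self.w2[j] -= lr * err * hj
            self.b1[j] -= lr * grad_h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_h * xi
        self.b2 -= lr * err


def retrain_critic(critic, samples, tol=1e-4, lr=0.1, max_epochs=10000):
    """Repeat backpropagation passes until the mean squared error is below tol."""
    for epoch in range(1, max_epochs + 1):
        for x, target in samples:
            critic.backprop(x, target, lr)
        mse = sum((critic.forward(x)[0] - t) ** 2 for x, t in samples) / len(samples)
        if mse < tol:
            break
    return mse, epoch


# One stored pair {control signal; predicted value} with a desired output 0.7.
critic = TinyCritic()
samples = [([1.0, 0.2], 0.7)]
final_mse, epochs = retrain_critic(critic, samples)
```

The loop stops either when the error falls below the tolerance, as in the description, or when the epoch cap is reached.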

Claims

A modified intelligent controller with an adaptive critic, containing a control object, a critic block and a decision neural network, wherein the first output of the control object is connected to the first input of the decision neural network, the second output of the control object is connected to the second input of the decision neural network, and the output of the decision neural network is connected to the first input of the critic block, characterized in that an action block, a time difference calculation block, a reinforcement calculation block and an action selection block are introduced into it, wherein the first output of the control object is also connected to the first input of the action block, the first input of the time difference calculation block and the first input of the reinforcement calculation block, the second output of the control object is also connected to the second input of the action block, the second input of the time difference calculation block and the second input of the reinforcement calculation block, the output of the action block is connected to the second input of the critic block, the first and second outputs of the time difference calculation block are connected to the first and second inputs of the critic block, and the third output is connected to the output of the critic block, the output of the reinforcement calculation block is connected to the third input of the time difference calculation block, the output of the critic block is connected to the fourth input of the time difference calculation block, the output of the critic block is also connected to the input of the action selection block, the first output of the action selection block is connected to the third input of the action block, and the second output of the action selection block is connected to the input of the control object.