RU2755339C1

RU2755339C1 - Modified intelligent controller with adaptive critical element

Info

Publication number: RU2755339C1
Application number: RU2020141730A
Authority: RU
Inventors: Евгений Александрович Шумков
Priority date: 2020-12-16
Filing date: 2020-12-16
Publication date: 2021-09-15

Abstract

FIELD: computer technology.

SUBSTANCE: the invention relates to computer technology. The result is achieved in that the modified intelligent controller with an adaptive critic contains a reinforcement calculation block, a temporal difference calculation block, a critic block, a neural network training block, a decision neural network, an action selection block, an action block, and a control object.

EFFECT: improved adaptive properties and higher speed of the control system.

1 cl, 1 dwg

Description

The invention relates to intelligent controllers that use reinforcement learning and artificial neural networks, and it can be used to control complex systems in a non-deterministic environment.

Patent No. 2523218 (IPC G06F 15/18), "Modified intelligent controller with adaptive critic", is known. That device consists of a reinforcement calculation block, a temporal difference calculation block, a critic training block, a critic block, a decision neural network, an action selection block, an action screening block, an action block, an action recording block, and a control object. The first and second outputs of the control object are connected to the first and second inputs of the decision neural network, to the first and second inputs of the temporal difference calculation block, and to the first and second inputs of the reinforcement calculation block. The output of the decision neural network is connected to the first input of the critic block, and the output of the critic block is connected to the input of the action selection block. The output of the reinforcement calculation block is connected to the third input of the temporal difference calculation block, and the output of the action selection block is connected to the input of the control object. The first and second outputs of the control object are also connected to the first and second inputs of the action screening block. The first output of the action screening block is connected to the first input of the action block, its second output to the second input of the critic block, its third output to the third input of the decision neural network, and its fourth output to the second input of the action selection block. The output of the action block is connected to the third input of the action screening block. The output of the reinforcement calculation block is also connected to the second input of the action recording block. The output of the temporal difference calculation block is connected to the first input of the critic training block. The first and second outputs of the critic training block are connected to the first and second inputs of the critic block, respectively, and its third output is connected to the fourth input of the temporal difference calculation block. The output of the critic block is also connected to the second input of the critic training block. The output of the action selection block is also connected to the first input of the action recording block, and the output of the action recording block is connected to the second input of the action block.

The device of patent No. 2523218 (IPC G06F 15/18) operates as follows. The control object performs an action and outputs signals describing the state of the control object and of the external environment. From these signals, the action screening block requests and receives from the action block the actions that are possible in the given situation. The decision neural network, having received the new data, predicts the value of the operating parameter for the next time iteration. The action screening block, having received the possible actions, feeds them to the critic block in step with the decision neural network, which supplies the predicted operating parameter. The critic block then computes, one action at a time, the possible future reinforcement for each possible action. These possible future reinforcements are passed to the action selection block, which uses a given algorithm to select the action of the control object for the next time iteration and sends the selected action to the control object. The reinforcement calculation block, having received the current state of the external environment and of the control object, computes the reinforcement obtained for the last completed control iteration. This reinforcement value is passed to the temporal difference calculation block, which computes the temporal difference and forms a training sample for the neural network of the critic block. If the error in the obtained temporal difference is significant, the temporal difference calculation block stops the critic and retrains it on the new data.
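The iteration described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the scalar state and actions, the choice of the highest critic estimate as the "given algorithm", and the fixed threshold that stands in for a "significant" error are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def control_step(
    state: float,
    candidate_actions: Sequence[float],
    predict_parameter: Callable[[float], float],
    critic: Callable[[float, float], float],
) -> float:
    """Pick one action for the next iteration (hypothetical helper).

    predict_parameter stands in for the decision neural network and
    critic for the critic block; both are assumed callables.
    """
    # The decision network forecasts the operating parameter.
    forecast = predict_parameter(state)
    # The critic scores the candidate actions one at a time.
    estimates = [critic(forecast, action) for action in candidate_actions]
    # Assumed selection rule: take the action with the highest estimate.
    best_index = max(range(len(estimates)), key=estimates.__getitem__)
    return candidate_actions[best_index]


def needs_retraining(td_error: float, threshold: float) -> bool:
    """Assumed test for a 'significant' temporal-difference error."""
    return abs(td_error) > threshold
```

In this sketch the critic would be retrained whenever `needs_retraining` returns `True`; how the threshold is chosen is left to the developer.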

The disadvantages of this device are that the decision neural network cannot be retrained and that operating speed is insufficient because of the interaction among the action selection block, the action screening block, the action block, and the action recording block.

The closest technical solution is Russian Federation patent No. 2450336 (IPC G06F 15/00), "Modified intelligent controller with adaptive critic". That device consists of a reinforcement calculation block, a temporal difference calculation block, a critic training block, a critic block, a decision neural network, an action selection block, an action screening block, an action block, an action recording block, and a control object. The first and second outputs of the control object are connected to the first and second inputs of the decision neural network, to the first and second inputs of the temporal difference calculation block, and to the first and second inputs of the reinforcement calculation block. The output of the decision neural network is connected to the first input of the critic block, and the output of the critic block is connected to the input of the action selection block. The output of the reinforcement calculation block is connected to the third input of the temporal difference calculation block, and the output of the action selection block is connected to the input of the control object. The first and second outputs of the control object are also connected to the first and second inputs of the action screening block. The first output of the action screening block is connected to the first input of the action block, its second output to the second input of the critic block, its third output to the third input of the decision neural network, and its fourth output to the second input of the action selection block. The output of the action block is connected to the third input of the action screening block. The output of the reinforcement calculation block is also connected to the second input of the action recording block. The output of the temporal difference calculation block is connected to the first input of the critic training block. The first and second outputs of the critic training block are connected to the first and second inputs of the critic block, respectively, and its third output is connected to the fourth input of the temporal difference calculation block. The output of the critic block is also connected to the second input of the critic training block. The output of the action selection block is also connected to the first input of the action recording block, and the output of the action recording block is connected to the second input of the action block.

The device of patent No. 2450336 operates as follows. The control object outputs state and action signals, from which the action block selects the actions possible in the given situation and feeds them to the critic block in parallel with the predicted value of the operating parameter, which is computed by the parameter prediction block. The critic, receiving these data, evaluates the consequences of the possible actions one at a time and passes the evaluations to the action selection block, which uses the "greedy" rule to select an action and sends it to the control object for execution. In parallel with this process, the reinforcement calculation block computes the reinforcement obtained and passes it to the temporal difference calculation block, which computes the temporal-difference error. If the error is significant, the temporal difference calculation block stops the critic and retrains it on the new data.

The disadvantages of this controller are insufficient adaptive properties, the difficulty of training the neural network of the critic block and the decision neural network, and the limited capabilities of the action selection block.

A common disadvantage of devices based on adaptive critic networks is that the basic approach is not general enough to build a universal adaptive control system for an object operating in a non-deterministic environment. The control system cannot radically change its behavior or develop new responses when it encounters completely new, unknown data about the state of the environment and of the control object (D. Prokhorov, D. Wunsch. Adaptive critic designs. IEEE Transactions on Neural Networks, September 1997, pp. 997-1007). Because the system must operate in real time, further disadvantages are the large amount of computation and the difficulty of additional training of the neural networks.

The objective is to improve the modified intelligent controller with adaptive critic and to extend its functionality.

The technical result of the proposed device is improved adaptive properties of the control system based on the intelligent controller, higher operating speed, and a simpler final implementation for the developer.

The technical result is achieved as follows. The modified intelligent controller with adaptive critic contains a reinforcement calculation block, a temporal difference calculation block, a critic block, a decision neural network, an action screening block, an action block, an action selection block, and a control object. The first and second outputs of the control object are connected to the first and second inputs of the decision neural network, to the first and second inputs of the temporal difference calculation block, to the first and second inputs of the action screening block, and to the first and second inputs of the reinforcement calculation block. The first output of the reinforcement calculation block is connected to the third input of the temporal difference calculation block. The first output of the action selection block is connected to the first input of the control object. The second output of the critic block is connected to the first input of the action selection block. The first output of the decision neural network is connected to the first input of the critic block. The second output of the action block is connected to the fifth input of the action screening block. The third output of the action screening block is connected to the first input of the action block, its first output to the third input of the decision neural network, and its second output to the second input of the critic block. A neural network training block is introduced into this controller, together with the following connections. The first output of the reinforcement calculation block is also connected to the fifth input of the action block. The first output of the temporal difference calculation block is connected to the fourth input of the action block and to the first input of the neural network training block, and its second output is connected to the third input of the critic block. The first output of the critic block is connected to the fourth input of the temporal difference calculation block, its second output is also connected to the second input of the neural network training block, and its third output is connected to the third input of the action screening block. The first output of the neural network training block is connected to the first input of the critic block and to the second input of the decision neural network. Its second output is connected to the second input of the critic block and to the first input of the decision neural network. Its third output is connected to the third input of the action block, and its fourth output is connected to the fourth input of the action screening block. The first output of the decision neural network is also connected to the fourth input of the neural network training block. The second output of the action screening block is also connected to the second input of the action selection block. The first output of the action block is connected to the fifth input of the temporal difference calculation block, and its third output is connected to the third input of the neural network training block. The first output of the action selection block is also connected to the second input of the action block.

The adaptive properties of the control system based on the intelligent controller are improved by moving the training of the critic block's neural network and of the decision neural network into the neural network training block, which trains both networks. Another important point is that work with the action block follows a new principle that involves the action screening block, the action selection block, the neural network training block, the temporal difference calculation block, and the reinforcement calculation block. The speed of the system is increased by the action screening block, which excludes possible actions that do not meet the specified minimum reinforcement, and by the direct access to the action block that the action screening block, the critic block, the reinforcement calculation block, and the temporal difference calculation block now have. Implementation is simplified for the developer by the reworked interaction of the reinforcement calculation block, the temporal difference calculation block, and the action selection block with the action block, and by the separation of the training of the critic block's neural network and of the decision neural network into a dedicated block.
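The screening step, in which actions that fall short of a minimum reinforcement are dropped before further evaluation, can be sketched in Python as follows. The function name, the dictionary of per-action reinforcement estimates, and the inclusive comparison are assumptions for illustration; the patent does not specify how the minimum is applied.

```python
from typing import Dict, List


def screen_actions(estimates: Dict[str, float], min_reinforcement: float) -> List[str]:
    """Keep only actions whose estimated reinforcement meets the minimum.

    estimates maps a hypothetical action name to its stored estimate.
    Treating the minimum as inclusive (>=) is an assumption.
    """
    return [
        action
        for action, reinforcement in estimates.items()
        if reinforcement >= min_reinforcement
    ]
```

Fewer candidates then reach the critic, which is the mechanism the description credits for the speed gain.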

Thus, the set of essential features set forth in the claims makes it possible to achieve the desired result.

FIG. 1 shows a diagram of the modified intelligent controller with adaptive critic, which consists of the following structural components: reinforcement calculation block 1, temporal difference calculation block 2, critic block 3, action selection block 4, neural network training block 5, decision neural network 6, action screening block 7, action block 8, and control object 9.

The system also contains the following links. Links 10.1 and 11.1 run from control object 9 to reinforcement calculation block 1; links 10.2 and 11.2 carry signals from control object 9 to temporal difference calculation block 2; links 10.3 and 11.3 carry signals from control object 9 to decision neural network 6; and links 10.4 and 11.4 carry signals from control object 9 to action screening block 7. Link 12.1 carries a signal from reinforcement calculation block 1 to temporal difference calculation block 2, and reinforcement calculation block 1 and action block 8 are connected by signal 12.2. Link 13 runs from action block 8 to temporal difference calculation block 2, and link 14 runs from critic block 3 to temporal difference calculation block 2. Link 15 carries a signal from temporal difference calculation block 2 to critic block 3. From temporal difference calculation block 2, link 16.1 runs to neural network training block 5 and link 16.2 to action block 8. From critic block 3, link 17.1 runs to action selection block 4 and link 17.2 to neural network training block 5. From action screening block 7, link 18.1 runs to critic block 3 and link 18.2 to action selection block 4. From decision neural network 6, link 19.1 runs to critic block 3 and link 19.2 to neural network training block 5. From neural network training block 5, link 20.1 runs to critic block 3 and link 20.2 to decision neural network 6. Link 21 carries a signal from action block 8 to neural network training block 5, and link 22 runs from neural network training block 5 to action block 8. A signal runs from decision neural network 6 to action screening block 7 over link 26. From neural network training block 5, signals run over link 24.1 to critic block 3 and over link 24.2 to decision neural network 6. Critic block 3 and action screening block 7 are connected by link 25, and action screening block 7 and decision neural network 6 are connected by link 26. Link 27 carries a signal from action block 8 to action screening block 7, and signal 28 runs from action screening block 7 to action block 8. Action selection block 4 is connected to action block 8 by link 29.1 and to control object 9 by link 29.2.

Reinforcement calculation block 1 calculates the reinforcement for the operation of the intelligent controller. The reinforcement formula is specified by the developer.

Temporal difference calculation block 2 calculates the temporal difference (Sutton R., Barto A. "Reinforcement Learning". BINOM: Knowledge Laboratory. 2012. 399 p.).
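As an illustration of what block 2 computes, the following is a minimal sketch of the standard one-step temporal (time) difference error from the cited textbook. The description does not state which form of the temporal difference is used, so the formula, the function name `td_error` and the discount factor `gamma` are assumptions made for this example only.

```python
def td_error(reward: float, value_next: float, value_current: float,
             gamma: float = 0.9) -> float:
    """One-step temporal difference error (textbook form, assumed here).

    reward        -- reinforcement received for the last control iteration
    value_next    -- critic's estimate for the situation that followed
    value_current -- critic's estimate for the situation before the action
    gamma         -- discount factor (hypothetical parameter, not in the source)
    """
    return reward + gamma * value_next - value_current


if __name__ == "__main__":
    # The critic expected 1.5, received reinforcement 1.0 and now predicts 2.0.
    print(td_error(reward=1.0, value_next=2.0, value_current=1.5, gamma=0.5))
```

A large error means the critic's forecast disagreed strongly with the reinforcement actually received.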

Critic block 3 calculates the predicted quality of the situation that follows when a particular action is chosen. The situation quality is computed by a layered, fully connected feedforward neural network (multilayer perceptron).
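The forward pass of such a multilayer perceptron can be sketched as follows. The layer sizes, the tanh hidden activation, the linear output layer and the name `mlp_forward` are all assumptions for illustration; the description specifies only that the network is layered, fully connected and feedforward.

```python
import math
from typing import List, Sequence, Tuple

# One layer is (weights, biases); weights[j] holds the input weights of neuron j.
Layer = Tuple[Sequence[Sequence[float]], Sequence[float]]


def mlp_forward(inputs: Sequence[float], layers: Sequence[Layer]) -> List[float]:
    """Propagate inputs through fully connected layers.

    Hidden layers use tanh, the last layer is linear (both are assumptions).
    """
    activation = list(inputs)
    for index, (weights, biases) in enumerate(layers):
        pre_activation = [
            sum(w * a for w, a in zip(row, activation)) + b
            for row, b in zip(weights, biases)
        ]
        is_output_layer = index == len(layers) - 1
        activation = (pre_activation if is_output_layer
                      else [math.tanh(v) for v in pre_activation])
    return activation


if __name__ == "__main__":
    # Inputs could be an encoded action and the forecast operating parameter.
    hidden = ([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.0])
    output = ([[1.0, 1.0]], [0.0])
    print(mlp_forward([1.0, 2.0], [hidden, output]))
```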

Action choice block 4 chooses a specific action from all those possible in the given situation. The choice uses the so-called "greedy rule" (Sutton R., Barto A. "Reinforcement Learning". BINOM: Knowledge Laboratory. 2012. 399 p.), which can be stated as: "with probability ε (0<ε≤1), the action with the maximum situation quality value is chosen".
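A minimal sketch of this rule follows. It uses the wording above, in which ε is the probability of taking the action with the maximum quality; many texts instead use ε for the probability of the random choice. What happens in the remaining cases is not stated in the description, so the uniformly random choice below, like the name `choose_action`, is an assumption.

```python
import random
from typing import Dict, Hashable, Optional


def choose_action(qualities: Dict[Hashable, float], epsilon: float,
                  rng: Optional[random.Random] = None) -> Hashable:
    """Pick an action from {action: quality} pairs.

    With probability epsilon the action with the maximum quality is taken;
    otherwise a uniformly random action is taken (assumed fallback).
    """
    if not qualities:
        raise ValueError("no actions to choose from")
    if not 0.0 < epsilon <= 1.0:
        raise ValueError("epsilon must satisfy 0 < epsilon <= 1")
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return max(qualities, key=qualities.get)
    return rng.choice(list(qualities))


if __name__ == "__main__":
    print(choose_action({"open_valve": 0.2, "close_valve": 0.7}, epsilon=0.9))
```

With `epsilon=1.0` the rule is purely greedy, because `random()` always returns a value below 1.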

Neural network training block 5 trains the neural network of the critic and the decision neural network.

Decision neural network 6 predicts the next value of the system's operating parameter (there may be several operating parameters). An operating parameter is a system parameter by which the system can evaluate how well it is working, or one that serves as a reference point for the system's operation.

Action selection block 7 selects all actions possible in the given situation, taking into account the minimum accumulated reinforcement for each possible action.

Action block 8 stores a table of the possible actions in all possible situations, the operating history of the control object (situation -> action), and the reinforcement accumulated when a particular action is performed in a particular situation.
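The storage role of block 8 and the filtering role of block 7 can be sketched together. The class and function names are hypothetical, and the reading of "minimum accumulated reinforcement" as a lower threshold that an action must reach to stay in the candidate list is an assumption; the description does not define the criterion precisely.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


class ActionBlock:
    """Illustrative store: action table, history and accumulated reinforcement."""

    def __init__(self, action_table: Dict[Hashable, Sequence[Hashable]]):
        self.action_table = dict(action_table)
        self.history: List[Tuple[Hashable, Hashable]] = []
        self.accumulated: Dict[Tuple[Hashable, Hashable], float] = {}

    def possible_actions(self, situation: Hashable) -> List[Hashable]:
        return list(self.action_table.get(situation, []))

    def record(self, situation: Hashable, action: Hashable,
               reinforcement: float) -> None:
        """Append (situation -> action) to the history and add the reinforcement."""
        key = (situation, action)
        self.history.append(key)
        self.accumulated[key] = self.accumulated.get(key, 0.0) + reinforcement


def select_actions(block: ActionBlock, situation: Hashable,
                   min_accumulated: float) -> List[Hashable]:
    """Keep actions whose accumulated reinforcement is at least the threshold
    (assumed interpretation of the selection criterion)."""
    return [
        action for action in block.possible_actions(situation)
        if block.accumulated.get((situation, action), 0.0) >= min_accumulated
    ]


if __name__ == "__main__":
    block = ActionBlock({"s0": ["left", "right"]})
    block.record("s0", "left", -2.0)
    block.record("s0", "right", 1.0)
    print(select_actions(block, "s0", min_accumulated=-1.0))
```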

The claimed device operates as follows.

1. Control object 9 performs an action, and its outputs form the state signals of the control object 10 and of the external environment 11. These are fed to action selection block 7 via connections 10.4 and 11.4 respectively, to decision neural network 6 via connections 10.3 and 11.3, to reinforcement calculation block 1 via connections 10.1 and 11.1, and to temporal difference calculation block 2 via connections 10.2 and 11.2.

2. When new data arrive from control object 9 as the object state signal 10.4 and the external environment signal 11.4, action selection block 7 requests from action block 8, via connection 28, the actions possible in the given situation and receives them via connection 27. Having received the possible actions, action selection block 7, synchronously with decision neural network 6, begins to feed pairs of values {possible action; operating parameter forecast} to critic block 3 via connection 18.1 (the possible action, from action selection block 7) and connection 19.1 (the operating parameter forecast, from decision neural network 6). Action selection block 7 supplies the different actions one after another, whereas decision neural network 6 supplies only the single computed forecast of the operating parameter. Action selection block 7 is synchronized with decision neural network 6 via connection 26: action selection block 7 waits until decision neural network 6 outputs the operating parameter forecast.

3. Decision neural network 6, having received the new state values of the control object and the external environment via connections 10.3 and 11.3 respectively, calculates the predicted value of the operating parameter for the next time iteration. After calculating it, decision neural network 6 sends a synchronization signal via connection 26 to action selection block 7 and feeds the calculated value to critic block 3 via connection 19.1, together with the signal via connection 18.1 from action selection block 7, which carries a possible action.

4. Critic block 3, receiving the signals {possible_action; operating_parameter_forecast} via connections 18.1 and 19.1 from action selection block 7 and decision neural network 6, calculates the possible future reinforcement for the supplied action. Action selection block 7 feeds critic block 3 via connection 18.1, one after another, as many actions as are possible in the given situation. Accordingly, critic block 3 calculates as many possible future reinforcement values as there are action options supplied by action selection block 7. After calculating each value, critic block 3 sends a synchronization signal via connection 25 to action selection block 7, indicating that it can accept new data, and in parallel sends the calculated value via connection 17.1 to action choice block 4.

5. Action choice block 4 stores all the values {possible action; action quality} that arrive at it and, using the ε-greedy rule, chooses the current action and sends it via connection 29.2 to control object 9. The chosen action is also sent to action block 8 via connection 29.1.

6. Reinforcement calculation block 1, receiving the current state values of the control object and the environment via connections 10.1 and 11.1 respectively, calculates by the specified formula the reinforcement obtained for the last completed control iteration. The calculated reinforcement is fed via connection 12.1 to temporal difference calculation block 2, which calculates the current temporal difference. If the temporal difference error exceeds the threshold set by the developer (i.e. the error is large) and the reinforcement being received is decreasing, temporal difference calculation block 2 sends a signal via connection 16.1 to neural network training block 5 to start additional training of critic block 3. Temporal difference calculation block 2 also writes the current temporal difference to action block 8 via connection 16.2.

7. Neural network training block 5, having received the signal via connection 16.1 from temporal difference calculation block 2 to start retraining critic block 3, sends signal 23 to action selection block 7 to suspend action selection; that is, decision neural network 6 and critic block 3 are disabled. Control object 9, reinforcement calculation block 1 and action block 8 continue to operate normally, but control object 9 either takes no action or continues executing the last command from action choice block 4 (depending on the task).

8. Neural network training block 5, having received the signal via connection 16.1 from temporal difference calculation block 2 in the case of a large forecast error of the operating parameter, forms {inputs; outputs} sets by requesting data from action block 8 via connection 22 and receiving them via connection 21, and begins training the neural network of critic block 3. During training, neural network training block 5 feeds the inputs of critic block 3, via connections 20.1 and 24.1, with the values obtained from action block 8 and decision neural network 6, and reads the output of critic block 3 via connection 17.2. Training uses the error backpropagation method. The synaptic connections of the neural network of critic block 3 are adjusted by signal 20.1. When the training error of the neural network of critic block 3 falls below the value set by the developer, neural network training block 5 stops the training and sends a signal via connection 23 to action selection block 7 to resume the operating mode.

9. Neural network training block 5, having received signal 16.1 from temporal difference calculation block 2, also starts retraining decision neural network 6. First, neural network training block 5 sends signal 23 to action selection block 7 to suspend action selection; that is, decision neural network 6 and critic block 3 are disabled. Control object 9, reinforcement calculation block 1 and action block 8 continue to operate normally, but control object 9 either takes no action or continues executing the last command from action choice block 4 (depending on the task). Next, neural network training block 5 requests, by signal 22, a training sample for decision neural network 6 from action block 8 and receives the data by signal 21. Having received the training sample, neural network training block 5 begins training decision neural network 6 using the error backpropagation algorithm. Data are fed to the inputs of decision neural network 6 by signals 20.2 and 24.2, and the output of decision neural network 6 is read by signal 19.2. The synaptic connections of the decision neural network are adjusted by signal 20.2. When the training error of decision neural network 6 falls below the value set by the developer, neural network training block 5 stops training decision neural network 6 and sends a signal to action selection block 7, by signal 26, to resume the operating mode.
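The operating steps above can be condensed into a short sketch: one forecast is computed per iteration and paired with every candidate action for the critic (steps 2 to 4), retraining is triggered when the temporal difference error is large and the reinforcement is falling (step 6), and training runs until the error drops below a target (steps 8 and 9). All function and parameter names are hypothetical. Taking the absolute value of the error and capping the number of epochs with `max_epochs` are assumptions that the description does not make.

```python
from typing import Callable, Dict, Hashable, Iterable, Sequence


def evaluate_actions(state: Sequence[float],
                     possible_actions: Iterable[Hashable],
                     decision_network: Callable[[Sequence[float]], float],
                     critic: Callable[[Hashable, float], float]
                     ) -> Dict[Hashable, float]:
    """Pair every candidate action with the single forecast and score it."""
    forecast = decision_network(state)  # computed once per iteration
    return {action: critic(action, forecast) for action in possible_actions}


def should_retrain(td_error: float, td_threshold: float,
                   reinforcement: float, previous_reinforcement: float) -> bool:
    """Retrain only if the error is large AND the reinforcement is decreasing.

    abs() is an assumption; the source only says the error exceeds a threshold.
    """
    return abs(td_error) > td_threshold and reinforcement < previous_reinforcement


def train_until(train_epoch: Callable[[], float], target_error: float,
                max_epochs: int = 1000) -> bool:
    """Run training epochs until the reported error falls below target_error.

    train_epoch performs one backpropagation pass and returns the error.
    Returns True on convergence, False if max_epochs (assumed cap) is reached.
    """
    for _ in range(max_epochs):
        if train_epoch() < target_error:
            return True
    return False


if __name__ == "__main__":
    scores = evaluate_actions(
        state=(1.0, 2.0),
        possible_actions=["heat", "cool"],
        decision_network=lambda s: sum(s),            # stand-in forecaster
        critic=lambda a, f: f if a == "heat" else -f,  # stand-in critic
    )
    print(scores)
    print(should_retrain(0.8, 0.5, reinforcement=0.1, previous_reinforcement=0.4))
```

While `train_until` runs, the action-selection path would be suspended, which in a real controller means the control object idles or repeats its last command.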

Claims

A modified intelligent controller with an adaptive critic, containing a reinforcement calculation block, a temporal difference calculation block, a critic block, a decision neural network, an action choice block, an action block, an action selection block and a control object, wherein the first and second outputs of the control object are connected to the first and second inputs of the decision neural network, the first and second inputs of the temporal difference calculation block, the first and second inputs of the action selection block, and the first and second inputs of the reinforcement calculation block; the first output of the reinforcement calculation block is connected to the third input of the temporal difference calculation block; the first output of the action choice block is connected to the first input of the control object; the second output of the critic block is connected to the first input of the action choice block; the first output of the decision neural network is connected to the first input of the critic block; the second output of the action block is connected to the fifth input of the action selection block; the third output of the action selection block is connected to the first input of the action block; the first output of the action selection block is connected to the third input of the decision neural network; the second output of the action selection block is connected to the second input of the critic block; characterized in that a neural network training block is additionally introduced, wherein the first output of the reinforcement calculation block is also connected to the fifth input of the action block; the first output of the temporal difference calculation block is connected to the fourth input of the action block and the first input of the neural network training block; the second output of the temporal difference calculation block is connected to the third input of the critic block; the first output of the critic block is connected to the fourth input of the temporal difference calculation block; the second output of the critic block is also connected to the second input of the neural network training block; the third output of the critic block is connected to the third input of the action selection block; the first output of the neural network training block is connected to the first input of the critic block and the second input of the decision neural network; the second output of the neural network training block is connected to the second input of the critic block and the first input of the decision neural network; the third output of the neural network training block is connected to the third input of the action block; the fourth output of the neural network training block is connected to the fourth input of the action selection block; the first output of the decision neural network is connected to the fourth input of the neural network training block; the second output of the action selection block is connected to the second input of the action choice block; the first output of the action block is connected to the fifth input of the temporal difference calculation block; the third output of the action block is connected to the third input of the neural network training block; and the first output of the action choice block is also connected to the second input of the action block.