RU2732201C1

RU2732201C1 - Method for constructing processors for output in convolutional neural networks based on data-flow computing

Info

Publication number: RU2732201C1
Application number: RU2020107116A
Authority: RU
Inventors: Антон Викторович Шадрин; Анастасия Александровна Чуприк; Екатерина Владимировна Кондратюк; Виталий Витальевич Михеев; Роман Владимирович Киртаев; Дмитрий Владимирович Негров
Original assignee: Российская Федерация, от имени которой выступает ФОНД ПЕРСПЕКТИВНЫХ ИССЛЕДОВАНИЙ
Priority date: 2020-02-17
Filing date: 2020-02-17
Publication date: 2020-09-14

Abstract

FIELD: physics.

SUBSTANCE: the invention relates to a method of constructing neural network processors for inference in convolutional neural networks, and to a neural network processor. The method comprises placing, on a single integrated circuit of the neural network processor, computing elements located at the nodes of a regular grid, together with shared register files that hold locally stored data. Each shared register file is connected to several neighboring computing elements; each computing element in turn processes data from the shared register files and then stores the output of its computations in those same register files. On one or more sides of the integrated circuit, input-output controllers located on that side are connected to the set of outermost shared register files.

EFFECT: technical result is faster operation of the convolutional neural networks.

10 cl, 3 dwg, 1 tbl

Description

Technical field

The invention relates to the field "Computer systems using knowledge-based models: inference methods or devices". It can be used to speed up statistical inference algorithms that use convolutional neural networks and to reduce the power consumption of systems that run such algorithms.

Prior art

A number of devices designed to accelerate algorithms based on neural networks are known, for example US patent No. US 20110029471 A1. The drawback of that device is that it cannot reuse data locally: intermediate results must be stored in external memory, which leads to excessive power consumption. In addition, its structure targets a particular class of algorithms and cannot accelerate neural networks with complex connections between layers, special layers, or activation functions that the design did not foresee.

The closest prior art to the claimed invention is patent No. WO 2016186811 A1, which proposes a method of computing convolutions with a neural network processor based on a systolic array. That processor contains a direct memory access module; a unified buffer that provides intermediate storage for neural activations; a matrix computation module implemented as a systolic array; a vector computation unit that evaluates activation functions and combines the resulting activations; and a control system that executes a stream of externally supplied instructions and controls the configuration and operation of all units.

The drawback of this method is low programmability, which limits the number of different algorithms the device can execute with it.

Disclosure of the invention

The problem addressed by the invention is to reduce the amount of data transferred to and from external memory in a neural network processor in order to speed up convolutional neural networks, and to optimize power consumption by reusing locally stored data, which reduces power consumption and makes the computing elements of the neural network processor programmable. Programmability makes it possible to execute arbitrary neural network algorithms rather than only a finite set fixed when the processor was designed. In addition, to speed up the tensor operations specific to convolutional neural networks, the neural network processor must contain special units that perform such operations with high efficiency.

The problem is solved by the following means.

The method of constructing neural network processors consists in placing computing elements on a single integrated circuit. The computing elements are located at the nodes of a regular grid, and functionally adjacent computing elements are connected to shared register files through which local data are exchanged within the neural network processor. Storage and exchange of local data are managed explicitly so that operation is deterministic: each computing element in turn processes data from the shared register files and then stores the output of its computations in those same shared register files. To exchange data with external processors, so as to increase total computing power and support a computing system made of several neural network processors, input-output controllers are placed on one or more sides of the integrated circuit and connected to the set of outermost shared register files on the same side.

To implement the method described above, a neural network processor device is proposed. It contains computing elements and shared register files, each of which is connected to several neighboring computing elements, and the set of outermost shared register files on one or more sides of the integrated circuit is connected to input-output controllers located on those same sides.

To minimize the amount of control logic, the shared register files consist of banks, and only one computing element can access a given bank at any one time.

For ease of design and manufacture, the computing elements of the neural network processor can be located at the nodes of a rectangular regular grid, with each shared register file functionally connected to two neighboring computing elements.

For correct operation, the neural network processor includes:

- an exception handling unit, which reports errors and exceptional conditions to all computing elements;

- a synchronization unit containing a set of registers; each computing element can suspend its operation until a particular value appears in these registers, or can change these registers as directed by the program that the neural network processor executes;

- a reconfiguration unit, which makes it possible to change the operations performed by the computing elements during operation.

To illustrate the description above, Fig. 1 shows a block diagram of the proposed neural network processor, with the following reference numerals:

1 - computing element (CE);

2 - shared register file (SRF);

3 - input-output controller (IOC);

4 - synchronization unit;

5 - exception handling unit;

6 - reconfiguration unit.

To solve the problems addressed by the invention, each computing element (CE) of the neural network processor contains an instruction memory, an instruction decoder, an instruction counter logic unit, a hardware tensor slice calculator, an arithmetic logic module, and units for fetching data from and writing data to the shared register files (SRFs). The instruction executed at any moment is determined by the instruction counter logic unit, which can process instructions that change the currently executing instruction, that is, jump instructions.

The arithmetic logic module in turn consists of a tensor pipeline, a scalar pipeline, and a data transfer system; the tensor pipeline is designed to perform convolution operations with accumulation.

The units for fetching and writing SRF data comprise a register read module, a register write module, a memory switch, an arbiter, and status registers.

Note that the arithmetic logic module and the fetch and write units are implemented so that they can process several operands at once, performing several operations of the same kind on data held in the SRFs.

Besides jump instructions, the instruction counter logic unit of a CE can also process loop control instructions. These execute a given block of instructions a given number of times without the overhead, in processor clock cycles, of executing jump instructions.

In addition, each CE has a hardware tensor slice calculator controlled by the instruction counter logic unit. It computes an address within a multidimensional array (a tensor) from the tensor's dimensions and the current values of the tensor element index counters. The SRF elements whose addresses are computed by the hardware tensor slice calculator are mapped by the memory switch into other SRFs predetermined by the program that the neural network processor executes.

To illustrate the description of the computing element (CE) above, Fig. 2 shows a block diagram of a computing element and its connections to the shared register files (SRFs), with the following reference numerals:

7 - hardware tensor slice calculator (TSC);

8 - instruction memory (IM);

9 - instruction counter logic unit (ICLU);

10 - instruction decoder (ID).

Units for fetching and writing SRF data:

11 - register read module (RRM);

15 - arbiter;

16 - register write module (RWM);

17 - memory switch (MS);

18 - status registers (SR).

Arithmetic logic module:

12 - tensor pipeline (TP);

13 - scalar pipeline (SP);

14 - data transfer system (DTS).

Note that the input-output controllers (IOCs) connected to the shared register files (SRFs) can pass information arriving from outside to the computing elements (CEs).

Each IOC is a computing element with the additional capability of direct access to external dynamic memory.

The IOCs make it possible to transfer data transparently to an identical neural network processor installed alongside.

This makes it possible to build a computing system of several interconnected neural network processors in order to increase total computing power.

To achieve this result, the proposed neural network processor consists of a set of computing elements (CEs), each of which can exchange data with a set of other, similar CEs. Data are exchanged by virtue of the fact that, as a rule, a shared register file (SRF) is accessible to several CEs at once. Each CE can treat the data held in the SRFs as a fairly arbitrary slice of a tensor and, being programmable, can perform tensor multiplication on arbitrary slices. For this purpose the CE contains a dedicated tensor pipeline, part of the arithmetic logic module, which computes the convolution of a portion of the data, and a hardware tensor slice calculator, which controls the fetching of data fed to the tensor pipeline.

Furthermore, to increase code density, the instruction counter logic unit handles not only sequential program execution and jumps but also special hardware loop instructions, which execute a given block of instructions a given number of times without the overhead of jump and compare instructions. It also supports nested loops up to a certain depth and an early loop exit instruction that terminates a specified number of nested loops. This increases the density of the computational code by minimizing the number of auxiliary operations (for example, comparisons and conditional jumps) that do not themselves perform the arithmetic at the core of a neural network algorithm.
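The hardware loop mechanism described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented design: the instruction encoding `("loop", count, body_len)` and `("op", name)`, the `LoopFrame` record, and the `run` function are all hypothetical, and the early exit instruction is left out. The depth limit of three is taken from the example embodiment later in the text.

```python
from dataclasses import dataclass

MAX_LOOP_DEPTH = 3  # the example embodiment mentions up to three hardware loops


@dataclass
class LoopFrame:
    start: int      # index of the first instruction of the loop body
    end: int        # index of the last instruction of the loop body
    remaining: int  # iterations still to run, including the current one


def run(program):
    """Execute a toy program and return the names of the executed operations.

    Hypothetical encoding: ("loop", count, body_len) repeats the next
    body_len instructions count times; ("op", name) stands for arithmetic work.
    """
    pc, stack, trace = 0, [], []
    while pc < len(program):
        instr = program[pc]
        if instr[0] == "loop":
            _, count, body_len = instr
            if count < 1 or body_len < 1:
                raise ValueError("loop needs count >= 1 and body_len >= 1")
            if len(stack) == MAX_LOOP_DEPTH:
                raise RuntimeError("hardware loop nesting too deep")
            stack.append(LoopFrame(pc + 1, pc + body_len, count))
            pc += 1
            continue
        trace.append(instr[1])
        # The end-of-body check is done by the counter logic itself,
        # so the program contains no compare or jump instructions.
        next_pc = pc + 1
        while stack and pc == stack[-1].end:
            frame = stack[-1]
            frame.remaining -= 1
            if frame.remaining > 0:
                next_pc = frame.start
                break
            stack.pop()
        pc = next_pc
    return trace


# Two nested loops: the outer body is "a" followed by an inner loop over "b".
program = [("loop", 2, 3), ("op", "a"), ("loop", 3, 1), ("op", "b")]
trace = run(program)
```

The four-instruction program above performs eight operations; a software loop would add a counter update, a comparison, and a conditional jump to every iteration.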

In addition, the architecture of the neural network processor allows the hardware tensor slice calculator to be driven by the hardware loop indices computed by the instruction counter logic unit. With this approach, arbitrary matrix multiplications can be carried out so that the tensor pipeline of the computing element performs only arithmetic operations, while all address arithmetic is done entirely in hardware. This brings code density to the theoretical maximum.

By way of illustration, Fig. 3 shows a block diagram of the tensor pipeline for computing a dot product:

19 - zero scalar;

20 - input scalar;

21 - left multiplicand vector;

22 - right multiplicand vector;

23 - multiplexer;

24 - multiplier;

25 - adder;

26 - register;

27 - result.

Detailed description.

The neural network processor (Fig. 1) consists of a set of computing elements (CEs) 1 implemented on a single integrated circuit. The CEs are located at the nodes of a regular grid, and neighboring CEs have a common shared register file (SRF) 2 through which data are exchanged within the system; storage and exchange of data are managed explicitly so that operation is deterministic. The grid in which the CEs 1 are arranged is preferably rectangular, so each SRF 2 is shared between two neighboring CEs 1.
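The rectangular arrangement can be sketched as a small graph model in which every pair of neighboring computing elements shares one register file. This is a minimal sketch under stated assumptions: the function names and the coordinate encoding are hypothetical, and the register files on the grid edges that face input-output controllers are deliberately left out.

```python
def interior_register_files(rows, cols):
    """List the pairs of neighboring CEs that share a register file.

    CEs are identified by (row, col). Only register files between two CEs
    are listed; edge register files facing I/O controllers are omitted.
    """
    pairs = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                pairs.append(((r, c), (r, c + 1)))  # horizontal neighbor
            if r + 1 < rows:
                pairs.append(((r, c), (r + 1, c)))  # vertical neighbor
    return pairs


def register_files_of(pairs, ce):
    """Return the register files (as CE pairs) that a given CE can reach."""
    return [p for p in pairs if ce in p]


pairs = interior_register_files(4, 4)
corner_links = register_files_of(pairs, (0, 0))
inner_links = register_files_of(pairs, (1, 1))
```

In a 4×4 grid this yields 24 interior register files; a corner element reaches two of them and an inner element reaches four.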

Preferably, input-output controllers (IOCs) 3 are placed on each of the four sides of the grid of computing elements (CEs) 1 and can pass information arriving from outside to the CEs 1. Shared register files (SRFs) 2 sit between the outermost CEs 1 and the IOCs 3. Each IOC 3 is itself a computing element with the additional capability of direct access to external dynamic memory. The IOCs 3 make it possible to transfer data transparently, for example to an identical neural network processor installed alongside. A computing system can thus be built from several neural network processors joined through their IOCs 3 to increase total computing power.

A block diagram of a CE 1 is shown in Fig. 2. Each CE 1 contains an instruction memory 8; an instruction decoder 10; an arithmetic logic module consisting of a tensor pipeline 12, a scalar pipeline 13, and a data transfer system 14; and units for fetching and writing SRF data, comprising an arbiter 15, a register write module 16, a register read module 11, and a memory switch 17. The instruction executed at any moment is determined by the instruction counter logic unit 9, which can process instructions that change the currently executing instruction (jump instructions). The SRFs 2 themselves are preferably divided into banks, and, to minimize the amount of control logic, only one CE can access a given bank at any one time.
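The one-element-per-bank rule can be illustrated with a toy model. The source does not describe how a bank is handed from one computing element to another, so the `acquire`/`release` protocol, the class name, and the element labels below are assumptions made only for this sketch.

```python
class BankedRegisterFile:
    """Toy shared register file split into banks with one owner per bank."""

    def __init__(self, num_banks, words_per_bank):
        self.banks = [[0] * words_per_bank for _ in range(num_banks)]
        self.owner = [None] * num_banks  # which CE currently holds each bank

    def acquire(self, bank, ce):
        if self.owner[bank] is not None and self.owner[bank] != ce:
            raise PermissionError(f"bank {bank} is held by {self.owner[bank]}")
        self.owner[bank] = ce

    def release(self, bank, ce):
        self._check(bank, ce)
        self.owner[bank] = None

    def write(self, bank, addr, value, ce):
        self._check(bank, ce)
        self.banks[bank][addr] = value

    def read(self, bank, addr, ce):
        self._check(bank, ce)
        return self.banks[bank][addr]

    def _check(self, bank, ce):
        # Only the current owner may touch the bank, so no arbitration
        # between simultaneous accesses to one bank is ever needed.
        if self.owner[bank] != ce:
            raise PermissionError(f"{ce} does not hold bank {bank}")


srf = BankedRegisterFile(num_banks=2, words_per_bank=4)
srf.acquire(0, "CE_A")
srf.write(0, 0, 123, "CE_A")   # CE_A produces a value in bank 0
srf.release(0, "CE_A")
srf.acquire(0, "CE_B")         # explicit hand-over to the neighboring CE
value_seen_by_b = srf.read(0, 0, "CE_B")
```

With two banks, one element can fill a bank while its neighbor consumes the other, which is one plausible way to keep both busy under this rule.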

The arithmetic logic module and the units for fetching and writing shared register file (SRF) data in each computing element (CE) are built so that they can process several operands at once, performing several operations of the same kind on data laid out in the SRFs in a particular way. The arithmetic logic module itself consists of a tensor pipeline 12, a scalar pipeline 13, and a data transfer system 14. The tensor pipeline 12 is designed to perform convolution operations with accumulation; a block diagram of the tensor pipeline for computing a dot product is shown in Fig. 3.
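A dot product with accumulation of the kind the tensor pipeline performs can be sketched as follows. The text does not spell out the wiring of Fig. 3, so this sketch assumes that the multiplexer chooses between the zero scalar and the input scalar as the starting value of the sum; the function names are hypothetical, the lane limit of 8 comes from the example embodiment, and the 16-bit word width is ignored because Python integers do not overflow.

```python
MAX_LANES = 8  # the example embodiment limits each vector to 8 machine words


def dot_accumulate(left, right, input_scalar=0, accumulate=False):
    """One pass through a toy tensor pipeline.

    accumulate=False starts from the zero scalar; accumulate=True starts
    from input_scalar (assumed role of the multiplexer).
    """
    if len(left) != len(right):
        raise ValueError("vectors must have equal length")
    if len(left) > MAX_LANES:
        raise ValueError("vector longer than the pipeline width")
    start = input_scalar if accumulate else 0        # multiplexer
    products = [a * b for a, b in zip(left, right)]  # multipliers
    return start + sum(products)                     # adder, then register


def long_dot(left, right):
    """Dot product of longer vectors, built from chunks of at most MAX_LANES."""
    acc = 0
    for i in range(0, len(left), MAX_LANES):
        acc = dot_accumulate(left[i:i + MAX_LANES], right[i:i + MAX_LANES],
                             input_scalar=acc, accumulate=True)
    return acc


fresh = dot_accumulate([1, 2, 3], [4, 5, 6])
chained = dot_accumulate([1, 2, 3], [4, 5, 6], input_scalar=10, accumulate=True)
total = long_dot(list(range(1, 21)), [1] * 20)
```

Chaining passes through the accumulator is what lets a short pipeline evaluate one convolution output whose window holds more than eight elements.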

Besides jump instructions, the instruction counter logic unit 9 of each computing element (CE) 1 can also process loop control instructions, which execute a given block of instructions a given number of times without the overhead of executing jump instructions.

The hardware tensor slice calculator 7 of each CE 1, controlled by the instruction counter logic unit 9, computes an address within a multidimensional array (a tensor) from the tensor's dimensions and the current values of the tensor element index counters. The shared register file elements whose addresses it computes are mapped into other predetermined registers by the memory switch 17.
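The address computation can be illustrated with a short sketch that derives a flat address from the tensor dimensions and the current index counters. The row-major layout, the `base` offset, and all function names are assumptions for illustration; the source states only that the address depends on the dimensions and the index counters.

```python
from itertools import product


def strides_for(dims):
    """Row-major strides (in words) for a tensor with the given dimensions."""
    strides = [1] * len(dims)
    for i in range(len(dims) - 2, -1, -1):
        strides[i] = strides[i + 1] * dims[i + 1]
    return strides


def element_address(dims, indices, base=0):
    """Flat address of the element selected by the index counters."""
    if len(dims) != len(indices):
        raise ValueError("one index per dimension is required")
    for idx, dim in zip(indices, dims):
        if not 0 <= idx < dim:
            raise IndexError("index counter out of range")
    return base + sum(i * s for i, s in zip(indices, strides_for(dims)))


def window_addresses(dims, origin, window):
    """Addresses of a rectangular slice, as nested loop indices would visit it."""
    return [
        element_address(dims, [o + k for o, k in zip(origin, offsets)])
        for offsets in product(*(range(w) for w in window))
    ]


addr = element_address([2, 3, 4], [1, 2, 3])
patch = window_addresses([5, 5], origin=[1, 1], window=[2, 2])
```

Here `patch` lists the four addresses of a 2×2 window inside a 5×5 tensor, the kind of slice a convolution kernel reads at each output position.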

In addition, the neural network processor contains an exception handling unit 5, which reports errors and exceptional conditions to all computing elements (CEs) 1; a synchronization unit 4 containing a set of registers, where each CE 1 can suspend its operation until a particular value appears in these registers or can change these registers at will; and a reconfiguration unit 6, which makes it possible to change the operations performed by the CEs 1 during operation.
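The behavior of the synchronization unit can be sketched with Python generators standing in for computing elements. Everything here is a hypothetical model: the `("set", reg, value)` and `("wait", reg, value)` actions, the round-robin scheduler, and the number of registers are assumptions chosen only to show one element stalling until another writes an agreed value.

```python
def run_elements(elements, num_registers=4, max_rounds=1000):
    """Round-robin simulation of CEs sharing synchronization registers.

    Each element is a generator yielding ("set", reg, value) or
    ("wait", reg, value). Returns the final registers and a log of writes.
    """
    regs = [0] * num_registers
    log = []
    waiting = {}            # element name -> (reg, value) it is stalled on
    active = dict(elements)
    for _ in range(max_rounds):
        if not active:
            return regs, log
        progressed = False
        for name in list(active):
            if name in waiting:
                reg, value = waiting[name]
                if regs[reg] != value:
                    continue            # still suspended
                del waiting[name]
            try:
                kind, reg, value = next(active[name])
            except StopIteration:
                del active[name]
                progressed = True
                continue
            progressed = True
            if kind == "set":
                regs[reg] = value
                log.append((name, "set", reg, value))
            elif kind == "wait":
                waiting[name] = (reg, value)
            else:
                raise ValueError(f"unknown action {kind!r}")
        if not progressed:
            raise RuntimeError("deadlock: every element is waiting")
    raise RuntimeError("round limit reached")


def producer():
    yield ("set", 0, 1)    # signal that data is ready


def consumer():
    yield ("wait", 0, 1)   # suspend until register 0 holds 1
    yield ("set", 1, 1)    # acknowledge


final_regs, write_log = run_elements({"consumer": consumer(), "producer": producer()})
```

Although the consumer is scheduled first, its write is logged after the producer's, because it stays suspended until register 0 holds the expected value.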

Example of a specific embodiment.

The device, a neural network processor, is implemented as a silicon digital integrated circuit fabricated in a 90 nm low-power process. It consists of a regular 4×4 grid of computing elements and 32 local shared register files, each shared between two computing elements and holding 2048 machine words. The standard machine word on which the processor operates is 16 bits wide. Each instruction counter logic unit contains a hardware loop counter that supports up to three hardware loops. The hardware tensor slice calculator, which can work with tensors of arbitrary size, can perform up to four operations of the same type simultaneously. The scalar pipeline can perform a set of standard arithmetic and logical operations (addition, subtraction, logical AND, OR, NOT, and exclusive OR, and logical shifts), as well as saturating addition and subtraction on two operands. The tensor pipeline computes, with accumulation, the dot product of two arrays of at most 8 machine words each. The input-output controllers of the neural network processor support transparent data transfer between the outermost shared register files of neighboring neural network processors. The neural network processor contains an exception handling unit that can notify the computing elements of an invalid operation performed by a particular computing element or of the need to reset the neural network processor.
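Saturating addition and subtraction on 16-bit words can be illustrated as below. The source does not say whether the words are signed, so signed two's complement is assumed here, and the function names are hypothetical. A wrapping add is included only to show the contrast.

```python
WORD_BITS = 16                        # machine word width from the embodiment
INT_MIN = -(1 << (WORD_BITS - 1))     # -32768, assuming signed words
INT_MAX = (1 << (WORD_BITS - 1)) - 1  # 32767


def saturate(x):
    """Clamp a result to the representable range instead of wrapping."""
    return max(INT_MIN, min(INT_MAX, x))


def sat_add(a, b):
    return saturate(a + b)


def sat_sub(a, b):
    return saturate(a - b)


def wrap_add(a, b):
    """Ordinary two's complement addition, shown for comparison."""
    return ((a + b - INT_MIN) % (1 << WORD_BITS)) + INT_MIN


clamped = sat_add(30000, 10000)
wrapped = wrap_add(30000, 10000)
floor = sat_sub(-30000, 10000)
```

Saturation keeps an overflowing activation at the largest representable value, whereas wrapping would flip its sign, which is why it is a common choice for fixed-point neural network arithmetic.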

Table 1 compares the estimated energy efficiency of the proposed neural network processor implementation with that of the Intel i7-6700K processor.

Industrial applicability

The device, a neural network processor, can be used in applications that constrain power consumption, such as personal mobile devices, robotic platforms, and autonomous video surveillance systems. In addition, the neural network processor can be used in systems for intelligent processing of large volumes of data (for example, in data centers) to increase specific computing power.

Conclusions

The method for constructing neural network processors comprises a method and a device that implements it. The device can perform a large number of tensor operations per unit of time thanks to a large number of distributed computing elements containing specialized components for tensor computations. This significantly accelerates computations specific to convolutional artificial neural networks. In addition, because the bulk of the data is transferred locally between neighboring computing elements and the transfer is controlled explicitly, the device can be made highly energy efficient.

Claims

1. A method of constructing neural network processors for inference in convolutional neural networks, which consists in placing on one integrated circuit of a neural network processor computing elements located at the nodes of a regular grid, characterized in that shared register files containing locally stored data are placed on the same integrated circuit, wherein each shared register file is connected to several neighboring computing elements, each of which in turn processes data from the shared register files and then stores the output of the computations in these shared register files; and wherein, on one or several sides of the integrated circuit, the set of extreme shared register files is connected to I/O controllers located on the same side, which in turn allows data exchange with external processors in order to increase the total computing power and to support a computing system consisting of several neural network processors.

2. A neural network processor built in accordance with the method according to claim 1, containing computing elements, characterized in that it contains shared register files, each of which is connected to several neighboring computing elements, and the set of extreme shared register files on one or more sides of the integrated circuit is connected to the I/O controllers on the same side.

3. A neural network processor according to claim 2, characterized in that the shared register files consist of banks, and in order to minimize the amount of control logic, only one computing element can have access to each bank at a given time.

4. The neural network processor according to claim 2, characterized in that the computational elements are located at the nodes of a rectangular regular grid and each shared register file is connected to two adjacent computational elements.

5. The neural network processor according to claim 2, characterized in that the device includes: an exception handling unit that transmits information about errors and exceptional situations that have occurred to all computing elements; a synchronization unit containing a set of registers, wherein each computing element can suspend its operation until a certain value appears in these registers, or can change these registers in accordance with the program executed by the processor; and a reconfiguration block that allows the operations performed by the computing elements to be changed during operation.

6. The neural network processor according to claim 2, characterized in that each computing element contains an instruction memory, an instruction decoder, an instruction counter logic block, a hardware tensor section calculator, arithmetic logic modules, and blocks for fetching data from and writing data to a shared register file, wherein the instruction executed at any given moment is determined by the instruction counter logic block, which is capable of processing instructions that change the currently executing instruction (jump instructions).

7. The neural network processor according to claim 2, characterized in that the arithmetic-logical module of the computing element consists of three sub-pipelines: a tensor pipeline, a scalar pipeline and a data transmission system, while the tensor pipeline is designed to perform convolution operations with accumulation.

8. The neural network processor according to claim 2, characterized in that the instruction counter logic block of the computing element is capable of processing, in addition to jump instructions, loop control instructions that allow a certain block of instructions to be executed a certain number of times without additional overhead, that is, without the cycles associated with executing jump instructions.

9. The neural network processor according to claim 2, characterized in that its computing elements are equipped with a hardware tensor section calculator, controlled by the instruction counter logic block, that calculates an address in a multidimensional array (a tensor) based on its size and the current values of the tensor element index counters, and the elements of the shared register file whose address is calculated by the hardware tensor section calculator are mapped into other predefined shared register files using the memory switch.

10. The neural network processor according to claim 2, characterized in that the input-output controllers make it possible to organize transparent data transfer to an identical neural network processor installed nearby.
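The address calculation attributed to the hardware tensor section calculator in claim 9 (an address derived from the tensor size and the current values of the index counters) can be sketched as follows. This is an illustrative sketch only. The source does not specify the memory layout, so a row-major layout is assumed here, and the names `flat_address`, `advance`, `base`, `shape`, and `index` are hypothetical.

```python
def flat_address(base, shape, index):
    """Compute a word address from tensor dimensions and index counters.

    Assumption: row-major layout, i.e. the last index varies fastest.
    `base` is the (hypothetical) offset of the tensor in the register file.
    """
    if len(shape) != len(index):
        raise ValueError("shape and index must have the same rank")
    offset = 0
    for size, i in zip(shape, index):
        if not 0 <= i < size:
            raise IndexError("index counter out of range")
        offset = offset * size + i     # Horner-style accumulation of strides
    return base + offset


def advance(shape, index):
    """Step the index counters like nested loops.

    Returns the new counters and a flag that is True when every counter
    has wrapped around, i.e. the whole tensor has been traversed.
    """
    counters = list(index)
    for axis in reversed(range(len(shape))):
        counters[axis] += 1
        if counters[axis] < shape[axis]:
            return counters, False
        counters[axis] = 0             # carry into the next slower axis
    return counters, True
```

For example, under the row-major assumption, element (1, 2, 3) of a 2×3×4 tensor stored at offset 0 lies at word address 1·12 + 2·4 + 3 = 23.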