KR102162749B1

KR102162749B1 - Neural network processor

Info

Publication number: KR102162749B1
Application number: KR1020180059775A
Authority: KR
Inventors: 김한준; 구본철; 강지훈; 이창만
Original assignee: 주식회사 퓨리오사에이아이
Priority date: 2018-04-03
Filing date: 2018-05-25
Publication date: 2020-10-07
Also published as: KR20190116024A; KR20190116040A; KR102191408B1

Abstract

This specification discloses a neural network processor. One embodiment of the present invention provides a neural network processing unit configured to perform neural network processing, the processing unit comprising: an instruction memory configured to store tasks each having one or more instructions; a data memory configured to store data associated with the tasks; a data flow engine configured to check whether data is ready for each of the tasks and to notify a control flow engine that tasks are ready, in the order in which their data becomes ready; the control flow engine, configured to execute the tasks in the order notified by the data flow engine; and an execution unit configured to perform operations according to the one or more instructions of the task whose execution the control flow engine controls.

Description

Neural Network Processor

The present disclosure relates to a neural network processor and, more particularly, to a neural network processing unit, method, and system configured to improve processing speed in an artificial neural network.

The human brain contains a vast number of neurons, and intelligence arises from their electrochemical activity. Each neuron is connected to other neurons through synapses, and neurons exchange electrical or chemical signals with one another. Signals that a neuron receives through its synapses are converted into electrical signals that stimulate the neuron. When the neuron's potential consequently rises to the threshold potential, the neuron generates an action potential and transmits an electrochemical signal to other neurons through its synapses.

An artificial neural network (ANN) implements artificial intelligence by interconnecting artificial neurons, which are mathematical models of the neurons that make up the human brain. One mathematical model of an artificial neuron is given by Equation (1) below. Specifically, an artificial neuron receives input signals x_i, multiplies each x_i by its corresponding weight w_i, and sums the products. The neuron then applies an activation function to obtain an activation value and passes that value to the next artificial neurons to which it is connected.

$$y = f(w_1 x_1 + w_2 x_2 + \dots + w_n x_n) = f\left(\sum_{i=1}^{n} w_i x_i\right) \tag{1}$$

where i = 1, ..., n and n is the number of input signals.
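Equation (1) can be illustrated with a minimal Python sketch. The function names `neuron_output`, `relu`, and `sigmoid` are chosen here for illustration only; the source does not fix a particular activation function f, so the two shown are merely common choices.

```python
import math


def relu(z):
    # One common activation function; an assumption, not taken from the source.
    return max(0.0, z)


def sigmoid(z):
    # Another common activation function, also an illustrative assumption.
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, activation):
    """Compute y = f(sum_i w_i * x_i) as in Equation (1)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)


if __name__ == "__main__":
    # Two input signals with weights 0.5 and 0.25.
    print(neuron_output([1.0, 2.0], [0.5, 0.25], relu))
```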

A deep neural network (DNN), one form of ANN, has a layered network architecture in which the artificial neurons (nodes) are arranged in layers. A DNN consists of an input layer, an output layer, and multiple hidden layers between them. The input layer consists of nodes that receive the input values, and these nodes pass the output values computed with the mathematical model above to the nodes of the next hidden layer connected to the input layer. The nodes of a hidden layer receive input values, compute output values using the mathematical model above, and pass the output values to the nodes of the output layer.

The computation of deep learning, a form of machine learning performed on a DNN, can be divided into a training process, in which a given DNN keeps learning from training data to improve its computational ability, and an inference process, in which the trained DNN is used to make inferences about new input data.

Inference in deep learning uses forward propagation: the nodes of the input layer receive the input data, and the hidden layers and the output layer then perform their computations sequentially in layer order. Finally, the output layer nodes derive the conclusion of the inference process from the output values of the hidden layers.
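Forward propagation as described above can be sketched as follows. This is a minimal illustration under assumptions: each layer is represented as a list of weight rows (one row per node), ReLU is used as the activation function, and the names `layer_forward` and `forward` are hypothetical.

```python
def relu(z):
    # Illustrative activation function; the source does not specify one.
    return max(0.0, z)


def layer_forward(inputs, weight_rows, activation):
    """Each node applies the weighted-sum model to the previous layer's outputs."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)))
        for row in weight_rows
    ]


def forward(inputs, layers, activation=relu):
    """Evaluate the layers strictly in order, feeding each layer's outputs forward."""
    values = inputs
    for weight_rows in layers:
        values = layer_forward(values, weight_rows, activation)
    return values


if __name__ == "__main__":
    # One hidden layer with two nodes, then an output layer with one node.
    layers = [
        [[1.0, -1.0], [0.5, 0.5]],
        [[1.0, 2.0]],
    ]
    print(forward([2.0, 1.0], layers))
```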

By contrast, training in deep learning adjusts the weights of the nodes to reduce the difference between the conclusion of the inference process and the correct answer. The weights are generally updated by gradient descent. To implement gradient descent, the derivative of the difference between the inference conclusion and the correct answer must be obtained with respect to the weight of each node. In this process, the derivative with respect to the weight of a node at the front of the DNN is computed by applying the chain rule to the derivatives with respect to the weights of nodes at the rear of the DNN. Because the chain rule is evaluated in the direction opposite to inference, the training process of deep learning uses back propagation.
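The chain-rule computation can be made concrete with a deliberately small sketch. It assumes a chain of two scalar weights with no activation function (h = w1 * x, y = w2 * h) and a squared-error loss L = 0.5 * (y - target)^2; these simplifications, the learning rate, and all names are assumptions for illustration, not details from the source.

```python
def forward(x, w1, w2):
    h = w1 * x   # front node
    y = w2 * h   # rear node
    return h, y


def gradients(x, target, w1, w2):
    """Back propagation: derivatives are evaluated from the rear node to the front node."""
    h, y = forward(x, w1, w2)
    dL_dy = y - target       # derivative of 0.5 * (y - target)^2 with respect to y
    dL_dw2 = dL_dy * h       # rear weight
    dL_dh = dL_dy * w2       # passed backward by the chain rule
    dL_dw1 = dL_dh * x       # front weight, reusing the rear derivative
    return dL_dw1, dL_dw2


def train(x, target, w1, w2, learning_rate=0.1, steps=50):
    """Plain gradient descent on both weights."""
    for _ in range(steps):
        g1, g2 = gradients(x, target, w1, w2)
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
    return w1, w2


if __name__ == "__main__":
    w1, w2 = train(x=1.0, target=2.0, w1=1.0, w2=1.0)
    print(forward(1.0, w1, w2)[1])  # approaches the target value 2.0
```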

In other words, a DNN has a layered structure: the nodes in each layer receive result values from multiple nodes in the previous layer, perform a computation based on the mathematical model of the node described above to produce a new result value, and pass the new result value to the nodes of the next layer.

Meanwhile, most conventional DNN computation architectures use distributed processing, in which the many computations performed by the nodes in each layer are distributed across multiple computation units. When the computations of the nodes in each layer are distributed in this way, however, each computation unit can perform its computation only after it has received all of the previous results that the computation requires. Computation in an artificial neural network using the conventional architecture therefore requires synchronization at the layer level. In the conventional DNN computation architecture, the synchronization method guarantees that all computations of the previous layer have completed before the computations of the next layer are performed.

As described above, a data flow approach, which favors parallel processing, is the natural fit for DNN computation. Nevertheless, conventional DNNs have had a computation architecture optimized for regular computation (serial processing), such as dense matrix multiplication. That is, the conventional architecture is optimized for regular computation in which all nodes in a layer are given similar amounts of work so that they finish at similar times, and in which a synchronization step is performed when a layer finishes before the computation of the next layer begins in sequence.

However, the nodes distributed across the layers are independent, so each node can begin its computation as soon as the previous results it needs are available. Although a node of a conventional DNN could therefore perform its computation regardless of whether other nodes have finished theirs, it cannot start until all work in the previous layer has finished. In addition, when some computation units have not finished all of the computations assigned to them, the units that have already finished must wait. As a result, the conventional DNN computation architecture is inefficient for irregular computation (parallel processing), in which nodes process different amounts of computation.
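The cost of layer-level synchronization can be illustrated with a small timing model. The sketch below is a simplification under stated assumptions: every task has its own computation unit (no contention), durations are given up front, and the task names, durations, and function names are hypothetical.

```python
def finish_times_layer_sync(layers):
    """Every task in a layer starts only after the whole previous layer has finished."""
    finish = {}
    barrier = 0.0
    for layer in layers:
        for name, (duration, _deps) in layer.items():
            finish[name] = barrier + duration
        barrier = max(finish[name] for name in layer)
    return finish


def finish_times_dataflow(layers):
    """Every task starts as soon as its own dependencies have finished."""
    finish = {}
    for layer in layers:
        for name, (duration, deps) in layer.items():
            start = max((finish[d] for d in deps), default=0.0)
            finish[name] = start + duration
    return finish


if __name__ == "__main__":
    # Each layer maps a task name to (duration, dependencies).
    layers = [
        {"a": (1.0, []), "b": (5.0, [])},
        {"c": (1.0, ["a"]), "d": (1.0, ["b"])},
    ]
    print(finish_times_layer_sync(layers))
    print(finish_times_dataflow(layers))
```

In this toy case task c depends only on the fast task a, yet under layer synchronization it waits for the slow task b as well.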

In this regard, irregular computation occurs frequently in deep learning. It can arise, for example, from operations on matrices that contain many zero values, which need not be computed. For ease of understanding, consider a DNN whose computation is programmed so that multiple computation units share identical or similar amounts of computation and thus all perform regular computation.

In this example, when a matrix operation involves zero values that need not be computed, the computation units skip those operations, so the amount of computation each unit must process differs and irregular computation results. Irregular computation also occurs frequently when the weights of the connections that make up the artificial neural network are sparse, that is, contain many zeros, or when the type of activation function used in processing the network tends to produce zero values. Furthermore, in deep learning computation, a large number of operations are divided among multiple computation units, and when these units communicate to exchange data, latency on the communication network and variation in that latency cause irregular communication. In this case, irregular computation and communication between computation units can make the completion times of the computations performed by the nodes in one layer of the DNN irregular. Moreover, a few slowly processed computations can leave some computation units within a layer idle.
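How skipped zeros produce unequal workloads can be shown with a short sketch. It assumes one computation unit per weight row and counts only multiplications; the names `sparse_dot` and `work_per_unit` and the sample values are illustrative.

```python
def sparse_dot(weights, inputs):
    """Dot product that skips any term whose weight or input is zero."""
    total = 0.0
    multiplies = 0
    for w, x in zip(weights, inputs):
        if w == 0.0 or x == 0.0:
            continue  # nothing to compute for this term
        total += w * x
        multiplies += 1
    return total, multiplies


def work_per_unit(weight_rows, inputs):
    """Number of multiplications each unit performs when one row is assigned per unit."""
    return [sparse_dot(row, inputs)[1] for row in weight_rows]


if __name__ == "__main__":
    weight_rows = [
        [1.0, 2.0, 3.0, 4.0],
        [0.0, 0.0, 5.0, 0.0],
        [0.0, 1.0, 0.0, 1.0],
    ]
    inputs = [1.0, 1.0, 0.0, 1.0]
    print(work_per_unit(weight_rows, inputs))
```

Although every row has the same length, the units end up with different amounts of work, which is the irregularity the passage describes.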

Even in these situations, a conventional DNN is configured to wait until the computation and communication of all nodes in a layer have completed and to pass through a synchronization step before processing the next layer. Conventional DNNs therefore suffer from low node utilization, and low node utilization in turn lowers the computational efficiency of the DNN as a whole. To raise the computational efficiency of the DNN, synchronization could be performed per node rather than per layer, but this incurs additional hardware cost.

Embodiments of the present invention provide a neural network processing unit, method, and system configured to improve processing speed in an artificial neural network in which irregular computation occurs frequently.

The technical problems to be solved by the embodiments of the present invention are not limited to those mentioned above, and other technical problems not mentioned above will be clearly understood from the description below by a person of ordinary skill in the art to which the embodiments of the present invention belong.

To solve the above technical problems, one embodiment of the present invention provides a neural network processing unit configured to perform neural network processing, the processing unit comprising: an instruction memory configured to store tasks each having one or more instructions; a data memory configured to store data associated with the tasks; a data flow engine configured to check whether data is ready for each of the tasks and to notify a control flow engine that tasks are ready, in the order in which their data becomes ready; the control flow engine, configured to control execution of the tasks in the order notified by the data flow engine; and an execution unit configured to perform operations according to the one or more instructions of the task whose execution the control flow engine controls.

The present embodiment may further include a data fetch unit configured to fetch, from the data memory to the execution unit, the data to be operated on according to the one or more instructions of the task whose execution the control flow engine controls.

In the present embodiment, the control flow engine may store result data produced by the operations of the execution unit in the data memory.

The present embodiment may further include a router configured to relay communication between an external device and the processing unit.

The present embodiment may further include a register file configured to have one or more registers, which are storage spaces used while a task is executed by the control flow engine.

In the present embodiment, the data memory may include a dynamic memory, and a given region of the dynamic memory may be configured to be allocated to a specific task and then released when the tasks, including the specific task, no longer use the data of the specific task.
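One way to picture this allocate-then-release behavior is a table of remaining users per allocated region. The sketch below is an assumption-laden illustration, not the mechanism of the source: the class `DynamicMemory`, its methods, and the idea of tracking users as a set are all hypothetical.

```python
class DynamicMemory:
    """Hypothetical model: a region is released once no task still uses its data."""

    def __init__(self):
        self._data = {}    # owning task -> stored data
        self._users = {}   # owning task -> tasks that still use the data

    def allocate(self, owner, data, users):
        # The owning task itself counts as a user of its own region.
        self._data[owner] = data
        self._users[owner] = set(users) | {owner}

    def read(self, owner):
        return self._data[owner]

    def done(self, owner, user):
        """Record that `user` no longer needs the region; release it when unused."""
        remaining = self._users[owner]
        remaining.discard(user)
        if not remaining:
            del self._data[owner]
            del self._users[owner]

    def is_allocated(self, owner):
        return owner in self._data


if __name__ == "__main__":
    memory = DynamicMemory()
    memory.allocate("task0", [1, 2, 3], users={"task1"})
    memory.done("task0", "task0")
    print(memory.is_allocated("task0"))  # task1 still uses the data
    memory.done("task0", "task1")
    print(memory.is_allocated("task0"))  # region has been released
```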

In the present embodiment, the execution unit may be formed such that computation modules that perform operations are arranged in a specific pattern.

In the present embodiment, the data flow engine may be configured to receive, from the instruction memory, the index of each of the tasks and information on the data associated with the tasks, to check whether the data required by each of the tasks is ready, and to transmit the indices of the tasks whose data is ready to the control flow engine in the order in which data preparation is completed.
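The readiness check performed by the data flow engine can be sketched in software as follows. This is only an illustration under assumptions: required data is identified by name, the `notify` callback stands in for the path to the control flow engine, ties are broken by task index, and the class and method names are hypothetical.

```python
class DataFlowEngine:
    """Hypothetical model: reports task indices in the order their data becomes ready."""

    def __init__(self, task_inputs, notify):
        # task_inputs: task index -> names of the data the task requires
        self._missing = {index: set(names) for index, names in task_inputs.items()}
        self._notify = notify
        # Tasks that require no data are ready immediately.
        for index in sorted(self._missing):
            self._check(index)

    def _check(self, index):
        if index in self._missing and not self._missing[index]:
            del self._missing[index]
            self._notify(index)

    def data_ready(self, name):
        """Mark one piece of data as available and report newly ready tasks."""
        for index in sorted(self._missing):
            self._missing[index].discard(name)
            self._check(index)


if __name__ == "__main__":
    order = []
    engine = DataFlowEngine({0: {"a", "b"}, 1: {"b"}, 2: set()}, order.append)
    engine.data_ready("b")
    engine.data_ready("a")
    print(order)  # task 1 is reported before task 0
```

The point of the example is that the notification order follows data availability rather than task index.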

In the present embodiment, the control flow engine may include: a fetch ready queue configured to receive, from the data flow engine, the indices of tasks whose data is ready; a fetch block configured to check whether the data required to execute the task corresponding to an index received from the fetch ready queue is present in the data memory and, if not, to load the data from outside into the data memory; a running ready queue configured to receive the indices of tasks for which the fetch block has completed data loading; a running block configured to receive, sequentially from the running ready queue, the indices of the tasks queued there; and a program counter configured to execute the one or more instructions of the task corresponding to the index transmitted to the running block.
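The two-queue structure of the control flow engine can be modeled in a few lines. The sketch below is a software analogy under assumptions: memories are plain dictionaries, instructions are hypothetical `(op, dst, a, b)` tuples, and the `execute` function stands in for the execution unit. None of these representations come from the source.

```python
from collections import deque


def execute(instruction, data_memory):
    """Stand-in for the execution unit; the instruction format is an assumption."""
    op, dst, a, b = instruction
    if op == "add":
        data_memory[dst] = data_memory[a] + data_memory[b]
    elif op == "mul":
        data_memory[dst] = data_memory[a] * data_memory[b]
    else:
        raise ValueError(f"unknown operation: {op}")


class ControlFlowEngine:
    def __init__(self, instruction_memory, task_inputs, data_memory, external_memory):
        self.instruction_memory = instruction_memory  # task index -> instructions
        self.task_inputs = task_inputs                # task index -> required data names
        self.data_memory = data_memory
        self.external_memory = external_memory
        self.fetch_ready_queue = deque()
        self.running_ready_queue = deque()

    def notify_ready(self, index):
        # Called with indices sent by the data flow engine.
        self.fetch_ready_queue.append(index)

    def fetch_step(self):
        """Fetch block: load any data missing from the data memory."""
        if not self.fetch_ready_queue:
            return False
        index = self.fetch_ready_queue.popleft()
        for name in self.task_inputs[index]:
            if name not in self.data_memory:
                self.data_memory[name] = self.external_memory[name]
        self.running_ready_queue.append(index)
        return True

    def run_step(self):
        """Running block: step a program counter through the task's instructions."""
        if not self.running_ready_queue:
            return False
        index = self.running_ready_queue.popleft()
        instructions = self.instruction_memory[index]
        program_counter = 0
        while program_counter < len(instructions):
            execute(instructions[program_counter], self.data_memory)
            program_counter += 1
        return True


if __name__ == "__main__":
    engine = ControlFlowEngine(
        instruction_memory={0: [("mul", "p", "x", "w"), ("add", "y", "p", "b")]},
        task_inputs={0: ["x", "w", "b"]},
        data_memory={"x": 2.0},
        external_memory={"w": 3.0, "b": 1.0},
    )
    engine.notify_ready(0)
    engine.fetch_step()
    engine.run_step()
    print(engine.data_memory["y"])
```

Separating the fetch stage from the running stage lets data loading for one task proceed while another task is queued to run, which is why two queues appear in the description.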

Further, to solve the above technical problems, another embodiment of the present invention provides a method of performing neural network processing using a neural network processing unit that includes an instruction memory configured to store tasks each having one or more instructions, a data memory configured to store data associated with the tasks, a data flow engine, a control flow engine, and an execution unit, the method comprising: (i) checking, by the data flow engine, whether data is ready for each of the tasks and notifying the control flow engine that tasks are ready, in the order in which their data becomes ready; (ii) controlling, by the control flow engine, execution of the tasks in the order notified by the data flow engine; and (iii) performing, by the execution unit, operations according to the one or more instructions of the task whose execution the control flow engine controls.

In the present embodiment, the neural network processing unit may further include a data fetch unit, and the neural network processing method may further include, between step (ii) and step (iii), fetching, by the data fetch unit, from the data memory to the execution unit, the data to be operated on according to the one or more instructions of the task whose execution the control flow engine controls.

The present embodiment may further include, after step (iii), storing, by the control flow engine, result data produced by the operations of the execution unit in the data memory.

In the present embodiment, step (i) may include: receiving, by the data flow engine, from the instruction memory, the index of each of the tasks and information on the data associated with the tasks; checking, by the data flow engine, whether the data required by each of the tasks is ready; and transmitting, by the data flow engine, the indices of the tasks whose data is ready to the control flow engine in the order in which data preparation is completed.

In the present embodiment, the control flow engine may further include a fetch ready queue, a fetch block, a running ready queue, a running block, and a program counter, and step (ii) may include: receiving, by the fetch ready queue, the indices of tasks whose data is ready from the data flow engine; when the fetch ready queue sequentially transmits the task indices to the fetch block, checking, by the fetch block, whether the data required to execute the task corresponding to a received index is present in the data memory and, if not, loading the data from outside into the data memory; receiving, by the running ready queue, the indices of tasks for which the fetch block has completed data loading; and, when the running ready queue sequentially transmits the task indices to the running block, executing, by the program counter, the one or more instructions of the task corresponding to the index transmitted to the running block.

Further, to solve the above technical problems, still another embodiment of the present invention provides a neural network processing system configured to perform neural network processing, the system comprising a central processing unit and a computing module having one or more processing units that perform processing according to commands of the central processing unit, wherein each processing unit includes: an instruction memory configured to store tasks each having one or more instructions; a data memory configured to store data associated with the tasks; a data flow engine configured to check whether data is ready for each of the tasks and to notify a control flow engine that tasks are ready, in the order in which their data becomes ready; the control flow engine, configured to control execution of the tasks in the order notified by the data flow engine; and one or more execution units configured to perform operations according to the one or more instructions of the task whose execution the control flow engine controls.

In the present embodiment, the processing unit may further include a data fetch unit configured to fetch, from the data memory to the execution unit, the data to be operated on according to the one or more instructions of the task whose execution the control flow engine controls.

In the present embodiment, result data stored in the data memory may be used for processing by a processing unit other than the processing unit that includes the data memory.

The present embodiment may further include a host interface configured to connect the central processing unit and the computing module so that communication is performed between the central processing unit and the computing module.

The present embodiment may further include a command processor configured to deliver commands of the central processing unit to the computing module.

The present embodiment may further include a memory controller configured to control data transfer and storage for each of the central processing unit and the computing module.

In the present embodiment, the processing unit may further include a router configured to relay communication between an external device and the processing unit, and a register file configured to have one or more registers, which are storage spaces used while a task is executed by the control flow engine.

In the present embodiment, the data memory may include a dynamic memory, and a given region of the dynamic memory may be configured to be allocated to a specific task and then released when the tasks, including the specific task, no longer use the data of the specific task.

In the present embodiment, the data flow engine may be configured to receive, from the instruction memory, the index of each of the tasks and information on the data associated with the tasks, to check whether the data required by each of the tasks is ready, and to transmit the indices of the tasks whose data is ready to the control flow engine in the order in which data preparation is completed.

In the present embodiment, the control flow engine may include: a fetch ready queue configured to receive, from the data flow engine, the indices of tasks whose data is ready; a fetch block configured to check whether the data required to execute the task corresponding to an index received from the fetch ready queue is present in the data memory and, if not, to load the data from outside into the data memory; a running ready queue configured to receive the indices of tasks for which the fetch block has completed data loading; a running block configured to receive, sequentially from the running ready queue, the indices of the tasks queued there; and a program counter configured to execute the one or more instructions of the task corresponding to the index transmitted to the running block.

Further, to solve the above technical problems, still another embodiment of the present invention provides a method of performing neural network processing using a neural network processing system that includes a central processing unit and a computing module having one or more processing units that perform processing according to commands of the central processing unit, each processing unit including an instruction memory configured to store tasks each having one or more instructions, a data memory configured to store data associated with the tasks, a data flow engine, a control flow engine, and an execution unit, the method comprising: (1) transmitting, by the central processing unit, a command to the computing module; and (2) performing, by the computing module, processing according to the command of the central processing unit, wherein step (2) includes: (2-1) checking, by the data flow engine, whether data is ready for each of the tasks and notifying the control flow engine that tasks are ready, in the order in which their data becomes ready; (2-2) controlling, by the control flow engine, execution of the tasks in the order notified by the data flow engine; and (2-3) performing, by the execution unit, operations according to the one or more instructions of the task whose execution the control flow engine controls.

In this embodiment, the processing unit may further include a data fetch unit, and the neural network processing method may further include, between step (2-2) and step (2-3), the data fetch unit fetching, from the data memory to the execution unit, the operand data for the one or more instructions of the task that the control flow engine controls to be executed.

In this embodiment, step (2) may further include, after step (2-3), the control flow engine storing the result data of the execution unit's operation in the data memory.

In this embodiment, the method may further include loading of the result data by a processing unit other than the processing unit whose data memory stores the result data.

In this embodiment, the method may further include, before step (1), the central processing unit initializing the state of the computing module and, after step (2), the central processing unit storing the result data of the computing module's operation in the storage of the central processing unit.

In this embodiment, step (2-1) may include: the data flow engine receiving, from the instruction memory, the index of each of the tasks and information on the data associated with the tasks; the data flow engine checking whether the data required by each of the tasks is ready; and the data flow engine transmitting the indices of the tasks whose data is ready to the control flow engine in the order in which data preparation is completed.
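The three processes of step (2-1) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware described in this specification: the class name `DataFlowEngine`, the methods `register_task` and `data_arrived`, and the representation of "data information" as a set of data names are all hypothetical.

```python
from collections import deque


class DataFlowEngine:
    """Toy model of step (2-1). All names are hypothetical."""

    def __init__(self):
        self.missing = {}           # task index -> names of data not yet ready
        self.ready_order = deque()  # indices in the order the tasks became ready

    def register_task(self, index, required_data):
        # Process 1: receive a task index and the information on its data.
        self.missing[index] = set(required_data)
        self._check(index)

    def data_arrived(self, name):
        # Process 2: re-check every waiting task against the newly ready data.
        for index in list(self.missing):
            self.missing[index].discard(name)
            self._check(index)

    def _check(self, index):
        # Process 3: when nothing is missing, pass the index on (modeled here
        # as a queue that stands in for the control flow engine).
        if not self.missing[index]:
            del self.missing[index]
            self.ready_order.append(index)


dfe = DataFlowEngine()
dfe.register_task(0, ["a", "b"])
dfe.register_task(1, ["b"])
dfe.register_task(2, ["a", "c"])
dfe.data_arrived("b")  # task 1 becomes ready first
dfe.data_arrived("a")  # task 0 becomes ready next; task 2 still waits for "c"
print(list(dfe.ready_order))
```

In this model, task 1 is forwarded before task 0 even though it was registered later, because indices are emitted in order of data readiness rather than in program order.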

In this embodiment, the control flow engine may further include a fetch ready queue, a fetch block, a running ready queue, a running block, and a program counter, and step (2-2) may include: the fetch ready queue receiving, from the data flow engine, the indices of tasks whose data is ready; when the fetch ready queue sequentially transmits the indices of the tasks to the fetch block, the fetch block checking whether the data required to execute the task corresponding to a received index is present in the data memory and, if it is not, loading the data from outside into the data memory; the running ready queue receiving the indices of tasks for which the fetch block has completed data loading; and, when the running ready queue sequentially transmits the indices of the tasks to the running block, the program counter executing one or more instructions of the task corresponding to the index transmitted to the running block.
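The queue-and-block pipeline of step (2-2) can be sketched as follows. This is an illustrative software model only. The class `ControlFlowEngine`, its method names, the dictionaries that stand in for the data memory, the external memory, and the instruction memory, and the `executed` trace are assumptions introduced for this example.

```python
from collections import deque


class ControlFlowEngine:
    """Toy model of step (2-2). All names are hypothetical."""

    def __init__(self, data_memory, external_memory, instruction_memory):
        self.data_memory = data_memory          # data name -> value
        self.external_memory = external_memory  # stands in for "outside"
        # task index -> (list of needed data names, list of instructions)
        self.instruction_memory = instruction_memory
        self.fetch_ready_queue = deque()
        self.running_ready_queue = deque()
        self.executed = []                      # trace of (task index, instruction)

    def notify_ready(self, index):
        # Called in the order in which the data flow engine reports readiness.
        self.fetch_ready_queue.append(index)

    def fetch_block(self):
        # Check the data memory; load from outside only what is absent.
        while self.fetch_ready_queue:
            index = self.fetch_ready_queue.popleft()
            needed, _ = self.instruction_memory[index]
            for name in needed:
                if name not in self.data_memory:
                    self.data_memory[name] = self.external_memory[name]
            self.running_ready_queue.append(index)

    def running_block(self):
        # The program counter steps through one task's instructions in order.
        while self.running_ready_queue:
            index = self.running_ready_queue.popleft()
            _, instructions = self.instruction_memory[index]
            for pc in range(len(instructions)):
                self.executed.append((index, instructions[pc]))


imem = {0: (["w0"], ["load", "mac", "store"]), 1: (["w1"], ["load", "add"])}
cfe = ControlFlowEngine({"w1": 2.0}, {"w0": 1.0}, imem)
cfe.notify_ready(1)
cfe.notify_ready(0)
cfe.fetch_block()
cfe.running_block()
print(cfe.executed)
```

Tasks are taken in notification order (task 1, then task 0), while the instructions inside each task run strictly in sequence.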

본 발명의 실시예들에 따르면, 인공 신경망에서의 프로세싱 실행 단위인 태스크들 각각은 다른 태스크들의 데이터 준비 완료 여부와 상관 없이, 각자 실행에 필요한 데이터가 준비되면 다음 프로세싱 절차를 수행할 수 있다. 이에 따라, 태스크들의 프로세싱을 담당하는 프로세싱 유닛의 사용 효율을 높일 수 있다.According to embodiments of the present invention, each of the tasks, which is a processing execution unit in the artificial neural network, may perform the next processing procedure when data required for execution is prepared, regardless of whether data preparation of other tasks is completed. Accordingly, it is possible to increase the efficiency of use of a processing unit in charge of processing tasks.

또한, 본 발명의 실시예들에 따르면, 데이터 준비가 완료된 태스크들의 인스트럭션들은 순차적으로 수행되도록 함으로써, 데이터 플로우 구조 방식에 따라 태스크들의 인스트럭션들을 실행할 때 발생하는 오버헤드를 방지할 수 있다.In addition, according to embodiments of the present invention, the instructions of tasks for which data preparation has been completed are sequentially executed, thereby preventing overhead incurred when executing the instructions of the tasks according to the data flow structure method.

In addition, according to embodiments of the present invention, a neural network processing unit and system that combine the data flow approach with the control flow approach can improve processing speed in artificial neural networks, where irregular operations occur frequently, and can thereby reduce the time required for the learning, training, and inference processes of deep learning.

본 발명의 실시예들에 따른 효과는 상기한 효과로 한정되는 것은 아니며, 상세한 설명 또는 특허청구범위에 기재된 구성으로부터 추론 가능한 모든 효과를 포함하는 것으로 이해되어야 한다.The effects according to the embodiments of the present invention are not limited to the above effects, and should be understood to include all effects that can be inferred from the configurations described in the detailed description or claims.

FIG. 1 is a diagram schematically illustrating the configuration of a neural network processing system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the detailed configuration of the processing unit shown in FIG. 1.
FIG. 3 is a flowchart illustrating the procedure of a neural network processing method using the neural network processing system shown in FIG. 1.
FIG. 4 is a flowchart illustrating the procedure of a neural network processing method using the processing unit shown in FIG. 2.
FIG. 5 is a diagram illustrating the instruction memory and the data memory shown in FIG. 2 in more detail.
FIG. 6 is a diagram illustrating an implementation example of the detailed procedures of the neural network processing method shown in FIG. 4.
FIG. 7 is a diagram illustrating the detailed configurations of the data flow engine and the control flow engine shown in FIG. 2.
FIG. 8 is a diagram illustrating the interaction between the control flow engine and the instruction memory shown in FIG. 2 and the interaction between the data fetch unit and the data memory.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments of the present invention may, however, be implemented in many different forms, and the embodiments disclosed in this specification are not limited to those described here. The accompanying drawings are intended only to make the disclosed embodiments easier to understand, and they do not limit the technical idea disclosed in this specification. The technical idea of the embodiments of the present invention should be understood to include all modifications, equivalents, and substitutes that fall within the technical idea and technical scope of the embodiments described here. Parts unrelated to the description are omitted from the drawings so that the embodiments can be described clearly; the size, form, and shape of each component shown in the drawings may be variously modified; and the same or similar parts are given the same or similar reference numerals.

이하의 설명에서 사용되는 구성요소에 대한 접미사 "모듈", "유닛", “프로세서”, “엔진” 및 “어레이” 등은 명세서 작성의 용이함만이 고려되어 부여되거나 혼용되는 것으로서, 그 자체로 서로 구별되는 의미 또는 역할을 갖는 것은 아니다. 또한, 본 명세서에 개시된 실시예들을 설명함에 있어서 관련된 공지 기술에 대한 구체적인 설명이 본 명세서에 개시된 실시예들의 요지를 흐릴 수 있다고 판단되는 경우 그 상세한 설명을 생략하였다.The suffixes "module", "unit", "processor", "engine", and "array" for the components used in the following description are given or used interchangeably in consideration of only the ease of writing the specification. It does not have a distinct meaning or role. In addition, in describing the embodiments disclosed in the present specification, when it is determined that a detailed description of related known technologies may obscure the gist of the embodiments disclosed in the present specification, the detailed description thereof has been omitted.

Throughout this specification, when a part is said to be "connected (joined, in contact, or coupled)" to another part, this includes not only the case where it is "directly connected (joined, in contact, or coupled)" but also the case where it is "indirectly connected (joined, in contact, or coupled)" with something else in between, and the case where it is "electrically connected (joined, in contact, or coupled)". In addition, when a part is said to "include (have or be provided with)" a component, this means that, unless specifically stated otherwise, it does not exclude other components but may further "include (have or be provided with)" other components.

본 명세서에서 사용된 용어는 단지 특정한 실시예를 설명하기 위해 사용된 것으로, 본 발명의 실시예들을 한정하려는 의도가 아니다. 여기에 기재된 단수의 표현은 문맥상 명백하게 다르게 뜻하지 않는 한, 복수의 표현을 포함하며, 분산되어 실시되는 구성요소들은 특별한 제한이 있지 않는 한 결합된 형태로 실시될 수도 있다. The terms used in this specification are only used to describe specific embodiments, and are not intended to limit the embodiments of the present invention. The singular expressions described herein include a plurality of expressions unless the context clearly indicates otherwise, and components implemented in a distributed manner may be implemented in a combined form unless there is a specific limitation.

또한, 본 명세서에서 사용되는 제1, 제2 등과 같이 서수를 포함하는 용어는 다양한 구성 요소들을 설명하는데 사용될 수 있지만, 상기 구성 요소들은 상기 용어들에 의해 한정되어서는 안 된다. 상기 용어들은 하나의 구성요소를 다른 구성요소로부터 구별하는 목적으로만 사용된다. 예를 들어, 본 발명의 실시예들의 기술 범위를 벗어나지 않으면서 제1구성요소는 제2구성요소로 명명될 수 있고, 유사하게 제2구성요소도 제1구성 요소로 명명될 수 있다. In addition, terms including ordinal numbers such as first and second used herein may be used to describe various elements, but the elements should not be limited by the terms. These terms are used only for the purpose of distinguishing one component from another component. For example, without departing from the technical scope of embodiments of the present invention, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component.

In addition, unless otherwise defined in this specification, all terms used here, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the specific embodiments disclosed herein belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in this specification, are not to be interpreted in an idealized, excessive, narrow, or formal sense.

도 1은 본 발명의 일 실시예에 따른 뉴럴 네트워크 프로세싱 시스템(100)(neural network processing system)의 구성을 개략적으로 도시한 도면이다.1 is a diagram schematically illustrating a configuration of a neural network processing system 100 according to an embodiment of the present invention.

도 1을 참조하면, 본 실시예에 따른 뉴럴 네트워크 프로세싱 시스템(100)은 중앙처리유닛(Central Processing Unit, CPU)(110)과 컴퓨팅 모듈(computing module)(160)을 포함한다.Referring to FIG. 1, a neural network processing system 100 according to the present embodiment includes a central processing unit (CPU) 110 and a computing module 160.

중앙처리유닛(110)은 컴퓨팅 모듈(160)을 포함한 시스템 내의 다른 구성요소에 다양한 명령을 내리는 호스트 역할 및 기능을 수행하도록 구성된다. 중앙처리유닛(110)은 저장소(120)와 연결되어 있을 수 있고, 내부에 별도의 저장소를 구비할 수도 있다. 수행 기능에 따라 중앙처리유닛(110)은 호스트(host)로, 중앙처리유닛(110)과 연결된 저장소(120)는 호스트 메모리(host memory)로 명명될 수 있다.The central processing unit 110 is configured to perform a host role and function that issues various commands to other components in the system including the computing module 160. The central processing unit 110 may be connected to the storage 120 and may have a separate storage therein. Depending on the function performed, the central processing unit 110 may be referred to as a host, and the storage 120 connected to the central processing unit 110 may be referred to as a host memory.

컴퓨팅 모듈(160)은 중앙처리유닛(110)의 명령을 전송 받아 연산과 같은 특정 기능을 수행하도록 구성된다. 또한, 컴퓨팅 모듈(160)은 중앙처리유닛(110)의 명령에 따라 인공 신경망에서의 프로세싱을 수행하도록 구성되는 적어도 하나 이상의 프로세싱 유닛(processing unit)(161)을 포함한다. 예를 들어, 컴퓨팅 모듈(160)은 프로세싱 유닛(161)을 4개 내지 4096개를 구비하고 있을 수 있으나, 반드시 이에 제한되는 것은 아니다. 즉, 컴퓨팅 모듈(160)은 4개 미만 또는 4096개 초과의 프로세싱 유닛(161)을 구비할 수도 있다. The computing module 160 is configured to receive a command from the central processing unit 110 and perform a specific function such as an operation. In addition, the computing module 160 includes at least one processing unit 161 configured to perform processing in an artificial neural network according to an instruction of the central processing unit 110. For example, the computing module 160 may include 4 to 4096 processing units 161, but is not limited thereto. That is, the computing module 160 may have less than 4 or more than 4096 processing units 161.

컴퓨팅 모듈(160)도 역시 저장소(170)와 연결되어 있을 수 있고, 내부에 별도의 저장소를 구비할 수도 있다. 컴퓨팅 모듈(160)이 구비하는 프로세싱 유닛(161)은 아래에서 도 2를 참조하여 더욱 상세히 설명하도록 한다.The computing module 160 may also be connected to the storage 170 and may have a separate storage therein. The processing unit 161 included in the computing module 160 will be described in more detail with reference to FIG. 2 below.

The storages 120 and 170 may be DRAM, but are not limited thereto, and may be implemented in any form capable of storing data.

Referring again to FIG. 1, the neural network processing system 100 may further include a host interface (Host I/F) 130, a command processor 140, and a memory controller 150.

호스트 인터페이스(130)는 중앙처리유닛(110)과 컴퓨팅 모듈(160)을 연결하도록 구성되며, 중앙처리유닛(110)과 컴퓨팅 모듈(160)간의 통신이 수행되도록 한다. The host interface 130 is configured to connect the central processing unit 110 and the computing module 160, and allows communication between the central processing unit 110 and the computing module 160 to be performed.

커맨드 프로세서(140)는 호스트 인터페이스(130)를 통해 중앙처리유닛(110)으로부터 명령을 수신하여 컴퓨팅 모듈(160)에 전달하도록 구성된다.The command processor 140 is configured to receive a command from the central processing unit 110 through the host interface 130 and transmit the command to the computing module 160.

The memory controller 150 is configured to control data transfer and data storage for each of the central processing unit 110 and the computing module 160, or between them. For example, the memory controller 150 may control the storing of operation results of the processing unit 161 in the storage 170 of the computing module 160.

Specifically, the host interface 130 may include a control status register. Using the control status register, the host interface 130 provides the central processing unit 110 with status information on the computing module 160 and provides an interface through which commands can be delivered to the command processor 140. For example, the host interface 130 may generate PCIe packets for transmitting data to the central processing unit 110 and deliver them to their destination, or may deliver packets received from the central processing unit 110 to a designated location.

호스트 인터페이스(130)는 패킷을 중앙처리유닛(110)의 개입 없이 대량으로 전송하기 위해 DMA (Direct Memory Access) 엔진을 구비하고 있을 수 있다. 또한, 호스트 인터페이스(130)는 커맨드 프로세서(140)의 요청에 의해 저장소(120)에서 대량의 데이터를 읽어 오거나 저장소(120)로 데이터를 전송할 수 있다. The host interface 130 may be equipped with a direct memory access (DMA) engine to transmit packets in large quantities without intervention of the central processing unit 110. In addition, the host interface 130 may read a large amount of data from the storage 120 or transmit data to the storage 120 at the request of the command processor 140.

또한, 호스트 인터페이스(130)는 PCIe 인터페이스를 통해 접근 가능한 컨트롤 상태 레지스터를 구비할 수 있다. 본 실시예에 따른 시스템의 부팅 과정에서 호스트 인터페이스(130)는 시스템의 물리적인 주소를 할당 받게 된다(PCIe enumeration). 호스트 인터페이스(130)는 할당 받은 물리적인 주소 중 일부를 통해 컨트롤 상태 레지스터에 로드, 저장 등의 기능을 수행하는 방식으로 레지스터의 공간을 읽거나 쓸 수 있다. 호스트 인터페이스(130)의 레지스터들에는 호스트 인터페이스(130), 커맨드 프로세서(140), 메모리 컨트롤러(150), 그리고, 컴퓨팅 모듈(160)의 상태 정보가 저장되어 있을 수 있다. In addition, the host interface 130 may include a control status register accessible through a PCIe interface. During the booting process of the system according to the present embodiment, the host interface 130 is assigned a physical address of the system (PCIe enumeration). The host interface 130 may read or write the space of the register in a manner that performs functions such as loading and storing in the control status register through some of the allocated physical addresses. Registers of the host interface 130 may store state information of the host interface 130, the command processor 140, the memory controller 150, and the computing module 160.

도 1에서는 메모리 컨트롤러(150)가 중앙처리유닛(110)과 컴퓨팅 모듈(160)의 사이에 위치하지만, 반드시 그러한 것은 아니다. 예컨대, 중앙처리유닛(110)과 컴퓨팅 모듈(160)은 각기 다른 메모리 컨트롤러를 보유하거나, 각각 별도의 메모리 컨트롤러와 연결되어 있을 수 있다.In FIG. 1, the memory controller 150 is located between the central processing unit 110 and the computing module 160, but this is not necessarily the case. For example, the central processing unit 110 and the computing module 160 may have different memory controllers or may be connected to separate memory controllers.

In the neural network processing system 100 described above, a specific job such as image recognition may be written as software, stored in the storage 120, and executed by the central processing unit 110. During program execution, the central processing unit 110 may load the weights of the neural network from a separate storage device (HDD, SSD, etc.) into the storage 120, and then load them from there into the storage 170 of the computing module 160. Similarly, the central processing unit 110 may read image data from a separate storage device, load it into the storage 120, perform some conversion, and then store it in the storage 170 of the computing module 160.

이후, 중앙처리유닛(110)은 컴퓨팅 모듈(160)의 저장소(170)에서 가중치와 이미지 데이터를 읽어 딥러닝의 추론 과정을 수행하도록 컴퓨팅 모듈(160)에 명령을 내릴 수 있다. 컴퓨팅 모듈(160)의 각 프로세싱 유닛(161)은 중앙처리유닛(110)의 명령에 따라 프로세싱을 수행할 수 있다. 추론 과정이 완료된 뒤 결과는 저장소(170)에 저장될 수 있다. 중앙처리유닛(110)은 해당 결과를 저장소(170)에서 저장소(120)로 전송하도록 커맨드 프로세서(140)에 명령을 내릴 수 있으며, 최종적으로 사용자가 사용하는 소프트웨어로 그 결과를 전송할 수 있다. Thereafter, the central processing unit 110 may instruct the computing module 160 to read weights and image data from the storage 170 of the computing module 160 and perform a deep learning inference process. Each processing unit 161 of the computing module 160 may perform processing according to an instruction of the central processing unit 110. After the inference process is completed, the result may be stored in the storage 170. The central processing unit 110 may issue a command to the command processor 140 to transmit the corresponding result from the storage 170 to the storage 120, and finally transmit the result to software used by the user.
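The host-side sequence described above (weights and image data moving from a separate storage device to the storage 120 and then to the storage 170, inference on the computing module, and the result returning to the storage 120) can be sketched in software. The example below is a simplified illustration under stated assumptions: plain dictionaries stand in for the storage devices, and the function names `run_inference` and `weighted_sum`, the pixel scaling used as the "conversion", and the weighted sum used as the "inference" are all hypothetical placeholders.

```python
def run_inference(disk, host_memory, device_memory, compute):
    """Toy model of the host-driven flow. All names are hypothetical."""
    # Weights: separate storage device -> storage 120 -> storage 170.
    host_memory["weights"] = disk["weights"]
    device_memory["weights"] = host_memory["weights"]

    # Image: separate storage device -> storage 120 -> conversion -> storage 170.
    host_memory["image"] = disk["image"]
    device_memory["image"] = [p / 255.0 for p in host_memory["image"]]

    # The host commands the computing module; the result stays in storage 170.
    device_memory["result"] = compute(device_memory["weights"],
                                      device_memory["image"])

    # The host has the result transferred from storage 170 to storage 120.
    host_memory["result"] = device_memory["result"]
    return host_memory["result"]


def weighted_sum(weights, image):
    # Placeholder for the inference carried out by the processing units.
    return sum(w * x for w, x in zip(weights, image))


disk = {"weights": [0.5, 2.0], "image": [255, 0]}
host, device = {}, {}
result = run_inference(disk, host, device, weighted_sum)
print(result)
```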

도 2는 도 1에 도시된 프로세싱 유닛(161)의 세부 구성을 더욱 구체적으로 설명하기 위한 도면이다. FIG. 2 is a diagram illustrating a detailed configuration of the processing unit 161 shown in FIG. 1 in more detail.

Referring to FIG. 2, the processing unit 200 according to this embodiment includes an instruction memory 210, a data memory 220, a data flow engine 240, a control flow engine 250, and an execution unit (functional unit) 280. The processing unit 200 may further include a router 230, a register file 260, and a data fetch unit 270.

인스트럭션 메모리(210)는 하나 이상의 태스크(task)를 저장하도록 구성된다. 태스크는 하나 이상의 인스트럭션(instruction)으로 구성될 수 있다. 인스트럭션은 명령어 형태의 코드(code)일 수 있으나, 반드시 이에 제한되는 것은 아니다. 인스트럭션은 컴퓨팅 모듈(160)과 연결된 저장소(170), 컴퓨팅 모듈(160) 내부에 마련된 저장소 및 중앙처리유닛(110)과 연결된 저장소(120)에 저장될 수도 있다. The instruction memory 210 is configured to store one or more tasks. A task may consist of one or more instructions. The instruction may be a code in the form of an instruction, but is not limited thereto. Instructions may be stored in a storage 170 connected to the computing module 160, a storage provided inside the computing module 160, and a storage 120 connected to the central processing unit 110.

A task, as described in this specification, is an execution unit of a program executed on the processing unit 200, and an instruction is an element, in the form of a computer instruction, that makes up a task. A single node in an artificial neural network performs a complex operation such as $f(\sum_i w_i \times x_i)$, and such an operation may be divided among several tasks. For example, a single task may perform all of the operations of one node in the artificial neural network, or a single task may perform the operations of multiple nodes. The commands for performing such operations may be composed of instructions.
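As a small worked example of how one node's operation f(Σ w_i × x_i) might be divided among tasks, the sketch below splits the weighted sum into two partial-sum tasks followed by one activation task, and compares the outcome with computing the node in a single task. The function names, the particular split, and the choice of ReLU for f are assumptions made for illustration; the specification does not prescribe them.

```python
def partial_sum_task(weights, inputs):
    # One task: multiply-accumulate over its share of the node's inputs.
    return sum(w * x for w, x in zip(weights, inputs))


def activation_task(partials, f):
    # Final task: combine the partial sums and apply the activation f.
    return f(sum(partials))


def relu(z):
    # ReLU is used here only as an example of f.
    return max(0.0, z)


weights = [0.5, -1.0, 2.0, 1.0]
inputs = [2.0, 1.0, 1.5, -1.0]

# The node's operation divided among three tasks.
partials = [
    partial_sum_task(weights[:2], inputs[:2]),
    partial_sum_task(weights[2:], inputs[2:]),
]
split_result = activation_task(partials, relu)

# The same node computed by a single task.
single_result = relu(partial_sum_task(weights, inputs))

print(split_result, single_result)
```

Because each partial-sum task depends only on its own slice of the inputs, each can start as soon as that slice is available, independently of the other.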

이해의 편의를 위해 태스크가 복수개의 인스트럭션들로 구성되고 각 인스트럭션은 컴퓨터 명령어 형태의 코드로 구성되는 경우를 예로 든다. 이 예시에서 하기한 컨트롤 플로우 엔진(250)의 프로그램 카운터(255)는 태스크가 보유한 복수개의 인스트럭션들을 순차적으로 실행하여 각 인스트럭션의 코드를 분석한다. 본 명세서에서는 이러한 과정을 통틀어서 “태스크를 실행한다” 또는 “태스크의 인스트럭션을 실행한다”라고 표현한다. 또한, 프로그램 카운터(255)에 의해 분석된 코드에 따른 수학적 연산은 하기한 수행 유닛(280)이 수행할 수 있으며, 수행 유닛(280)이 수행하는 작업을 본 명세서에서는 “연산”이라고 표현한다.For ease of understanding, a case is taken where a task is composed of a plurality of instructions, and each instruction is composed of code in the form of computer instructions. In this example, the program counter 255 of the control flow engine 250 described below sequentially executes a plurality of instructions held by the task and analyzes the code of each instruction. In this specification, these processes are collectively expressed as "execute a task" or "execute an instruction of a task". In addition, the mathematical operation according to the code analyzed by the program counter 255 can be performed by the following execution unit 280, and the operation performed by the execution unit 280 is referred to herein as “operation”.

데이터 메모리(220)는 태스크들과 연관된 데이터(data)를 저장하도록 구성된다. 여기서, 태스크들과 연관된 데이터는 태스크의 실행 또는 태스크의 실행에 따른 연산에 사용되는 입력 데이터, 출력 데이터, 가중치 또는 활성화값(activations)일 수 있으나, 반드시 이에 제한되는 것은 아니다.The data memory 220 is configured to store data associated with tasks. Here, the data associated with the tasks may be input data, output data, weights, or activations used for execution of a task or an operation according to execution of the task, but is not limited thereto.

라우터(230)는 뉴럴 네트워크 프로세싱 시스템(100)을 구성하는 구성요소들 간의 통신을 수행하도록 구성되며, 뉴럴 네트워크 프로세싱 시스템(100)을 구성하는 구성요소들 간의 중계 역할을 수행한다. 예컨대, 라우터(230)는 프로세싱 유닛들 간의 통신 또는 커맨드 프로세서(140)와 메모리 컨트롤러(150) 사이의 통신을 중계할 수 있다. 이러한 라우터(230)는 네트워크 온 칩(Network on Chip, NOC) 형태로 프로세싱 유닛(200) 내에 마련될 수 있다.The router 230 is configured to perform communication between the components constituting the neural network processing system 100 and serves as a relay between the components constituting the neural network processing system 100. For example, the router 230 may relay communication between processing units or between the command processor 140 and the memory controller 150. The router 230 may be provided in the processing unit 200 in the form of a network on chip (NOC).

데이터 플로우 엔진(240)은 태스크들을 대상으로 데이터 준비 여부를 체크하고, 데이터 준비가 완료된 태스크의 순서대로 상기 태스크들의 준비 완료 여부를 컨트롤 플로우 엔진(250)에 통지하도록 구성된다.The data flow engine 240 is configured to check whether data is ready for tasks, and to notify the control flow engine 250 of whether the tasks are ready in the order of tasks for which data is ready.

컨트롤 플로우 엔진(250)은 데이터 플로우 엔진(240)으로부터 통지 받은 순서대로 태스크의 실행을 제어하도록 구성된다. 또한, 컨트롤 플로우 엔진(250)은 태스크들의 인스트럭션을 실행함에 따라 발생하는 더하기, 빼기, 곱하기, 나누기와 같은 계산을 수행할 수도 있다. The control flow engine 250 is configured to control execution of tasks in an order notified from the data flow engine 240. In addition, the control flow engine 250 may perform calculations such as addition, subtraction, multiplication, and division that occur as an instruction of tasks is executed.

데이터 플로우 엔진(240)과 컨트롤 플로우 엔진(250)에 대해서는 아래에서 더욱 상세하게 설명하도록 한다.The data flow engine 240 and the control flow engine 250 will be described in more detail below.

레지스터 파일(260)은 프로세싱 유닛(200)에 의해 빈번하게 사용되는 저장 공간으로서, 프로세싱 유닛(200)에 의해 코드들이 실행되는 과정에서 사용되는 레지스터를 하나 이상 포함한다.The register file 260 is a storage space frequently used by the processing unit 200 and includes one or more registers used in the process of executing codes by the processing unit 200.

The data fetch unit 270 is configured to fetch, from the data memory 220 to the execution unit 280, the operand data for the one or more instructions of the task that the control flow engine 250 controls to be executed. The data fetch unit 270 may also fetch the same or different operand data to each of the plurality of computation modules 281 of the execution unit 280.

The execution unit 280 is configured to perform operations according to the one or more instructions of the task that the control flow engine 250 controls to be executed, and includes one or more computation modules 281 that perform the actual operations. Each computation module 281 is configured to perform mathematical operations such as addition, subtraction, multiplication, and multiply-and-accumulate (MAC). The execution unit 280 may be formed with the computation modules 281 arranged at a specific unit interval or in a specific pattern. When the computation modules 281 are arranged in an array in this way, they can operate in parallel and process operations such as complex matrix operations at once.
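The idea of an array of computation modules working on a matrix operation can be sketched as below. This is a sequential software model, so the inner loop only imitates what would be a single parallel step in hardware. The class `MacModule`, the function `matvec`, and the mapping of one matrix row to one module are assumptions made for this example.

```python
class MacModule:
    """Toy computation module with one accumulator (hypothetical)."""

    def __init__(self):
        self.acc = 0.0

    def mac(self, a, b):
        # Multiply-and-accumulate: acc <- acc + a * b.
        self.acc += a * b


def matvec(matrix, vector):
    # One module per output element (one per matrix row).
    modules = [MacModule() for _ in matrix]
    for k, x in enumerate(vector):
        # The same operand x is supplied to every module, while each module
        # receives a different matrix element. In hardware, all modules would
        # perform this step at the same time.
        for module, row in zip(modules, matrix):
            module.mac(row[k], x)
    return [module.acc for module in modules]


out = matvec([[1.0, 2.0], [3.0, 4.0]], [10.0, 1.0])
print(out)
```

The model also shows why a data fetch unit may deliver the same operand to some modules and different operands to others: the vector element is shared, while the matrix elements differ per module.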

Although FIG. 2 shows the execution unit 280 as separate from the control flow engine 250, the processing unit 200 may also be implemented with the execution unit 280 included in the control flow engine 250.

수행 유닛(280)의 연산에 따른 결과 데이터는 컨트롤 플로우 엔진(250)에 의해 데이터 메모리(220)에 저장될 수 있다. 여기서, 데이터 메모리(220)에 저장된 결과 데이터는 해당 데이터 메모리를 포함하는 프로세싱 유닛과 다른 프로세싱 유닛의 프로세싱에 사용될 수 있다. 예컨대, 제1 프로세싱 유닛의 수행 유닛의 연산에 따른 결과 데이터는 제1 프로세싱 유닛의 데이터 메모리에 저장될 수 있고, 제1 프로세싱 유닛의 데이터 메모리에 저장된 결과 데이터는 제2 프로세싱 유닛에 이용될 수 있다.The result data according to the operation of the execution unit 280 may be stored in the data memory 220 by the control flow engine 250. Here, the result data stored in the data memory 220 may be used for processing in a processing unit different from the processing unit including the data memory. For example, result data according to an operation of the execution unit of the first processing unit may be stored in the data memory of the first processing unit, and result data stored in the data memory of the first processing unit may be used in the second processing unit. .
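The reuse of one processing unit's result by another can be pictured with a short model. The class `ProcessingUnit`, its `execute` and `load_from` methods, and the direct dictionary copy that stands in for a transfer between data memories are hypothetical; the actual transfer path between units is not modeled here.

```python
class ProcessingUnit:
    """Toy processing unit with its own data memory (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data_memory = {}

    def execute(self, key, op, *operands):
        # Store the result of an operation in this unit's own data memory.
        self.data_memory[key] = op(*operands)

    def load_from(self, other, key):
        # Load result data held in another unit's data memory.
        self.data_memory[key] = other.data_memory[key]


pu1 = ProcessingUnit("first")
pu2 = ProcessingUnit("second")

pu1.execute("y1", lambda a, b: a * b, 3, 4)   # result stored in the first unit
pu2.load_from(pu1, "y1")                      # second unit loads that result
pu2.execute("y2", lambda y: y + 1, pu2.data_memory["y1"])
print(pu2.data_memory["y2"])
```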

상술한 뉴럴 네트워크 프로세싱 시스템(100) 및 여기에 포함되는 프로세싱 유닛(200)을 이용하여 인공 신경망에서의 데이터 처리 장치 및 방법, 인공 신경망에서의 연산 장치 및 방법을 구현할 수 있다. Using the above-described neural network processing system 100 and the processing unit 200 included therein, an apparatus and method for processing data in an artificial neural network, and an apparatus and method for computing in an artificial neural network may be implemented.

FIG. 3 is a flowchart illustrating a neural network processing method performed by the neural network processing system 100 and the processing unit 200 included therein, according to an embodiment of the present invention.

Referring to FIG. 3, the neural network processing method according to this embodiment may include: a step (s310) in which the central processing unit 110 initializes the state of the computing module 160; a step (s320) in which the central processing unit 110 sends a command to a specific component so that the computing module 160 performs a specific function; a step (s330) in which the computing module 160 performs processing according to the command of the central processing unit 110; and a step (s340) in which the central processing unit 110 stores the result held in the storage 170 of the computing module 160 into the storage 120. Steps s310 to s340 are described in more detail below.

According to the neural network processing method of this embodiment, the central processing unit 110 may first initialize the state of the computing module 160 (s310). To do so, the central processing unit 110 may instruct the command processor 140, through the host interface 130, to initialize the state of the computing module 160.

Next, the central processing unit 110 may send a command to a specific component so that the computing module 160 performs a specific function (s320). For example, the central processing unit 110 may send the command processor 140 a command that causes one or more instructions in the storage 120 to be loaded into the instruction memory 210 of one or more processing units 200.

Here, the central processing unit 110 may instruct the command processor 140 to store data from the storage 120 into the storage 170 of the computing module 160 through the memory controller 150. The central processing unit 110 may also pass at least one of one or more weights and the addresses of input values to a task stored in the instruction memory 210 of the processing unit 200, thereby commanding the computing module 160 to start processing. The one or more weights and the addresses of the input values may be stored in the storage 170 of the computing module 160 or in the storage 120 of the central processing unit 110.

Next, the computing module 160 performs processing according to the command of the central processing unit 110 (s330). Here, the processing unit 200 may perform data processing and operations according to the one or more instructions it has been given, and may store the result in the storage 170 of the computing module 160 through the memory controller 150.

Next, the central processing unit 110 stores the result held in the storage 170 of the computing module 160 into the storage 120 (s340). This may be done by the central processing unit 110 issuing a command to the command processor 140.

Steps s310 to s340 described above need not be executed sequentially; their order may differ from the above, and they may proceed almost simultaneously.

In addition, the step (s310) in which the central processing unit 110 initializes the state of the computing module 160 and the step (s340) in which the central processing unit 110 stores the result held in the storage 170 of the computing module 160 into the storage 120 are not essential, so the neural network processing method according to this embodiment may be implemented with only steps s320 and s330.

FIG. 4 is a flowchart illustrating a neural network processing method using the processing unit 200 shown in FIG. 2.

Referring to FIGS. 2 and 4, the neural network processing method according to this embodiment includes: a step (s410) in which the data flow engine 240 checks whether data is ready for the tasks and notifies the control flow engine 250 that tasks are ready, in the order in which their data preparation is completed; a step (s420) in which the control flow engine 250 controls the tasks to be executed in the order notified by the data flow engine 240; and a step (s440) in which the execution unit 280 performs operations according to one or more instructions of the task that the control flow engine 250 controls to be executed.

The method may further include, between steps s420 and s440, a step (s430) in which the data fetch unit 270 fetches, from the data memory 220 to the execution unit 280, the operation target data for the one or more instructions of the task that the control flow engine 250 controls to be executed, and, after step s440, a step (s450) in which the control flow engine 250 stores the result data from the operations of the execution unit 280 in the data memory 220.

Specifically, as in step s410, once a task has received from other components all the data (e.g., operands) needed to execute its instructions and is therefore ready to execute, it can proceed to its next processing stage in the data flow manner, regardless of whether the data of other tasks is ready.

In contrast, as in step s420 described above, the instructions of each task may be executed sequentially by the program counter 255 of the control flow engine 250, in the von Neumann manner (control flow manner).

The data flow manner described above means that the elements performing processing are not bound to any particular order, and each executes its operation as soon as the data it needs is ready. The von Neumann manner (control flow manner) means that the entities performing operations carry them out in a predetermined order.

For ease of understanding, consider a case with three tasks: a task with index 0, a task assigned index 1, and a task assigned index 2. In this example, when the data required to execute task 2 is ready, the data flow engine 240, which uses the data flow manner, sends the index of task 2 to the control flow engine 250 regardless of whether the data required to execute task 0 and task 1 is ready. In the same example, if the control flow engine 250, which uses the von Neumann manner, receives the index of task 2, the index of task 1, and the index of task 0 from the data flow engine 240 in that order, it executes task 2 first, task 1 second, and task 0 last.
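The three-task example can be sketched in Python as follows. This is a minimal model under stated assumptions, not the embodiment's implementation: the operand names, the arrival sequence, the instruction mnemonics, and the helper names `notify_order` and `execute_in_order` are all hypothetical.

```python
from collections import deque


def notify_order(required, arrivals):
    """Data flow side: return task indices in the order their data became complete."""
    pending = {idx: set(ops) for idx, ops in required.items()}
    notified = deque()
    for idx, operand in arrivals:
        pending[idx].discard(operand)
        if not pending[idx] and idx not in notified:
            notified.append(idx)  # ready, regardless of the other tasks
    return notified


def execute_in_order(notified, tasks):
    """Control flow side: run tasks in notified order, instructions in sequence."""
    trace = []
    while notified:
        idx = notified.popleft()
        for instruction in tasks[idx]:
            trace.append((idx, instruction))
    return trace


# Hypothetical operands and arrival sequence: task 2 completes first, task 0 last.
required = {0: {"a", "b"}, 1: {"c"}, 2: {"d"}}
arrivals = [(0, "a"), (2, "d"), (1, "c"), (0, "b")]
tasks = {0: ["mul", "add"], 1: ["add"], 2: ["mul"]}

order = notify_order(required, arrivals)
order_list = list(order)
trace = execute_in_order(order, tasks)
```

Readiness is decided per task as operands arrive, while execution within each task stays strictly sequential.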

In this regard, it is natural for a neural network, which includes many nodes performing separate operations, to adopt a processing structure based on the data flow manner, which favors parallel processing, in order to increase computational efficiency. Unlike the von Neumann manner, applying the data flow manner to neural network operations enables irregular parallel processing and thus raises computational efficiency. However, if the processing structure of the neural network is designed purely in the data flow manner, then for operations that are better executed sequentially in the von Neumann manner, considerable time is spent checking for each operation whether its data is ready, which introduces unnecessary overhead.

Therefore, in embodiments of the present invention, the data flow engine 240 uses the data flow manner: once all the data required by a particular task is ready, it sends that task's information to the control flow engine 250 regardless of the readiness or order of other tasks. As a result, whether a single task can be executed is determined independently of whether the data of other tasks is ready, so that, from the perspective of overall processing, the progress and execution of the tasks can be sped up.

Also, in embodiments of the present invention, the control flow engine 250 uses the control flow manner so that the instructions within each task are executed in the order of the task information received from the data flow engine 240. Using the control flow engine 250 of this embodiment therefore avoids the overhead that would arise if the instructions within a task were executed in the data flow manner, with the data readiness of every instruction checked individually.

Thus, according to an embodiment of the present invention, performing operations in an artificial neural network with a mix of the data flow manner and the control flow manner overcomes the limitation of the control flow manner, which is ill suited to irregular operations, and eliminates the overhead that can arise in the data flow manner. This can improve processing speed in the artificial neural network and reduce the time consumed in the learning, training, and inference processes of deep learning.

Steps s410 to s450 shown in FIG. 4 may correspond to detailed processes included in the step (s330), shown in FIG. 3, in which the computing module 160 performs processing according to the command of the central processing unit 110.

FIG. 5 is a diagram illustrating the instruction memory 210 and the data memory 220 shown in FIG. 2 in more detail.

As described with reference to FIG. 2, the instruction memory 210 is a space in which one or more instructions are stored, and it may store tasks 211 each composed of a plurality of instructions.

The data memory 220 includes a dynamic memory 221 and a static memory 222. The dynamic memory 221 is memory that is dynamically allocated to the task currently being executed by the control flow engine 250. The static memory 222 is memory that is allocated when a program is loaded into the data memory 220, independently of whether tasks are executed and completed.

Meanwhile, each of the tasks 211 uses a certain dynamic memory space allocated from the dynamic memory 221 of the data memory 220. Each of the tasks 211 may also be allocated part of the static memory 222 for data that must be used without being released.

A task using dynamically allocated memory space can pass the data in that space to other tasks so that they can use it as well, in which case the reference count increases. The reference count is the number of tasks that reference and use the data of the task using the dynamically allocated memory. The reference count decreases when a task that received the data has finished using it (that is, the task's operation is complete) or has copied the data.

For example, a task may be allocated the dynamic memory space it needs from the dynamic memory 221 of the data memory 220 by the control flow engine 250, and the dynamically allocated memory may be passed to another task by value or by reference. After using the memory, the task that received it sends an acknowledgment (ack) to the task that was originally allocated the memory from the dynamic memory 221, which decrements the reference count. The data memory 220 can then release the memory dynamically allocated to the task when the reference count reaches 0.

In this way, a given space of the dynamic memory 221 may be configured to be allocated to a specific task and then released when the tasks, including that specific task, no longer use the data of the specific task.
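The reference-counted release described above might be modeled as in the following Python sketch. The class and method names are hypothetical, and the sketch assumes that the owning task itself holds one reference from the moment of allocation, which the description implies but does not state explicitly.

```python
class DynamicMemory:
    """Toy model of the dynamic memory 221 with a reference count per block."""

    def __init__(self):
        self.ref_counts = {}  # block id -> number of tasks still using the block
        self.owners = {}      # block id -> index of the task that allocated it
        self._next_id = 0

    def allocate(self, owner_task):
        # Assumption: the owning task counts as the first reference.
        block = self._next_id
        self._next_id += 1
        self.ref_counts[block] = 1
        self.owners[block] = owner_task
        return block

    def share(self, block):
        """The owner passes the block to another task: the count goes up."""
        self.ref_counts[block] += 1

    def ack(self, block):
        """A task is done with the block; returns True if the block was released."""
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            del self.ref_counts[block]
            del self.owners[block]
            return True
        return False


mem = DynamicMemory()
block = mem.allocate(owner_task=0)
mem.share(block)                        # passed to task 1
mem.share(block)                        # passed to task 2
released_after_task1 = mem.ack(block)   # task 1 finishes
released_after_task2 = mem.ack(block)   # task 2 finishes
released_after_owner = mem.ack(block)   # owning task finishes last
```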

FIG. 6 is a diagram illustrating an example implementation of the detailed procedures of the neural network processing method shown in FIG. 4.

Referring to FIGS. 4 and 6, step s410 shown in FIG. 4 may include: a process (s601) in which the data flow engine 240 receives, from the instruction memory 210, the index of each of the tasks and information on the data associated with the tasks; a process (s602) in which the data flow engine 240 checks whether the data required by each of the tasks is ready; and a process (s603) in which the data flow engine 240 sends the indices of the tasks whose data is ready to the control flow engine 250, in the order in which their data preparation was completed.

Step s420, which follows, may include: a process (s604) in which the ready-to-fetch queue 251 receives from the data flow engine 240 the indices of the tasks whose data is ready; a process (s605) in which, when the ready-to-fetch queue 251 sends the task indices to the fetch block 252 in order, the fetch block 252 checks whether the data needed to execute the task corresponding to the received index exists in the data memory 220 and, if not, loads the data from outside into the data memory 220; a process (s606) in which the ready-to-run queue 253 receives the index of a task whose data load has been completed in the fetch block 252; a process (s607) in which the ready-to-run queue 253 sends the task indices to the running block 254 in order; and a process (s608) in which the program counter 255 executes one or more instructions of the task corresponding to the index sent to the running block 254. In process s607, the running block 254 receives, in order, the indices of the tasks held in the ready-to-run queue 253; then, in process s608, the program counter 255 executes one or more instructions of the task corresponding to the index sent to the running block 254.

Referring to the neural network processing method shown in FIG. 4, after processes s601 to s608 described above, the following steps may proceed: the step (s430) in which the data fetch unit 270 fetches, from the data memory 220 to the execution unit 280, the operation target data for the one or more instructions of the task that the control flow engine 250 controls to be executed; the step (s440) in which the execution unit 280 performs operations according to the one or more instructions of that task; and the step (s450) in which the control flow engine 250 stores the result data from the operations of the execution unit 280 in the data memory 220.

In addition, a processing unit other than the one containing the data memory in which the result data of the execution unit 280 is stored may load that result data and use it for its own processing.

FIG. 7 is a diagram illustrating the detailed configurations of the data flow engine 240 and the control flow engine 250 shown in FIG. 2 in more detail.

Referring to FIGS. 6 and 7 together, the data flow engine 240 manages the states of the tasks (index 0, 1, 2, ...). For example, the data flow engine 240 receives, from the instruction memory 210, the index of each of the tasks and information on the data associated with the tasks (s601), and checks whether the data required by each of the tasks is ready (s602). Here, the data flow engine 240 may receive the addresses of the data required to execute the tasks from a separate storage.

Next, when all the data required within a task is ready, the data flow engine 240 sends the information of that ready task to the ready-to-fetch queue 251 in the control flow engine 250 (s603). Here, the data flow engine 240 may send only the index of the ready task to the ready-to-fetch queue 251.

Next, the control flow engine 250 processes the tasks in the ready-to-fetch queue 251 in first-in, first-out (FIFO) order. The control flow engine 250 sends the index of the task at the front of the ready-to-fetch queue 251 (for example, the task placed in the lowest slot of the queue in FIG. 7) to the fetch block 252. The control flow engine 250 is then allocated, from the dynamic memory 221 of the data memory 220, the memory required by the task corresponding to the index sent to the fetch block 252. Next, the control flow engine 250 performs the data load for the task to which that space of the dynamic memory 221 has been allocated (s604). Here, the data load means that the control flow engine 250 checks whether the data the task needs for execution is in the data memory 220 and retrieves it, or, if the data is not in the data memory 220, loads it into the data memory 220 from an external device (another processing unit, a storage, or the like).

Next, the control flow engine 250 sends the index of the task whose data load has been completed in the fetch block 252 to the ready-to-run queue 253 (s605). The ready-to-run queue 253 then passes the indices of the tasks it holds to the running block 254 in order (s606). The one or more instructions of the task corresponding to the index sent to the running block 254 are executed sequentially by the program counter 255 in the von Neumann manner (s607).
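The path from the ready-to-fetch queue through the fetch block and the ready-to-run queue to the running block can be sketched in Python as follows. The sketch is a simplified software model: the function name, the dictionary layout of a task, and the `external` store are assumptions, and the two stages run back to back here although in hardware they may overlap.

```python
from collections import deque


def control_flow_pipeline(ready_indices, tasks, data_memory, external):
    """Model ready-to-fetch queue -> fetch block -> ready-to-run queue -> running block."""
    fetch_ready = deque(ready_indices)  # FIFO of indices from the data flow engine
    run_ready = deque()
    log = []

    # Fetch block: make sure every needed item is present in the data memory.
    while fetch_ready:
        idx = fetch_ready.popleft()
        for key in tasks[idx]["needs"]:
            if key not in data_memory:
                data_memory[key] = external[key]  # load from an external device
        run_ready.append(idx)

    # Running block: execute each task's instructions strictly in sequence.
    while run_ready:
        idx = run_ready.popleft()
        for instruction in tasks[idx]["instructions"]:
            log.append((idx, instruction))
    return log


# Hypothetical tasks; task 1 became ready before task 0.
tasks = {
    0: {"needs": ["w0", "x0"], "instructions": ["mul", "add"]},
    1: {"needs": ["x1"], "instructions": ["add"]},
}
data_memory = {"w0": 1.0}
external = {"x0": 2.0, "x1": 3.0}
log = control_flow_pipeline([1, 0], tasks, data_memory, external)
```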

Because many tasks can be assigned to the ready-to-fetch queue 251 and the ready-to-run queue 253 shown in FIG. 7, each of these queues may have from 1 to 128 or more blocks to which tasks are assigned.

For ease of understanding, consider a task with index 0 that contains a number of instructions made up of command-style codes (multiply, add, and so on), with the instructions of task 0 stored at addresses 1000, 1004, 1008, and so on of the instruction memory 210.

In this example, when all the data required by task 0 is ready, the data flow engine 240 sends the index of task 0 to the control flow engine 250. The control flow engine 250 can then determine, from information already known to the data flow engine 240, that the instructions of task 0 are located at addresses 1000, 1004, and 1008 of the instruction memory 210. The control flow engine 250 subsequently fetches and executes the instruction of task 0 at address 1000 of the instruction memory 210, and then fetches and executes the instructions of task 0 at addresses 1004 and 1008. During this time, the program counter 255 may hold the address that the control flow engine 250 is currently executing.
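The address example can be traced with the short Python sketch below. The instruction mnemonics, the `task_table` layout (start address and instruction count per task), and the fixed 4-byte stride are assumptions chosen to match the addresses 1000, 1004, and 1008 in the example.

```python
# Hypothetical contents of the instruction memory 210 for task 0.
instruction_memory = {1000: "mul", 1004: "add", 1008: "mul"}

# Hypothetical table: task index -> (start address, number of instructions).
task_table = {0: (1000, 3)}


def run_task(idx, stride=4):
    """Step a program counter through a task's instructions in address order."""
    start, count = task_table[idx]
    pc = start
    trace = []
    for _ in range(count):
        trace.append((pc, instruction_memory[pc]))  # fetch and "execute"
        pc += stride                                # advance to the next address
    return trace


trace = run_task(0)
```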

In addition, the operation target data for the instructions executed by the control flow engine 250 may be stored in the data memory 220. The data fetch unit 270 may then fetch the operation target data and supply it to each of the one or more computation modules 281 of the execution unit 280, and the computation modules 281 may perform mathematical operations on the operation target data.

Furthermore, the control flow engine 250 may store the result data from the operations of the execution unit 280 in the data memory 220, and this result data may later be used by a task that requires it. The data flow engine 240, or the data flow engine of another processing unit, may check whether the data is ready for a task that requires the result data stored in the data memory 220.

With reference to the above, a specific example implementation of the series of processes performed by the data flow engine 240 and the control flow engine 250 according to an embodiment of the present invention is described below.

First, the processing unit 200 receives data information packets from outside (its own storage, an external storage connected to it, another processing unit, or the command processor). Each data information packet may include at least one of the address (or ID) of the destination task, an operand address, and the source address of the data. The data flow engine 240 of the processing unit 200 updates the operand state of the task corresponding to the destination of the received data information packet. In this way, the data flow engine 240 can manage the operand state of each task, whether its data is ready, and the source addresses of the data.

Next, the data flow engine 240 receives a data information packet and checks the operand state of the task. If all the operands required by that task are ready, the data flow engine 240 sends the index of the task to the ready-to-fetch queue 251 of the control flow engine 250.
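One way to picture the operand bookkeeping is the Python sketch below. The field names, the `OperandTable` class, and the example addresses are hypothetical; the sketch only assumes that a packet names a destination task, an operand slot, and a source address, and that a task becomes ready once every operand slot has been filled.

```python
from dataclasses import dataclass, field


@dataclass
class DataInfoPacket:
    task_id: int         # destination task (address or ID)
    operand: int         # which operand of that task the packet refers to
    source_address: int  # where the actual data can be read from


@dataclass
class OperandState:
    needed: int                                  # operands the task requires
    sources: dict = field(default_factory=dict)  # operand -> source address


class OperandTable:
    """Toy model of the per-task operand state kept by a data flow engine."""

    def __init__(self, operands_per_task):
        self.state = {t: OperandState(n) for t, n in operands_per_task.items()}
        self.fetch_ready = []  # indices handed to the ready-to-fetch queue

    def receive(self, packet):
        st = self.state[packet.task_id]
        st.sources[packet.operand] = packet.source_address
        if len(st.sources) == st.needed and packet.task_id not in self.fetch_ready:
            self.fetch_ready.append(packet.task_id)


table = OperandTable({0: 2, 1: 1})
table.receive(DataInfoPacket(task_id=0, operand=0, source_address=0x100))
after_first = list(table.fetch_ready)  # task 0 still waits for operand 1
table.receive(DataInfoPacket(task_id=1, operand=0, source_address=0x200))
table.receive(DataInfoPacket(task_id=0, operand=1, source_address=0x104))
```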

Next, the control flow engine 250 pops the index of the task at the front of the ready-to-fetch queue 251 and sends it to the fetch block 252. The control flow engine 250 is also allocated, from the dynamic memory 221 of the data memory 220, the memory required by the task corresponding to the index sent to the fetch block 252. Here, the control flow engine 250 may read the data information for that task from the operand information stored in the data flow engine 240. Because the operand information contains the source address of the data, the control flow engine 250 sends a read request packet to the source address of the data. Depending on the source address, the read request packet may be delivered to another processing unit, to the memory controller of the computing module, or to the memory controller of the host. The control flow engine 250 may omit the data transfer when the source address it reads belongs to the processing unit that contains it.

Next, when the control flow engine 250 receives a read response to a read request packet sent for the task corresponding to the index in the fetch block 252, the processing unit 200 stores the read response in the allocated dynamic memory. The control flow engine 250 sends to the ready-to-run queue 253 the index of a task for which read responses have been received for all of the read request packets it sent. The fetch block 252 may then repeat the processes described above.
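The read request and read response handling might look like the following Python sketch. The `(unit, offset)` form of a source address, the unit names, and the `FetchBlock` class are assumptions made for illustration; the sketch skips requests for sources local to the processing unit and moves a task to the ready-to-run list only after every outstanding response has arrived.

```python
class FetchBlock:
    """Toy model of a fetch block tracking outstanding read requests per task."""

    def __init__(self, local_unit):
        self.local_unit = local_unit
        self.outstanding = {}     # task index -> set of sources still awaited
        self.dynamic_memory = {}  # (task index, source) -> payload
        self.run_ready = []       # indices handed to the ready-to-run queue

    def start(self, idx, operand_sources):
        # Sources are hypothetical (unit, offset) pairs; local ones need no transfer.
        requests = [src for src in operand_sources if src[0] != self.local_unit]
        self.outstanding[idx] = set(requests)
        if not requests:
            self._finish(idx)
        return requests  # read request packets to send

    def on_read_response(self, idx, source, payload):
        self.dynamic_memory[(idx, source)] = payload  # store in allocated memory
        self.outstanding[idx].discard(source)
        if not self.outstanding[idx]:
            self._finish(idx)

    def _finish(self, idx):
        del self.outstanding[idx]
        self.run_ready.append(idx)


fb = FetchBlock(local_unit="pu0")
requests = fb.start(0, [("pu1", 0x10), ("pu0", 0x20), ("host", 0x30)])
fb.on_read_response(0, ("pu1", 0x10), b"\x01")
ready_mid = list(fb.run_ready)  # one response is still outstanding
fb.on_read_response(0, ("host", 0x30), b"\x02")
```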

Next, the control flow engine 250 pops the index of the task at the front of the running ready queue 253 and transmits it to the running block 254. The control flow engine 250 also reads, from the data flow engine 240, the address in the instruction memory 210 of the task corresponding to that index and loads it into the program counter 255. The control flow engine 250 then reads the one or more instructions of the task from the instruction memory 210 and executes them. The program counter 255 executes all instructions of the task by incrementing the reference count as many times as there are instructions in the task. During execution, the control flow engine 250 may store the operation result of each task in the data memory 220 and may deliver the stored result to the predesignated next task via a data information packet. A processing unit that receives the data information packet may continue with the processes described above.
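The running stage can be sketched in the same spirit. The function below is a simplified illustration under stated assumptions: `task_table` (mapping a task index to a start address and an instruction count) and the `execute` callback are hypothetical stand-ins for the information read from the data flow engine 240 and for the execution unit, and the counter is modeled as a plain integer that advances once per instruction.

```python
from collections import deque


def run_ready_tasks(running_ready, task_table, instruction_memory, execute):
    """Pop task indexes in queue order and step a counter through each task's instructions."""
    trace = []
    while running_ready:
        idx = running_ready.popleft()
        # Hypothetical layout: (start address in instruction memory, instruction count).
        start, count = task_table[idx]
        pc = start
        for _ in range(count):
            trace.append(execute(instruction_memory[pc]))
            pc += 1   # advance once per instruction until the task is complete
    return trace
```

Each task runs to completion before the next index is popped, which reflects the sequential execution of the instructions within a task.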

Although the above processes have been described with reference to a single processing unit, every processing unit of the computing module 160 may perform them individually. Moreover, the processes need not be executed strictly in sequence and may proceed almost simultaneously.

FIG. 8 is a diagram illustrating the interaction between the control flow engine 250 and the instruction memory 210, and between the data fetch unit 270 and the data memory 220.

Referring to FIG. 8, in the example described above with reference to FIG. 7, the program counter 255 of the control flow engine 250 can fetch the instruction located at a specific address from the instruction memory 210. In the same example, the data fetch unit 270 fetches the operation target data for the executing task, which is stored in the data memory 220, to the execution unit 280, and each of the plurality of operation modules 281 can perform an operation on that data.

Meanwhile, a task being executed under the control of the control flow engine 250 may transmit data to other tasks, in which case the reference count may increase as described above. Operation results are produced while the control flow engine 250 controls task execution, and in this embodiment the hardware supports the allocation and release of dynamic memory when tasks exchange data with one another, which keeps the programming model simple.
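The role of the reference count in releasing dynamic memory can be illustrated with a small software model. This is only a sketch under assumptions: the class name `RefCountedMemory`, its methods, and the rule that the producing task holds the first reference are introduced here for illustration and are not taken from the disclosure, which performs this bookkeeping in hardware.

```python
class RefCountedMemory:
    """Toy model of dynamic memory whose blocks are freed when no task still uses them."""

    def __init__(self, capacity):
        self.free = capacity
        self.blocks = {}   # owner task index -> [size, reference count]

    def allocate(self, owner, size):
        if size > self.free:
            raise MemoryError("dynamic memory exhausted")
        self.free -= size
        # Assumption: the producing task itself holds the first reference.
        self.blocks[owner] = [size, 1]

    def add_consumer(self, owner):
        """Another task will read the owner's data, so the count goes up."""
        self.blocks[owner][1] += 1

    def release(self, owner):
        """A task has finished with the data; free the block when the count reaches zero."""
        block = self.blocks[owner]
        block[1] -= 1
        if block[1] == 0:
            self.free += block[0]
            del self.blocks[owner]
```

Because the block is returned only when the last user releases it, neither the producing task nor the consuming task has to know which of them finishes last, which is the sense in which such support can simplify the programming model.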

For example, when the neural network processing system is implemented so that a sequential computation is divided among a plurality of tasks, the control flow engine 250 has the next task receive the operation result produced during task execution, and the task that receives the result can then perform the next operation. As soon as an operation result becomes available during task execution, the control flow engine 250 delivers it to the other task that needs it, and that task can proceed with the next operation, so the processing speed of the processing unit 200 can be increased.

Furthermore, according to embodiments of the present invention, the data flow method is used up to the stage in which the data flow engine checks whether the data of each task is ready and transmits the ready tasks to the control flow engine for execution, in order to increase the progress rate and execution speed of tasks from the standpoint of overall processing. In addition, to remove the overhead of the data flow method, the von Neumann method may be used in the stages in which the control flow engine controls and executes tasks that require sequential execution.
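The division of labor between the two methods can be sketched as follows. This is an illustrative model only: `DataFlowEngine`, `operand_arrived`, and `control_flow_run` are hypothetical names, and the per-task operand counter is an assumed way of representing the readiness check.

```python
from collections import deque


class DataFlowEngine:
    """Data flow part: a task becomes ready when its last missing operand arrives."""

    def __init__(self, needed):
        self.missing = dict(needed)   # task index -> number of operands not yet ready
        # Tasks that need no operands are ready from the start.
        self.ready = deque(idx for idx, n in self.missing.items() if n == 0)

    def operand_arrived(self, idx):
        self.missing[idx] -= 1
        if self.missing[idx] == 0:
            self.ready.append(idx)    # notification order = order of readiness


def control_flow_run(ready, programs):
    """Von Neumann part: each notified task executes its instructions strictly in order."""
    order = []
    while ready:
        idx = ready.popleft()
        for instr in programs[idx]:
            order.append((idx, instr))
    return order
```

In this model, readiness is tracked per task rather than per instruction, so the bookkeeping cost of the data flow method is paid once per task while the instructions inside a task run sequentially.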

The foregoing description of the present invention is provided for purposes of illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims that follow, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

Claims

A neural network processing unit configured to perform processing of a neural network, comprising:
An instruction memory configured to store tasks each having one or more instructions;
A data memory configured to store data associated with the tasks in a dynamic memory;
A data flow engine configured to check whether data is ready for the tasks and to notify a control flow engine of the readiness of the tasks in the order in which their data preparation is completed;
The control flow engine, configured to control execution of the tasks in the order notified by the data flow engine; and
An execution unit configured to perform operations according to the one or more instructions of a task that the control flow engine controls to be executed,
Wherein the data flow engine is configured to:
After allocating a predetermined space of the dynamic memory to a specific task, release the space by using a reference count when the tasks, including the specific task, no longer use the data of the specific task; and
Receive, from the instruction memory, the index of each of the tasks and information on the data associated with the tasks, check whether the data required by each of the tasks is ready, and transmit the indexes of the tasks whose data is ready to the control flow engine in the order in which data preparation is completed.

The neural network processing unit of claim 1,
Further comprising a data fetch unit configured to fetch, from the data memory to the execution unit, the operation target data for the one or more instructions of a task that the control flow engine controls to be executed.

The neural network processing unit of claim 1,
Wherein the control flow engine stores result data of the operation of the execution unit in the data memory.

The neural network processing unit of claim 1,
Further comprising a router configured to relay communication between an external device and the processing unit.

The neural network processing unit of claim 1,
Further comprising a register file having at least one register, the register being a storage space used while the control flow engine executes a task.

(deleted)

The neural network processing unit of claim 1,
Wherein the control flow engine comprises:
A fetch ready queue configured to receive, from the data flow engine, the indexes of tasks whose data is ready;
A fetch block configured to check whether the data required to execute the task corresponding to an index received from the fetch ready queue exists in the data memory and, if not, to load the data from outside into the data memory;
A running ready queue configured to receive, from the fetch block, the index of a task whose data has been loaded;
A running block configured to sequentially receive, from the running ready queue, the indexes of the tasks held in the running ready queue; and
A program counter configured to execute the one or more instructions of the task corresponding to the index transmitted to the running block.

The neural network processing unit of claim 1,
Wherein the execution unit is formed such that the operation modules that perform the operations are arranged in a specific pattern.

A method of performing processing of a neural network by using a neural network processing unit that includes an instruction memory configured to store tasks each having one or more instructions, a data memory configured to store data associated with the tasks in a dynamic memory, a data flow engine, a control flow engine, and an execution unit, the method comprising:
(i) checking, by the data flow engine, whether data is ready for the tasks and notifying the control flow engine of the readiness of the tasks in the order in which data preparation is completed;
(ii) controlling, by the control flow engine, execution of the tasks in the order notified by the data flow engine; and
(iii) performing, by the execution unit, operations according to the one or more instructions of a task that the control flow engine controls to be executed,
Wherein the step (i) comprises:
Receiving, by the data flow engine, from the instruction memory the index of each of the tasks and information on the data associated with the tasks, and allocating a predetermined space of the dynamic memory to a specific task;
Checking, by the data flow engine, whether the data required by each of the tasks is ready; and
Transmitting, by the data flow engine, the indexes of tasks whose data is ready to the control flow engine in the order in which data preparation is completed,
And wherein, after the step (iii), the data flow engine uses a reference count to release the predetermined space of the dynamic memory allocated to the specific task when the tasks, including the specific task, no longer use the data of the specific task.

The method of claim 10,
Wherein the neural network processing unit further includes a data fetch unit,
And the method further comprises, between the step (ii) and the step (iii), fetching, by the data fetch unit, from the data memory to the execution unit the operation target data for the one or more instructions of a task that the control flow engine controls to be executed.

The method of claim 10,
Further comprising, after the step (iii), storing, by the control flow engine, result data of the operation of the execution unit in the data memory.

(deleted)

The method of claim 10,
Wherein the control flow engine further includes a fetch ready queue, a fetch block, a running ready queue, a running block, and a program counter,
And the step (ii) comprises:
Receiving, by the fetch ready queue, from the data flow engine the indexes of tasks whose data is ready;
When the fetch ready queue sequentially transmits the indexes of the tasks to the fetch block, checking, by the fetch block, whether the data required to execute the task corresponding to the received index exists in the data memory and, if not, loading the data from outside into the data memory;
Receiving, by the running ready queue, from the fetch block the index of a task whose data has been loaded; and
When the running ready queue sequentially transmits the indexes of the tasks to the running block, executing, by the program counter, the one or more instructions of the task corresponding to the index transmitted to the running block.

A neural network processing system configured to perform processing of a neural network, comprising:
A central processing unit; and a computing module having at least one processing unit that performs processing according to a command of the central processing unit,
Wherein the processing unit comprises:
An instruction memory configured to store tasks each having one or more instructions;
A data memory configured to store data associated with the tasks in a dynamic memory;
A data flow engine configured to check whether data is ready for the tasks and to notify a control flow engine of the readiness of the tasks in the order in which their data preparation is completed;
The control flow engine, configured to control execution of the tasks in the order notified by the data flow engine; and
At least one execution unit configured to perform operations according to the one or more instructions of a task that the control flow engine controls to be executed,
And wherein the data flow engine is configured to:
After allocating a predetermined space of the dynamic memory to a specific task, release the space by using a reference count when the tasks, including the specific task, no longer use the data of the specific task; and
Receive, from the instruction memory, the index of each of the tasks and information on the data associated with the tasks, check whether the data required by each of the tasks is ready, and transmit the indexes of the tasks whose data is ready to the control flow engine in the order in which data preparation is completed.

The neural network processing system of claim 15,
Wherein the processing unit
Further comprises a data fetch unit configured to fetch, from the data memory to the execution unit, the operation target data for the one or more instructions of a task that the control flow engine controls to be executed.

The neural network processing system of claim 15,
Wherein the control flow engine stores result data of the operation of the execution unit in the data memory.

The neural network processing system of claim 16,
Wherein the result data stored in the data memory is used for processing in a processing unit other than the processing unit that includes the data memory.

The neural network processing system of claim 15,
Further comprising a host interface configured to connect the central processing unit and the computing module so that they can communicate with each other.

The neural network processing system of claim 15,
Further comprising a command processor configured to transfer the command of the central processing unit to the computing module.

The neural network processing system of claim 15,
Further comprising a memory controller configured to control data transmission and storage for each of the central processing unit and the computing module.

The neural network processing system of claim 15,
Wherein the processing unit
Further comprises a router configured to relay communication between an external device and the processing unit, and a register file having at least one register, the register being a storage space used while the control flow engine executes a task.

(deleted)

The neural network processing system of claim 15,
Wherein the execution unit is formed such that the operation modules that perform the operations are arranged in a specific pattern.

(deleted)

The neural network processing system of claim 15,
Wherein the control flow engine comprises:
A fetch ready queue configured to receive, from the data flow engine, the indexes of tasks whose data is ready;
A fetch block configured to check whether the data required to execute the task corresponding to an index received from the fetch ready queue exists in the data memory and, if not, to load the data from outside into the data memory;
A running ready queue configured to receive, from the fetch block, the index of a task whose data has been loaded;
A running block configured to sequentially receive, from the running ready queue, the indexes of the tasks held in the running ready queue; and
A program counter configured to execute the one or more instructions of the task corresponding to the index transmitted to the running block.

A method of performing processing of a neural network by using a neural network processing system that includes a central processing unit and a computing module having at least one processing unit that performs processing according to a command of the central processing unit, wherein the processing unit includes an instruction memory configured to store tasks each having one or more instructions, a data memory configured to store data associated with the tasks in a dynamic memory, a data flow engine, a control flow engine, and an execution unit, the method comprising:
(1) transmitting, by the central processing unit, a command to the computing module; and
(2) performing, by the computing module, processing according to the command of the central processing unit,
Wherein the step (2) comprises:
(2-1) checking, by the data flow engine, whether data is ready for the tasks and notifying the control flow engine of the readiness of the tasks in the order in which data preparation is completed;
(2-2) controlling, by the control flow engine, execution of the tasks in the order notified by the data flow engine; and
(2-3) performing, by the execution unit, operations according to the one or more instructions of a task that the control flow engine controls to be executed,
Wherein the step (2-1) comprises:
Receiving, by the data flow engine, from the instruction memory the index of each of the tasks and information on the data associated with the tasks, and allocating a predetermined space of the dynamic memory to a specific task;
Checking, by the data flow engine, whether the data required by each of the tasks is ready; and
Transmitting, by the data flow engine, the indexes of tasks whose data is ready to the control flow engine in the order in which data preparation is completed,
And wherein, after the step (2-3), the data flow engine uses a reference count to release the predetermined space of the dynamic memory allocated to the specific task when the tasks, including the specific task, no longer use the data of the specific task.

The method of claim 27,
Wherein the processing unit further includes a data fetch unit,
And the method further comprises, between the step (2-2) and the step (2-3), fetching, by the data fetch unit, from the data memory to the execution unit the operation target data for the one or more instructions of a task that the control flow engine controls to be executed.

The method of claim 27,
Wherein the step (2) further comprises, after the step (2-3), storing, by the control flow engine, result data of the operation of the execution unit in the data memory.

The method of claim 29,
Further comprising loading the result data by a processing unit other than the processing unit that includes the data memory in which the result data is stored.

The method of claim 27,
Further comprising, before the step (1), initializing, by the central processing unit, the state of the computing module, and, after the step (2), storing, by the central processing unit, result data of the computation of the computing module in storage of the central processing unit.

(deleted)

The method of claim 27,
Wherein the control flow engine further includes a fetch ready queue, a fetch block, a running ready queue, a running block, and a program counter,
And the step (2-2) comprises:
Receiving, by the fetch ready queue, from the data flow engine the indexes of tasks whose data is ready;
When the fetch ready queue sequentially transmits the indexes of the tasks to the fetch block, checking, by the fetch block, whether the data required to execute the task corresponding to the received index exists in the data memory and, if not, loading the data from outside into the data memory;
Receiving, by the running ready queue, from the fetch block the index of a task whose data has been loaded; and
When the running ready queue sequentially transmits the indexes of the tasks to the running block, executing, by the program counter, the one or more instructions of the task corresponding to the index transmitted to the running block.