KR102539573B1 - Network-on-chip data processing method and device

Info

Publication number: KR102539573B1
Application number: KR1020207034138A
Authority: KR
Inventors: 리우샤올리; 리젠; 장야오
Original assignee: Shanghai Cambricon Information Technology Co., Ltd.
Priority date: 2018-11-21
Filing date: 2019-10-18
Publication date: 2023-06-01
Also published as: CN111209231A; KR20200138414A; CN111209231B

Abstract

The present application relates to a network-on-chip data processing method applied to a network-on-chip processing system. The network-on-chip processing system performs machine learning calculations and includes a storage device and computing devices. The method includes: accessing the storage device of the network-on-chip processing system through a first computing device of the network-on-chip processing system to obtain first operation data; performing an operation on the first operation data through the first computing device to obtain a first operation result; and sending the first operation result to a second computing device of the network-on-chip processing system. The method can reduce operation overhead and improve data read/write efficiency.

Description

Network-on-chip data processing method and device

Related applications

This application claims priority to the following Chinese patent applications, the entire contents of which are incorporated herein by reference. Filed on October 18, 2018: application No. 201811216718.9, titled "Network-on-chip processing system and network-on-chip data processing method"; application No. 201811215820.7, titled "Network-on-chip processing system and network-on-chip data processing method"; application No. 201811215978.4, titled "Network-on-chip processing system and network-on-chip data processing method"; and application No. 201811216857.1, titled "Network-on-chip data processing method, storage medium, computer equipment and apparatus". Filed on November 21, 2018: application No. 201811392232.0, titled "Data processing method, apparatus and related products"; application No. 201811392262.1, titled "Data processing method, apparatus and related products"; application No. 201811392279.7, titled "Data processing apparatus, method and related products"; application No. 201811393352.2, titled "Data processing apparatus, method and related products"; application No. 201811390409.3, titled "Data processing apparatus, method and related products"; application No. 201811390428.6, titled "Data processing apparatus and related products"; and application No. 201811392270.6, titled "Data processing apparatus and related products".

The present application relates to the field of information processing technology, and specifically to a network-on-chip data processing method and device.

With advances in semiconductor process technology, integrating hundreds of millions of transistors on a single chip has become a reality. A network-on-chip (NoC) can integrate a large number of computing resources on a single chip and provide on-chip communication among them.

Because a neural network requires a large amount of computation, some of that computation, such as forward operations, backward operations, and weight updates, has to be processed in parallel. In a chip architecture with a large number of transistors, chip design faces problems such as high memory access overhead, frequent bandwidth congestion, and low data read/write efficiency.

To solve, at least to some extent, the problems existing in the related art, the present application provides an interaction method, an apparatus, and an intelligent terminal.

According to a first aspect, the present application provides a network-on-chip processing system. The system includes a storage device and a plurality of computing devices arranged on the same chip, where at least one computing device is connected to the storage device and at least two computing devices are connected to each other.

In one embodiment, any two of the plurality of computing devices are directly connected to each other.

In one embodiment, the plurality of computing devices include a first computing device and a plurality of second computing devices, where the first computing device is connected to the storage device and at least one of the plurality of second computing devices is connected to the first computing device.

In one embodiment, at least two of the plurality of second computing devices are connected to each other and are connected to the storage device through the first computing device.

In one embodiment, any two of the plurality of second computing devices are directly connected to the first computing device.

In one embodiment, each of the plurality of computing devices is connected to the storage device, and at least two computing devices are connected to each other.
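The connection variants above can be pictured as an undirected graph whose nodes are the storage device and the computing devices. The following is a minimal sketch under stated assumptions: the node names (`storage`, `first`, `second_a`, `second_b`) and the helper `reaches_storage` are hypothetical and appear nowhere in the application. It only checks whether a device has some path to the storage device, for example a second computing device that reaches storage through the first computing device.

```python
from collections import deque


def reaches_storage(links, device, storage="storage"):
    """Return True if `device` can reach `storage` over the undirected links."""
    # Build an adjacency map from the list of (node, node) links.
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    # Breadth-first search starting at the device.
    seen, queue = {device}, deque([device])
    while queue:
        node = queue.popleft()
        if node == storage:
            return True
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


# Hypothetical layout: the first computing device is attached to the storage
# device; two second computing devices are linked to each other, and one of
# them is linked to the first computing device.
links = [
    ("first", "storage"),
    ("second_a", "first"),
    ("second_b", "second_a"),
]
print(reaches_storage(links, "second_b"))
```

In this layout only `first` touches the storage device directly; the other devices depend on it for storage access, which corresponds to the embodiment in which second computing devices are connected to the storage device through the first computing device.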

According to a second aspect, an embodiment of the present application provides a neural network operation device that includes one or more of the computing devices of the network-on-chip processing system. The neural network operation device obtains the input data to be operated on and control information from other processing devices, performs a specified machine learning operation, and passes the execution result to other processing devices through an I/O interface.

When the neural network operation device includes a plurality of the computing devices, the computing devices may be connected through a specific structure to transfer data.

Here, the plurality of computing devices are interconnected and transfer data through a PCI bus in order to support larger-scale machine learning operations; the plurality of computing devices either share one control system or each have their own control system; the plurality of computing devices either share memory or each have their own memory; and the plurality of computing devices may be interconnected in any interconnect topology.

According to a third aspect, an embodiment of the present application provides an integrated processing device that includes the machine learning processing device according to the second aspect, a universal interconnect interface, and other processing devices. The neural network operation device interacts with the other processing devices to jointly complete a computing operation specified by a user. The integrated processing device may further include a storage device that is connected to the neural network operation device and to the other processing devices, respectively, and that stores data of the neural network operation device and the other processing devices.

According to a fourth aspect, an embodiment of the present application provides a neural network chip that includes the computing device of the network-on-chip processing system, the neural network operation device according to the second aspect, or the integrated processing device according to the third aspect.

According to a fifth aspect, an embodiment of the present application provides a neural network chip package structure that includes the neural network chip according to the fourth aspect.

According to a sixth aspect, an embodiment of the present application provides a board card that includes the neural network chip package structure according to the fifth aspect.

According to a seventh aspect, an embodiment of the present application provides an electronic device that includes the neural network chip according to the fourth aspect or the board card according to the sixth aspect.

According to an eighth aspect, an embodiment of the present application further provides a network-on-chip data processing method for performing machine learning calculations.

The method includes:

accessing a storage device through a first computing device to obtain first operation data;

performing an operation on the first operation data through the first computing device to obtain a first operation result; and

sending the first operation result to a second computing device.

In one embodiment, the method further includes accessing the storage device through the second computing device to obtain second operation data.

In one embodiment, the method further includes performing an operation on the second operation data and the first operation result through the second computing device to obtain a second operation result.
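The steps of this method can be sketched in software as follows. This is an illustrative model only, not the claimed hardware: the classes `StorageDevice` and `ComputingDevice`, the keys `"first"` and `"second"`, and the use of a sum and an element-wise addition as stand-in operations are all assumptions made for the example.

```python
class StorageDevice:
    """Hypothetical on-chip storage holding named blocks of operation data."""

    def __init__(self, contents):
        self._contents = dict(contents)

    def read(self, key):
        return self._contents[key]


class ComputingDevice:
    """Hypothetical computing device with an optional link to the storage device."""

    def __init__(self, name, storage=None):
        self.name = name
        self.storage = storage
        self.inbox = []  # results received from other computing devices

    def fetch(self, key):
        if self.storage is None:
            raise RuntimeError(f"{self.name} is not connected to the storage device")
        return self.storage.read(key)

    def send(self, other, value):
        other.inbox.append(value)


storage = StorageDevice({"first": [1, 2, 3], "second": [10, 20, 30]})
first = ComputingDevice("first", storage)
second = ComputingDevice("second", storage)

# Step 1: the first computing device accesses storage for the first operation data.
first_data = first.fetch("first")
# Step 2: it operates on that data (a sum stands in for the real operation).
first_result = sum(first_data)
# Step 3: it sends the first operation result to the second computing device.
first.send(second, first_result)

# Optional steps: the second device fetches its own data and combines it
# with the received first operation result.
second_data = second.fetch("second")
second_result = [x + second.inbox[0] for x in second_data]
print(first_result, second_result)
```

The point of the sketch is that the second computing device receives the first operation result directly from the first computing device and does not have to read it back from the storage device.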

A network-on-chip processing system includes a storage device and a plurality of computing device groups arranged on the same chip. Each computing device group includes a plurality of computing devices, at least one of the plurality of computing device groups is connected to the storage device, and at least two computing device groups are connected to each other.

In one embodiment, any two of the plurality of computing device groups are directly connected to each other.

In one embodiment, each computing device group includes at least one computing device that is connected to at least one computing device of another computing device group.

In one embodiment, the plurality of computing device groups are connected to each other through any one computing device within the plurality of computing device groups.

In one embodiment, in each computing device group, at least one computing device is connected to the storage device and at least two computing devices are connected to each other.

In one embodiment, any two computing devices in each computing device group are directly connected to each other.

In one embodiment, each computing device group includes a first computing device and a plurality of second computing devices, where the first computing device is connected to the storage device and at least one of the plurality of second computing devices is connected to the first computing device.

In one embodiment, in each computing device group, at least two of the plurality of second computing devices are connected to each other and are connected to the storage device through the first computing device.

In one embodiment, in each computing device group, any two of the plurality of second computing devices are directly connected to the first computing device.

In one embodiment, in each computing device group, each of the plurality of computing devices is connected to the storage device, and at least two computing devices are connected to each other.

In one embodiment, the plurality of computing device groups include a main computing device group and a plurality of sub computing device groups, where the main computing device group is connected to the storage device and at least one of the plurality of sub computing device groups is connected to the main computing device group.

In one embodiment, at least two of the plurality of sub computing device groups are connected to each other and are connected to the storage device through the main computing device group.

In one embodiment, any two of the plurality of sub computing device groups are directly connected to the main computing device group.

In one embodiment, each of the plurality of computing device groups is connected to the storage device, and at least two computing device groups are connected to each other.

An embodiment of the present application further provides a network-on-chip data processing method. The method includes:

accessing a storage device through a first computing device group, which includes a plurality of first computing devices, to obtain first operation data;

performing, by the first computing device group, an operation on the first operation data to obtain a first operation result; and

sending the first operation result to a second computing device group.

In one embodiment, the method further includes accessing the storage device through the second computing device group, which includes a plurality of second computing devices, to obtain second operation data.

In one embodiment, the method further includes performing an operation on the second operation data and the first operation result through the second computing device group to obtain a second operation result.

In one embodiment, the method further includes performing an operation on the second operation data and the first operation result through the second computing device group to obtain a second operation result, where this step includes:

operating on and forwarding the second operation data and the first operation result among the plurality of second computing devices to obtain the second operation result.
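One way to read "operating on and forwarding" within a group is as a chain in which each device combines its share of the data with a running value and passes that value to the next device. The sketch below assumes exactly that; the function `group_compute`, the chain ordering, and the use of addition as the operation are hypothetical choices, and the application does not prescribe them.

```python
def group_compute(chunks, incoming=0):
    """Model a group whose devices operate on their chunk and forward the running value.

    `chunks[i]` is the share of operation data held by device i; `incoming`
    is a result received from another group (0 if there is none).
    Returns the group's result and a trace of the forwarded values.
    """
    running = incoming
    trace = []
    for device_index, chunk in enumerate(chunks):
        running += sum(chunk)                  # stand-in for the device's operation
        trace.append((device_index, running))  # value forwarded to the next device
    return running, trace


# The first group operates on the first operation data.
first_result, _ = group_compute([[1, 2], [3, 4]])
# The second group combines its own data with the received first result.
second_result, trace = group_compute([[5], [6, 7]], incoming=first_result)
print(first_result, second_result, trace)
```

Other forwarding patterns, such as a star around one device of the group, would fit the text equally well; the chain is used here only because it is the shortest to write down.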

A network-on-chip processing system includes a plurality of network-on-chip processing modules that are arranged on the same chip and connected to each other. Each network-on-chip processing module includes at least one storage device and a plurality of computing devices. In each network-on-chip processing module, at least one computing device is connected to at least one storage device inside that network-on-chip processing module, and at least two of the plurality of computing devices are connected to each other.

In one embodiment, the plurality of computing devices in each network-on-chip processing module include a first computing device and a plurality of second computing devices, where the first computing device is connected to at least one storage device inside the network-on-chip processing module and at least one of the plurality of second computing devices is connected to the first computing device.

In one embodiment, at least two second computing devices in each network-on-chip processing module are connected to each other and are connected, through the first computing device, to at least one storage device inside the network-on-chip processing module.

In one embodiment, any two second computing devices in each network-on-chip processing module are directly connected to the first computing device.

In one embodiment, every computing device in each network-on-chip processing module is connected to at least one storage device inside the network-on-chip processing module, and at least two computing devices are connected to each other.

In one embodiment, any two computing devices in each network-on-chip processing module are directly connected to each other.

In one embodiment, each network-on-chip processing module includes a plurality of storage devices, and at least one computing device in the network-on-chip processing module is connected to the plurality of storage devices inside the network-on-chip processing module.

In one embodiment, every computing device in each network-on-chip processing module is connected to the plurality of storage devices inside the network-on-chip processing module.

In one embodiment, each network-on-chip processing module includes at least one computing device that is connected to at least one computing device of another network-on-chip processing module.

In one embodiment, the plurality of network-on-chip processing modules are connected to each other through any one computing device of each network-on-chip processing module.

In one embodiment, in each network-on-chip processing module, every computing device is connected to the storage device, and the distance between each computing device and the storage device is a first communication distance.

In one embodiment, any two network-on-chip processing modules are directly connected to each other.

obtaining first operation data through a first network-on-chip processing module, where the first network-on-chip processing module includes a first storage device and a plurality of first computing devices, and the first operation data is stored in the first storage device;

performing an operation on the first operation data through the plurality of first computing devices in the first network-on-chip processing module to obtain a first operation result; and

sending the first operation result to a second network-on-chip processing module.

In one embodiment, the method further includes obtaining second operation data through the second network-on-chip processing module, where the second network-on-chip processing module includes a second storage device and a plurality of second computing devices, and the second operation data is stored in the second storage device.

In one embodiment, the method further includes performing an operation on the second operation data and the first operation result through the plurality of second computing devices in the second network-on-chip processing module to obtain a second operation result.

In one embodiment, the method further includes:

performing an operation on the second operation data and the first operation result among the plurality of second computing devices to obtain the second operation result; and

storing the second operation result in the second storage device.

In one embodiment, the method further includes:

accessing the first storage device through a first main computing device in the first network-on-chip processing module to obtain the first operation data;

forwarding the first operation data between the first main computing device and a plurality of first sub computing devices in the first network-on-chip processing module; and

performing an operation on the first operation data through the first main computing device and the plurality of first sub computing devices in the first network-on-chip processing module to obtain the first operation result,

where the first computing devices include the first main computing device and the plurality of first sub computing devices.

A network-on-chip data processing method is applied to a network-on-chip processing system that includes a storage device and computing devices and that performs machine learning calculations.

The method includes:

accessing a storage device in the network-on-chip processing system through a first computing device in the network-on-chip processing system to obtain first operation data;

performing an operation on the first operation data through the first computing device to obtain a first operation result;

and sending the first operation result to a second computing device in the network-on-chip processing system.

In one embodiment, the computing device includes an operation unit and a controller unit.

The step of accessing the storage device of the network-on-chip processing system through the first computing device of the network-on-chip processing system to obtain the first operation data includes:

obtaining, by the controller unit of the first computing device, the first operation data and a calculation instruction from the storage device.

In one embodiment, the operation unit includes one master processing circuit and a plurality of slave processing circuits.

The step of performing an operation on the first operation data through the first computing device to obtain the first operation result includes:

parsing, by the controller unit of the first computing device, the calculation instruction to obtain a plurality of operation instructions, and sending, by the controller unit, the plurality of operation instructions and the first operation data to the master processing circuit of the first computing device;

performing, by the master processing circuit of the first computing device, preprocessing on the first operation data, and exchanging data and operation instructions with the plurality of slave processing circuits of the first computing device;

performing, by the plurality of slave processing circuits of the first computing device, intermediate operations according to the operation data and operation instructions sent by the master processing circuit of the first computing device to obtain a plurality of intermediate results, and sending the plurality of intermediate results to the master processing circuit of the first computing device; and

performing, by the master processing circuit of the first computing device, subsequent processing on the plurality of intermediate results to obtain the first operation result of the calculation instruction.
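The division of work between the master processing circuit and the slave processing circuits can be sketched as a split, compute, and merge pattern. The example below is an assumption-laden illustration: it takes a matrix-vector product followed by a ReLU as the operation, distributes weight rows round-robin as the "preprocessing", and treats the operation names `"MATVEC"` and `"RELU"` as already-parsed operation instructions. None of these specifics come from the application.

```python
def slave_circuit(rows, vector):
    """Intermediate operation: dot product of each assigned weight row with the input."""
    return [sum(w * x for w, x in zip(row, vector)) for row in rows]


def master_circuit(ops, weights, vector, n_slaves):
    """Hypothetical master circuit: preprocess, dispatch to slaves, post-process."""
    # Preprocessing: partition the weight rows across the slave circuits (round-robin).
    chunks = [weights[i::n_slaves] for i in range(n_slaves)]
    # Each slave circuit returns its intermediate results to the master circuit.
    intermediates = [slave_circuit(chunk, vector) for chunk in chunks]
    # Subsequent processing: restore the original row order ...
    result = [None] * len(weights)
    for s, partial in enumerate(intermediates):
        for k, value in enumerate(partial):
            result[s + k * n_slaves] = value
    # ... and apply any remaining operation instruction.
    if "RELU" in ops:
        result = [max(0, v) for v in result]
    return result


ops = ["MATVEC", "RELU"]              # assumed to come from the controller unit
weights = [[1, 0], [0, 1], [1, -1]]
vector = [2, 3]
first_result = master_circuit(ops, weights, vector, n_slaves=2)
print(first_result)
```

In hardware the slave circuits would run concurrently; the list comprehension here evaluates them one after another and only models the data flow.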

In one embodiment, the step of sending the first operation result to the second computing device in the network-on-chip processing system includes:

sending, by the controller unit of the first computing device, the first operation result to the second computing device in the network-on-chip processing system.

In one embodiment, the machine learning calculation includes an artificial neural network operation, the first operation data includes input neuron data and weights, and the first operation result is output neuron data.

In one embodiment, the computing device further includes a storage unit and a direct memory access unit, where the storage unit includes any combination of a register and a cache;

the cache stores the first operation data; and

the register stores scalars in the first operation data.

In one embodiment, the controller unit includes an instruction storage unit, an instruction processing unit, and a storage queue unit;

the instruction storage unit stores calculation instructions associated with the artificial neural network operation;

the instruction processing unit parses the calculation instruction to obtain a plurality of operation instructions; and

the storage queue unit stores an instruction queue, where the instruction queue includes a plurality of operation instructions and/or calculation instructions to be executed in the order of the queue.
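The three parts of the controller unit can be modeled as a list, a parsing step, and a first-in, first-out queue. This is a rough sketch under assumptions: the class `ControllerUnit`, its method names, and the semicolon-separated instruction text are invented for the example and are not the instruction format of the application.

```python
from collections import deque


class ControllerUnit:
    """Hypothetical model of the controller unit's three sub-units."""

    def __init__(self):
        self.instruction_store = []  # instruction storage unit
        self.queue = deque()         # storage queue unit (instruction queue)

    def load(self, calculation_instruction):
        """Store a calculation instruction until it is parsed."""
        self.instruction_store.append(calculation_instruction)

    def parse_all(self):
        """Instruction processing unit: split each calculation instruction
        into operation instructions and enqueue them in order."""
        for calculation_instruction in self.instruction_store:
            for operation in calculation_instruction.split(";"):
                self.queue.append(operation.strip())
        self.instruction_store.clear()

    def issue(self):
        """Return the next operation instruction in queue order, or None if empty."""
        return self.queue.popleft() if self.queue else None


controller = ControllerUnit()
controller.load("LOAD; MATVEC")
controller.load("RELU")
controller.parse_all()
issued = [controller.issue() for _ in range(4)]
print(issued)
```

A `deque` is used because it supports constant-time removal from the front, which matches the in-order issue described for the instruction queue.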

In one embodiment, the master processing circuit includes a dependency processing unit;

the dependency processing unit determines whether a first operation instruction is associated with a zeroth operation instruction that precedes it; if the two are associated, the unit caches the first operation instruction in the instruction storage unit and, after the zeroth operation instruction has been executed, fetches the first operation instruction from the instruction storage unit and transmits it to the operation unit.

Determining whether the first operation instruction is associated with the preceding zeroth operation instruction includes:

extracting, according to the first operation instruction, a first storage address range of the data that the first operation instruction requires, and extracting, according to the zeroth operation instruction, a zeroth storage address range of the data that the zeroth operation instruction requires; if the first storage address range overlaps the zeroth storage address range, determining that the first operation instruction and the zeroth operation instruction are associated; and if the two ranges do not overlap, determining that the two instructions are not associated.
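The overlap test described above can be illustrated with a minimal Python sketch. The `AddrRange` type, the function name, and the half-open range convention are assumptions made for illustration; the text only states that overlapping storage address ranges imply an association.

```python
from typing import NamedTuple


class AddrRange(NamedTuple):
    """Hypothetical storage address range: start inclusive, end exclusive."""
    start: int
    end: int


def is_associated(first: AddrRange, zeroth: AddrRange) -> bool:
    """Return True when the two address ranges share at least one address."""
    return first.start < zeroth.end and zeroth.start < first.end
```

Under these assumptions, `is_associated(AddrRange(0, 16), AddrRange(8, 24))` holds because addresses 8 to 15 are shared, so the first operation instruction would be cached until the zeroth one has finished.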

In one embodiment, the operation unit includes a tree module; the tree module includes one root port and a plurality of branch ports; the root port is connected to the master processing circuit, and each of the branch ports is connected to one of the plurality of slave processing circuits;

the tree module forwards data blocks, weights, and operation instructions between the master processing circuit and the plurality of slave processing circuits.

In one embodiment, the operation unit includes one or more branch processing circuits, and each branch processing circuit is connected to at least one slave processing circuit;

the master processing circuit designates the input neurons as broadcast data and the weights as distribution data, splits the distribution data into a plurality of data blocks, and sends at least one of the data blocks, the broadcast data, and at least one of a plurality of operation instructions to the branch processing circuit;

the branch processing circuit forwards data blocks, broadcast data, and operation instructions between the master processing circuit and the plurality of slave processing circuits;

the plurality of slave processing circuits perform operations on the received data blocks and broadcast data according to the operation instruction to obtain intermediate results, and transmit the intermediate results to the branch processing circuit; and

the master processing circuit performs subsequent processing on the intermediate results sent by the branch processing circuit to obtain a first operation result of the calculation instruction, and sends the first operation result of the calculation instruction to the controller unit.
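As a rough illustration of the broadcast/distribution flow above, the following Python sketch models a fully connected layer. It assumes the weights are split row-wise into blocks, that each slave multiplies and accumulates, and that the master concatenates the intermediate results and optionally applies an activation. The function names and the row-wise split are illustrative assumptions, and the branch processing circuit is modelled as a plain pass-through.

```python
import math


def slave_compute(weight_block, broadcast_data):
    """One slave: multiply each weight row by the broadcast input and accumulate."""
    return [sum(w * x for w, x in zip(row, broadcast_data)) for row in weight_block]


def master_compute(weights, input_neurons, num_slaves, activation=None):
    """Master: split weights into blocks, broadcast inputs, combine the results."""
    rows_per_block = max(1, math.ceil(len(weights) / num_slaves))
    blocks = [weights[i:i + rows_per_block]
              for i in range(0, len(weights), rows_per_block)]
    # Each block goes to one slave; the input neurons are sent to every slave.
    intermediate = [slave_compute(block, input_neurons) for block in blocks]
    # Subsequent processing: combine in order, then apply the optional activation.
    combined = [value for part in intermediate for value in part]
    if activation is not None:
        combined = [activation(value) for value in combined]
    return combined
```

For example, with weights `[[1, 2], [3, 4], [5, 6]]`, input neurons `[1, 1]`, and two slaves, the first slave receives two rows and the second receives one, and the combined result is `[3, 7, 11]`.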

In one embodiment, the plurality of slave processing circuits are arranged in an array; each slave processing circuit is connected to the adjacent slave processing circuits; the master processing circuit is connected to K of the plurality of slave processing circuits, the K slave processing circuits being the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column;

the K slave processing circuits forward data and instructions between the master processing circuit and the plurality of slave processing circuits;

the master processing circuit designates the input neurons as broadcast data and the weights as distribution data, splits the distribution data into a plurality of data blocks, and sends at least one of the data blocks and at least one of a plurality of operation instructions to the K slave processing circuits;

the K slave processing circuits forward data between the master processing circuit and the plurality of slave processing circuits;

the plurality of slave processing circuits perform operations on the received data blocks according to the operation instruction to obtain intermediate results, and transmit the operation results to the K slave processing circuits; and

the master processing circuit performs subsequent processing on the intermediate results sent by the K slave processing circuits to obtain a first operation result of the calculation instruction, and sends the first operation result of the calculation instruction to the controller unit.
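To make the array layout concrete, the sketch below enumerates which positions of an m-by-n slave array would be wired to the master under the description above: the first row, the m-th row, and the first column. The zero-based `(row, column)` coordinates and the choice to count the two shared corner circuits once are assumptions made here, not statements from the text.

```python
def master_connected_slaves(m: int, n: int) -> set:
    """Return the (row, col) positions of the slave circuits wired to the master."""
    positions = set()
    for col in range(n):
        positions.add((0, col))        # first row
        positions.add((m - 1, col))    # m-th row
    for row in range(m):
        positions.add((row, 0))        # first column
    return positions
```

For a 3-by-4 array this yields 9 distinct positions; the interior circuits, such as `(1, 1)`, reach the master only through their neighbours.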

In one embodiment, the master processing circuit combines and sorts the intermediate results sent by the plurality of processing circuits to obtain the first operation result of the calculation instruction; or

combines and sorts the intermediate results sent by the plurality of processing circuits and then applies activation processing to obtain the first operation result of the calculation instruction.

In one embodiment, the master processing circuit includes one or any combination of a conversion processing circuit, an activation processing circuit, and an addition processing circuit;

the conversion processing circuit performs pre-processing on the first operation data, specifically by converting the data or intermediate results received by the master processing circuit between a first data structure and a second data structure, or by converting the data or intermediate results received by the master processing circuit between a first data type and a second data type;

the activation processing circuit performs the subsequent processing, specifically an activation operation on the data in the master processing circuit; and

the addition processing circuit performs the subsequent processing, specifically an addition operation or an accumulation operation.

In one embodiment, the slave processing circuit includes a multiplication processing circuit;

the multiplication processing circuit performs a multiplication operation on the received data block to obtain a multiplication result.

In one embodiment, the slave processing circuit includes an accumulation processing circuit, and the accumulation processing circuit performs an accumulation operation on the multiplication result to obtain the intermediate result.

In one embodiment, the tree module has an n-ary tree structure, where n is an integer greater than or equal to 2.

In one embodiment, the method further includes accessing the storage device in the network-on-chip processing system through the second calculation device in the network-on-chip processing system to obtain second operation data.

In one embodiment, the method further includes performing, through the second calculation device in the network-on-chip processing system, an operation on the second operation data and the first operation result to obtain a second operation result.
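The data flow between the two calculation devices can be sketched as follows. The dictionary standing in for the storage device, the key names, and the placeholder arithmetic (doubling and element-wise addition) are all hypothetical; the text does not specify which operations the devices perform.

```python
def first_device(storage):
    """First calculation device: fetch the first operation data and compute on it."""
    first_data = storage["first_operation_data"]
    return [2 * x for x in first_data]          # placeholder operation


def second_device(storage, first_result):
    """Second calculation device: fetch its own data and combine it with the
    first operation result received over the on-chip network."""
    second_data = storage["second_operation_data"]
    return [a + b for a, b in zip(second_data, first_result)]  # placeholder


storage = {
    "first_operation_data": [1, 2, 3],
    "second_operation_data": [10, 20, 30],
}
first_result = first_device(storage)            # sent directly to the second device
second_result = second_device(storage, first_result)
```

The point of the sketch is that the first operation result is handed to the second device directly, so it does not have to be written to the storage device and read back.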

An embodiment of the present application provides a network-on-chip data processing device that performs machine learning calculations. The device includes:

a first operation data obtaining module configured to access the storage device in the network-on-chip processing system through the first calculation device in the network-on-chip processing system to obtain first operation data;

an operation module configured to perform an operation on the first operation data through the first calculation device to obtain a first operation result; and

a first operation result sending module configured to send the first operation result to the second calculation device in the network-on-chip processing system.

The present application provides a data processing method. The method includes:

receiving a data operation signal sent by an internal or external device, wherein the data operation signal includes an operation domain and an operation code, the operation code includes a first type flag bit, the operation domain includes a second type flag bit, the first type flag bit indicates whether the data operation signal is an I/O instruction, and the second type flag bit indicates whether the data operation signal is a broadcast or multicast instruction among the I/O instructions; and

performing a corresponding operation on the data to be operated on in a memory according to the data operation signal to obtain the required input data.

In one embodiment, the operation domain further includes a data reception flag bit, and the data reception flag bit indicates the device or processing circuit that receives the input data.

In one embodiment, the number of data reception flag bits may indicate the number of devices or the number of processing circuits that can interact with the memory.

The operation domain further includes information on the data to be operated on, the information including a source address of the data to be operated on in the memory, a length of the data to be operated on, and a data return address to be used after the operation; performing the corresponding operation on the data to be operated on in the memory according to the data operation signal to obtain the required input data includes:

reading the memory starting from the source address to obtain input data that satisfies the data length;

determining, according to the data reception flag bit, the device or processing circuit that is to receive the input data; and

returning, according to the data return address, the input data to the storage space corresponding to the data return address in the device or processing circuit.
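The three steps above can be sketched in Python as follows. The model is an assumption made for illustration: the memory and each receiver's local storage are plain lists, and there is one data reception flag bit per receiver, with a set bit selecting that receiver.

```python
def execute_read(memory, source, length, return_addr, receive_flags, receivers):
    """Read `length` items from `source` and return them to every flagged receiver.

    receive_flags: one bit per receiver (hypothetical encoding).
    receivers:     list of per-receiver local storage lists.
    """
    # Step 1: read from the source address until the data length is satisfied.
    input_data = memory[source:source + length]
    # Step 2: decide which devices or processing circuits receive the data.
    targets = [i for i, flag in enumerate(receive_flags) if flag]
    # Step 3: write the data at the return address inside each target's storage.
    for i in targets:
        receivers[i][return_addr:return_addr + length] = input_data
    return targets
```

With flags `[1, 0, 1]`, only receivers 0 and 2 are written, which corresponds to a multicast; setting every bit would correspond to a broadcast under this encoding.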

In one embodiment, the operation domain further includes a jump sub-operation domain, which includes a jump width and a jump data length to be operated on after each jump; reading the memory starting from the source address to obtain input data that satisfies the data length includes:

reading the memory starting from the source address and obtaining first jump data according to the jump data length after the current jump;

obtaining the last address of the jump data and jumping from the last address to a target jump address according to the jump width; and

starting from the target jump address, obtaining second jump data according to the jump data length after the jump, and repeating until the total length of the jump data obtained after each jump satisfies the data length.
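A minimal sketch of this jump (strided) read is shown below. It assumes that the "last address" is the address of the last item read and that the next segment starts at that address plus the jump width; the text does not pin down either convention, so both are illustrative choices, as are the parameter names.

```python
def jump_read(memory, source, data_length, jump_width, jump_data_length):
    """Collect `data_length` items in segments of `jump_data_length`,
    jumping `jump_width` addresses past the last address of each segment."""
    if jump_width < 1 or jump_data_length < 1:
        raise ValueError("jump width and jump data length must be positive")
    result = []
    addr = source
    while len(result) < data_length:
        take = min(jump_data_length, data_length - len(result))
        if addr + take > len(memory):
            raise IndexError("read would run past the end of memory")
        result.extend(memory[addr:addr + take])
        last_address = addr + take - 1          # address of the last item read
        addr = last_address + jump_width        # target jump address
    return result
```

With a 20-element memory holding its own addresses, a source address of 2, a data length of 6, a jump width of 3, and a jump data length of 2, the segments read are addresses 2 to 3, 6 to 7, and 10 to 11.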

In one embodiment, the jump sub-operation domain includes a stride operation domain and/or a segment operation domain; the stride operation domain indicates the jump width of each jump of the data operation signal, and the segment operation domain indicates the preset size of each segment of the data operation signal.

In one embodiment, the operation domain indicates the processing operation to be performed on the read data.

In one embodiment, the method further includes:

if the value of the first type flag bit is I/O, determining that the data operation signal is an I/O instruction; and

if the value of the second type flag bit is 1, determining that the data operation signal is a broadcast or multicast instruction among the I/O instructions.
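The two-level decoding above might look like the following sketch. The `DataOperationSignal` class, its field names, and the labels returned for the cases the text does not describe (a non-I/O signal, or an I/O signal whose second flag is not 1) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataOperationSignal:
    first_type_flag: str    # carried in the operation code, e.g. "I/O"
    second_type_flag: int   # carried in the operation domain, e.g. 1


def classify(signal: DataOperationSignal) -> str:
    """Decode the two type flag bits into an instruction category."""
    if signal.first_type_flag != "I/O":
        return "not an I/O instruction"
    if signal.second_type_flag == 1:
        return "I/O broadcast or multicast instruction"
    return "other I/O instruction"
```

Checking the operation code first means that the second flag is consulted only for signals already known to be I/O instructions.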

In one embodiment, receiving the data operation signal sent by the internal or external device further includes:

parsing the data operation signal to obtain the type flag bit of the data operation signal and the information on the data to be operated on; and

executing the parsed data operation signal according to an instruction queue that indicates the execution order of the data operation signals.

In one embodiment, before the parsed data operation signal is executed according to the instruction queue, the method further includes:

determining the dependency between adjacent parsed data operation signals to obtain a determination result, wherein the dependency indicates whether the s-th data operation signal is associated with the (s-1)-th data operation signal that precedes it; and

if the determination result is that a dependency exists between the s-th data operation signal and the (s-1)-th data operation signal, caching the s-th data operation signal and fetching the s-th data operation signal after the (s-1)-th data operation signal has been executed.

In one embodiment, determining the dependency between adjacent parsed data operation signals includes:

obtaining, according to the s-th data operation signal, a first storage address range of the data required by the s-th data operation signal, and obtaining, according to the (s-1)-th data operation signal, a zeroth storage address range of the data required by the (s-1)-th data operation signal;

if the first storage address range overlaps the zeroth storage address range, determining that the s-th data operation signal and the (s-1)-th data operation signal have a dependency; and

if the first storage address range does not overlap the zeroth storage address range, determining that the s-th data operation signal and the (s-1)-th data operation signal have no dependency.
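Applied to a whole instruction queue, the adjacent-signal check could be sketched as below. The tuple layout `(name, start, end)` with a half-open address range is an assumption, and the function only reports which signals would have to be cached; it does not model the actual execution timing.

```python
def mark_cached_signals(queue):
    """queue: list of (name, start, end) in execution order, end exclusive.

    Returns (name, must_cache) pairs, where must_cache is True when the signal's
    address range overlaps that of the signal immediately before it.
    """
    marks = []
    for s, (name, start, end) in enumerate(queue):
        must_cache = False
        if s > 0:
            _, prev_start, prev_end = queue[s - 1]
            must_cache = start < prev_end and prev_start < end
        marks.append((name, must_cache))
    return marks
```

For the queue `[("a", 0, 8), ("b", 4, 12), ("c", 20, 24)]`, only `b` is marked, since it overlaps `a` on addresses 4 to 7, while `c` touches a disjoint range and can be issued without waiting.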

An embodiment of the present application provides a data processing device including a processor and a memory. The memory stores a computer program, and when the processor executes the computer program, it implements:

the step of performing a corresponding operation on the data to be operated on in a memory according to the data operation signal to obtain the required input data.

The present application provides a data processing method. The method includes:

receiving a data operation signal sent by an internal or external device, wherein the data operation signal includes an operation code, the operation code includes a type flag bit, and the type flag bit indicates that the data operation signal is a broadcast or multicast instruction; and

performing a corresponding operation on the data to be operated on in a memory according to the data operation signal to obtain the required input data.

The step of performing a corresponding operation on the data to be operated on in a memory according to the data operation signal to obtain the required input data is implemented.

In one embodiment, the data operation signal further includes an operation domain, the operation domain includes a data reception flag bit, and the data reception flag bit indicates the device or processing circuit that receives the input data.

In one embodiment, the method further includes:

if the value of the type flag bit is CAST, determining that the data operation signal is a broadcast or multicast instruction.

A data processing device for processing machine learning data, wherein the data processing device includes a machine learning device, a transmission circuit, and a shared memory; the machine learning device is connected to the transmission circuit, and the transmission circuit is connected to the shared memory;

the transmission circuit obtains, according to a data operation signal sent by the machine learning device, the input data required by the machine learning device from the shared memory and returns the input data to the machine learning device; the data operation signal carries a type flag bit of the data operation signal and information on the data to be operated on.

In one embodiment, the machine learning device performs a machine learning operation on the input data to obtain output data.

In one embodiment, the machine learning device further transmits the output data to the shared memory through the transmission circuit for storage.

In one embodiment, the machine learning device includes at least one machine learning unit;

the data operation signal includes a data reception flag bit, and the data reception flag bit indicates the target machine learning unit that receives the input data.

In one embodiment, the value of the type flag bit of the data operation signal includes CAST, which indicates that the data operation signal is a broadcast or multicast instruction.

In one embodiment, the type flag bit of the data operation signal includes a first type flag bit and a second type flag bit.

Here, the value of the first type flag bit includes I/O, which indicates whether the data operation signal is an I/O instruction.

The second type flag bit indicates whether the data operation signal is a broadcast or multicast instruction among the I/O instructions.

In one embodiment, the information on the data to be operated on includes at least one of a source address of the data to be operated on in the shared memory, a length of the data to be operated on, and a data return address to be used after the operation.

In one embodiment, the data operation signal further includes jump information, and the jump information includes a jump width and a data length to be operated on after each jump.

In one embodiment, the jump information includes stride jump information and/or segment jump information;

the stride jump information indicates the jump width of each jump of the data operation signal; and

the segment jump information indicates the preset size of each segment of the data operation signal.

In one embodiment, the data operation signal further includes a function flag bit that indicates the processing operation to be performed on the read data.

An embodiment of the present application further provides a data processing method applied to the data processing device. The method includes:

receiving, by the transmission circuit of the data processing device, a data operation signal sent by the machine learning device of the data processing device, wherein the data operation signal carries a type flag bit of the data operation signal and information on the data to be operated on;

determining, by the transmission circuit, the operation to be performed on the data in the shared memory according to the type flag bit of the data operation signal, performing the operation on the data to be operated on according to the information on the data to be operated on to obtain the input data required by the machine learning device, and returning the input data to the machine learning device; and

performing, by the machine learning device, a machine learning operation on the input data to obtain output data, and, taking the output data as new input data, transmitting the output data through the transmission circuit to the shared memory for storage.

In one embodiment, the machine learning device includes at least one machine learning unit, the data operation signal includes a data reception flag bit, and returning the input data to the machine learning device includes:

determining, by the transmission circuit, the target machine learning unit that is to receive the input data according to the value of the data reception flag bit, and sending the input data to the target machine learning unit.

In one embodiment, the information on the data to be operated on includes a source address of the data to be operated on in the shared memory, a length of the data to be operated on, and a data return address to be used after the operation; performing the operation on the data to be operated on according to the information on the data to be operated on to obtain the input data required by the machine learning device, and returning the input data to the machine learning device, includes:

reading, by the transmission circuit, the shared memory starting from the source address to obtain the input data that satisfies the data length; and

returning, by the transmission circuit, the input data to the target machine learning unit according to the data return address and the data reception flag bit.

The present application provides a data processing device. The data processing device includes a machine learning device, a transmission circuit, and a shared memory; the machine learning device includes at least one machine learning unit; the unicast read operation and the broadcast operation performed by the machine learning unit share one data reception interface; the machine learning unit is connected to the transmission circuit through a sending interface and the shared data reception interface; and the transmission circuit is connected to the shared memory;

the transmission circuit obtains, according to the data operation signal that the machine learning device sends through the sending interface, the input data required by the machine learning device from the shared memory, and returns the input data to the machine learning device through the shared data reception interface.

In one embodiment, the machine learning device further transmits the output data to the shared memory through the transmission circuit for storage.

In one embodiment, the sending interface includes a unicast read signal sending interface and a broadcast signal sending interface; the machine learning unit implements the unicast read operation by having the unicast read signal sending interface and the shared data reception interface each connected to the transmission circuit, and implements the broadcast operation by having the broadcast signal sending interface and the shared data reception interface each connected to the transmission circuit.

In one embodiment, the transmission circuit includes a second transmission interface, a read/write processing circuit connected to the second transmission interface, and an arbitration circuit connected to the read/write processing circuit;

the read/write processing circuit receives the data operation signal that the at least one machine learning unit sends through the sending interface and the second transmission interface, transmits the data operation signal to the arbitration circuit, and returns the data that the arbitration circuit obtains from the shared memory to the machine learning unit corresponding to the data operation signal through the second transmission interface and the shared data reception interface; and

the arbitration circuit arbitrates among the data operation signals received from the read/write processing circuit according to a preset arbitration rule, and operates on the data in the shared memory according to the data operation signal that wins arbitration.
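The text leaves the preset arbitration rule open. As one possible example, the sketch below uses round-robin arbitration among machine learning units; the class name, the integer unit identifiers, and the round-robin rule itself are assumptions and are not taken from the application.

```python
class RoundRobinArbiter:
    """Hypothetical arbiter: grants one requesting unit per cycle, rotating
    priority so that the most recently served unit is checked last."""

    def __init__(self, num_units: int):
        self.num_units = num_units
        self.next_unit = 0   # unit with the highest priority in the next cycle

    def grant(self, requesting: set):
        """Return the id of the unit whose signal wins arbitration, or None."""
        for offset in range(self.num_units):
            unit = (self.next_unit + offset) % self.num_units
            if unit in requesting:
                self.next_unit = (unit + 1) % self.num_units
                return unit
        return None
```

With three units and units 0 and 2 requesting continuously, successive grants go to 0, then 2, then 0 again, so neither unit is starved. A fixed-priority rule would be an equally valid reading of "preset arbitration rule".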

일 실시예에서, 상기 판독/기록 처리회로는 유니캐스트 판독 처리회로, 브로드캐스트 처리회로를 포함하고, 상기 유니캐스트 판독 처리회로는 유니캐스트 판독 신호를 처리하며, 상기 브로드캐스트 처리회로는 브로드캐스트 신호 및/또는 멀티캐스트 신호를 처리한다.In one embodiment, the read/write processing circuitry includes unicast read processing circuitry and broadcast processing circuitry, wherein the unicast read processing circuitry processes a unicast read signal, and the broadcast processing circuitry comprises a broadcast signal and/or process multicast signals.

In one embodiment, the second transmission interface includes at least one group of a unicast read signal reception interface and a unicast read data transmission interface connected to the unicast read processing circuit, and at least one group of a broadcast signal reception interface and a broadcast data transmission interface connected to the broadcast processing circuit. The unicast read signal reception interface is connected to the unicast read signal transmission interface of the machine learning unit, the broadcast signal reception interface is connected to the broadcast signal transmission interface of the machine learning unit, and the unicast read data transmission interface and the broadcast data transmission interface of the transmission circuit are each connected to the shared data reception interface of the machine learning unit.

In one embodiment, the read/write processing circuit includes one broadcast processing circuit and a plurality of unicast read processing circuits. The plurality of unicast read processing circuits are connected one-to-one to the plurality of machine learning units, and the broadcast processing circuit is connected one-to-many to the plurality of machine learning units.

In one embodiment, the second transmission interface includes one group of broadcast interfaces connected to the broadcast processing circuit. The broadcast interfaces include a broadcast signal reception interface and a broadcast data transmission interface, and the plurality of machine learning units are connected to the broadcast processing circuit through this one group of broadcast interfaces.

In one embodiment, the second transmission interface includes a plurality of groups of unicast read signal reception interfaces and shared data transmission interfaces connected one-to-one to the plurality of unicast read processing circuits, and a broadcast signal reception interface connected to the broadcast processing circuit. The shared data transmission interface is also connected to the broadcast processing circuit. The unicast read signal reception interface is connected to the unicast read signal transmission interface of the machine learning unit, the broadcast signal reception interface is connected to the broadcast signal transmission interface of the machine learning unit, and the shared data transmission interface is connected to the shared data reception interface of the machine learning unit.

An embodiment of the present application further provides a data processing method applied to a data processing device. The data processing device includes a machine learning device, a transmission circuit, and a shared memory. The machine learning device includes at least one machine learning unit, the unicast read operation and the broadcast operation performed by the machine learning unit share one data reception interface, the machine learning unit is connected to the transmission circuit through a transmission interface and the shared data reception interface, and the transmission circuit is connected to the shared memory. The data processing method includes:

the machine learning device sending a data operation signal to the transmission circuit through the transmission interface; and

the transmission circuit obtaining, according to the data operation signal, the input data required by the machine learning device from the shared memory, and returning the input data to the machine learning device through the shared data reception interface.

In one embodiment, the data operation signal is a broadcast signal and/or a multicast signal, and returning the input data to the machine learning device through the shared data reception interface includes:

the transmission circuit sending the input data, through the shared data reception interface, to the plurality of machine learning units corresponding to the broadcast signal and/or multicast signal.

The present application provides a data processing device. The data processing device includes a machine learning device, a transmission circuit, and a shared memory. The machine learning device includes at least one machine learning unit, and the machine learning unit includes at least one transmission interface and at least one reception interface. At least two of the unicast read operation, the unicast write operation, and the broadcast operation performed by the machine learning unit share one transmission interface of the machine learning unit. The machine learning unit is connected to the transmission circuit, and the transmission circuit is connected to the shared memory.

The transmission circuit obtains the input data required by the machine learning device from the shared memory according to the data operation signal that the machine learning device sends through the at least one transmission interface of the machine learning unit, and returns the input data to the machine learning device through the reception interface.

In one embodiment, the read/write processing circuit is divided into a plurality of processing circuit groups, with one machine learning unit corresponding to one processing circuit group. Each processing circuit group includes one unicast read processing circuit, one unicast write processing circuit, and one broadcast processing circuit.

In one embodiment, the data returned by the unicast read processing circuit and by the broadcast processing circuit in the processing circuit group share one shared data reception interface on the machine learning unit.

In one embodiment, the at least one transmission interface includes a shared signal transmission interface, which is shared by the unicast write operation and the broadcast operation, and a unicast read signal transmission interface.

In one embodiment, the second transmission interface includes a plurality of interface groups, with one processing circuit group corresponding to one interface group. Each interface group includes: a unicast read signal reception interface and a unicast read data transmission interface connected to the unicast read processing circuit; a unicast read signal reception interface connected to the unicast write processing circuit; and a broadcast signal reception interface and a broadcast data transmission interface connected to the broadcast processing circuit.

In one embodiment, the unicast write processing circuit and the broadcast processing circuit in one processing circuit group share one shared signal reception interface in the corresponding interface group. The shared signal reception interface corresponding to the processing circuit group is connected to the shared signal transmission interface of the machine learning unit corresponding to that processing circuit group, and the unicast read signal reception interface in the processing circuit group is connected to the unicast read signal transmission interface of the machine learning unit corresponding to that processing circuit group.

In one embodiment, the unicast read processing circuit and the broadcast processing circuit in one processing circuit group share one shared data transmission interface in the corresponding interface group, and the shared data transmission interface corresponding to the processing circuit group is connected to the shared data reception interface of the machine learning unit corresponding to that processing circuit group.

In one embodiment, the shared signal reception interface corresponding to the processing circuit group is connected to both the unicast write processing circuit and the broadcast processing circuit in that group. It receives the data operation signal sent from the shared signal transmission interface of the machine learning unit, splits it into two identical data operation signals, and sends one to the unicast write processing circuit and the other to the broadcast processing circuit.
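The fan-out at the shared signal reception interface can be pictured with a small software model. This is only a sketch under stated assumptions: the class name `ProcessingCircuit`, the `kind` and `addr` fields, and the rule that each circuit ignores copies that are not of its own type are hypothetical and do not come from the specification, which only states that the signal is duplicated and sent to both circuits.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingCircuit:
    accepts: frozenset                      # signal kinds this circuit handles (assumed filter)
    handled: list = field(default_factory=list)

    def receive(self, signal: dict) -> None:
        # Assumption: a circuit drops copies that are not of its own type.
        if signal["kind"] in self.accepts:
            self.handled.append(signal)


def shared_signal_receive(signal: dict,
                          write_circuit: ProcessingCircuit,
                          broadcast_circuit: ProcessingCircuit) -> None:
    """Duplicate one incoming signal and forward a copy to each circuit."""
    write_circuit.receive(dict(signal))
    broadcast_circuit.receive(dict(signal))


write_circuit = ProcessingCircuit(frozenset({"unicast_write"}))
broadcast_circuit = ProcessingCircuit(frozenset({"broadcast", "multicast"}))
shared_signal_receive({"kind": "broadcast", "addr": 0x40},
                      write_circuit, broadcast_circuit)
```

In this model only the broadcast circuit keeps the broadcast signal; a hardware implementation could equally gate the copies with a type bit before they reach the circuits.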

An embodiment of the present application further provides a data processing method applied to a data processing device. The data processing device includes a machine learning device, a transmission circuit, and a shared memory. The machine learning device includes at least one machine learning unit, and the machine learning unit includes at least one transmission interface and at least one reception interface. At least two of the unicast read operation, the unicast write operation, and the broadcast operation performed by the machine learning unit share one transmission interface on the machine learning unit. The machine learning unit is connected to the transmission circuit, and the transmission circuit is connected to the shared memory. The method includes:

the machine learning device sending a data operation signal to the transmission circuit through the at least one transmission interface; and

the transmission circuit obtaining, according to the data operation signal, the input data required by the machine learning device from the shared memory, and returning the input data to the machine learning device through the reception interface.

In one embodiment, the data operation signal is a broadcast signal and/or a multicast signal, and returning the input data to the machine learning device through the reception interface includes:

the transmission circuit sending the input data, through the shared data reception interface, to the plurality of machine learning units corresponding to the broadcast signal and/or multicast signal.

The present application provides a data processing device. The data processing device includes a machine learning device, a transmission circuit, and a shared memory. The machine learning device is connected to the transmission circuit through a first transmission interface, and the transmission circuit is connected to the shared memory.

The transmission circuit obtains the input data required by the machine learning device from the shared memory according to the data operation signal sent by the machine learning device, and returns the input data to the machine learning device. The data operation signal indicates how the data in the shared memory is to be operated on.

In one embodiment, the machine learning device includes at least one machine learning unit, and the machine learning unit includes at least one computing unit and a controller unit connected to the computing unit. The computing unit includes one master processing circuit and a plurality of slave processing circuits, and is connected to the transmission circuit through the first transmission interface.

The controller unit sends the data operation signal and the output data to the transmission circuit through a transmission interface of the first transmission interface, receives through a reception interface of the first transmission interface the input data that the transmission circuit obtains from the shared memory, and sends the input data to the master processing circuit and/or the slave processing circuits.

The master processing circuit distributes the input data to the plurality of slave processing circuits. The plurality of slave processing circuits perform intermediate operations on the data transmitted by the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit.

The master processing circuit further performs subsequent processing on the plurality of intermediate results to obtain a calculation result.
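The distribute, compute, and combine flow between the master and slave processing circuits can be sketched in software as follows. The concrete operations are assumptions chosen for illustration: the specification does not say what the intermediate operation or the subsequent processing is, so this sketch uses a partial dot product on each slave and a final accumulation on the master. The function names are hypothetical.

```python
def slave_multiply_accumulate(neurons, weights):
    """Intermediate operation on one slave (assumed): a partial dot product."""
    return sum(n * w for n, w in zip(neurons, weights))


def master_compute(neurons, weights, num_slaves):
    """Distribute slices to the slaves, then accumulate their intermediate results."""
    if not neurons:
        return 0
    # The master hands each slave a contiguous slice of the input data.
    chunk = -(-len(neurons) // num_slaves)  # ceiling division
    intermediate = [
        slave_multiply_accumulate(neurons[i:i + chunk], weights[i:i + chunk])
        for i in range(0, len(neurons), chunk)
    ]
    # Subsequent processing on the master (assumed): sum the intermediate results.
    return sum(intermediate)
```

Calling `master_compute([1, 2, 3, 4], [1, 1, 1, 1], 2)` splits the input into two slices of two elements and adds the two partial sums.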

In one embodiment, the structure formed by the master processing circuit and the slave processing circuits includes at least one of an H-type structure, a systolic array structure, and a tree structure.

In one embodiment, the transmission circuit includes a second transmission interface, at least one read/write processing circuit connected to the second transmission interface, and an arbitration circuit connected to the read/write processing circuit. The at least one machine learning unit is connected to the transmission circuit by connecting the first transmission interface to the second transmission interface.

The read/write processing circuit receives the data operation signal that the at least one machine learning unit sends through the first transmission interface and the second transmission interface, forwards the data operation signal to the arbitration circuit, and sends the data read from the shared memory to the at least one machine learning unit through the second transmission interface.

The arbitration circuit arbitrates the data operation signals received from the at least one read/write processing circuit according to a preset arbitration rule, and operates on the data in the shared memory according to the data operation signal that wins arbitration.

In one embodiment, the read/write processing circuit includes one or more of a unicast read processing circuit, a unicast write processing circuit, and a broadcast processing circuit. The data operation signal includes one or more of a unicast read request, a unicast write request, a unicast read command, a unicast write command, a multicast command, and a broadcast command.

Here, a unicast-type processing circuit processes unicast-type signals, and a broadcast-type processing circuit processes multicast-type or broadcast-type signals.

In one embodiment, if the data operation signal is a command-type signal, the read/write processing circuit parses the command-type signal to generate a request-type signal and sends the request-type signal to the arbitration circuit.

In one embodiment, if the data operation signal is a multicast command, the multicast command includes the identifiers of a plurality of target machine learning units that are to receive the data.

The read/write processing circuit sends the data that the arbitration circuit obtains from the shared memory to the plurality of target machine learning units.

In one embodiment, if the data operation signal is a broadcast command, the read/write processing circuit sends the data that the arbitration circuit obtains from the shared memory to all machine learning units.
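How the set of receivers follows from the signal type can be illustrated with a short sketch. The dictionary layout (`kind`, `source`, `addr`, `targets`) and the function names are hypothetical; the text only states that a multicast command carries target identifiers and that a broadcast command reaches every machine learning unit.

```python
def resolve_receivers(signal: dict, all_units: list) -> list:
    """Return the machine learning unit ids that should receive the data."""
    kind = signal["kind"]
    if kind == "multicast":
        # A multicast command names its target units explicitly.
        return list(signal["targets"])
    if kind == "broadcast":
        # A broadcast command reaches every machine learning unit.
        return list(all_units)
    if kind == "unicast_read":
        # A unicast read returns data only to the requesting unit.
        return [signal["source"]]
    raise ValueError(f"unsupported signal kind: {kind}")


def deliver(signal: dict, shared_memory: dict, all_units: list) -> dict:
    """Read once from shared memory and hand the same data to every receiver."""
    data = shared_memory[signal["addr"]]
    return {unit: data for unit in resolve_receivers(signal, all_units)}
```

The point of the sketch is that the shared memory is read a single time per signal, regardless of how many units receive the result.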

In one embodiment, the input data includes input neuron data and/or weights, and the output data includes output neuron data.

In one embodiment, the data processing device is divided into at least one cluster, and each cluster includes a plurality of machine learning units, one transmission circuit, and at least one shared memory.

The transmission circuit further includes a first-type direct memory access (DMA) controller connected to the arbitration circuit in its own cluster and to the shared memory in that cluster, and/or a second-type DMA connected to the arbitration circuit in its own cluster and to the shared memory in another cluster.

The first-type DMA controls data interaction between the arbitration circuit in the cluster and the shared memory in the cluster.

The second-type DMA controls data interaction between the arbitration circuit in the cluster and the shared memory in other clusters, and data interaction between the arbitration circuit in the cluster and the off-chip memory.

In one embodiment, the transmission circuit further includes a first selection transmission circuit connected to the first-type DMA and a second selection transmission circuit connected to the second-type DMA.

The first selection transmission circuit is selectively connected to the shared memory in its own cluster.

The second selection transmission circuit is selectively connected to the shared memory in its own cluster and in other clusters, and to the off-chip memory.
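The division of work between the two DMA types can be summarized as a routing decision. The function below is a sketch under assumptions: the name `select_dma`, the string labels, and the choice to always prefer the first-type DMA for the local shared memory are illustrative only. The specification also allows the second selection transmission circuit to reach the shared memory of its own cluster, which this sketch does not model.

```python
from typing import Optional


def select_dma(own_cluster: int, target: str,
               target_cluster: Optional[int] = None) -> str:
    """Pick the DMA type for an access issued by `own_cluster`.

    `target` is "shared_memory" or "off_chip" (hypothetical labels).
    """
    if target == "off_chip":
        # Off-chip memory is only reachable through the second-type DMA.
        return "second_type_dma"
    if target == "shared_memory":
        if target_cluster is None or target_cluster == own_cluster:
            # Assumption: local shared memory always goes through the first-type DMA.
            return "first_type_dma"
        # Shared memory in another cluster goes through the second-type DMA.
        return "second_type_dma"
    raise ValueError(f"unknown target: {target}")
```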

In one embodiment, the transmission circuit further includes a caching circuit connected to the arbitration circuit and the shared memory. The caching circuit temporarily stores the data that the arbitration circuit obtains from the shared memory and the data that the arbitration circuit writes to the shared memory.

In one embodiment, the transmission bandwidth between the transmission circuit and the shared memory is greater than the transmission bandwidth between the transmission circuit and the machine learning unit.

The present application provides a data processing device. The data processing device processes machine learning data and includes a machine learning device, a transmission circuit, and a shared memory. The transmission circuit includes a plurality of read/write processing circuits and one arbitration circuit. The machine learning device includes a plurality of machine learning units, each of which includes at least one computing unit. The plurality of machine learning units are connected to the transmission circuit through a first transmission interface, and the transmission circuit is connected to the shared memory.

The arbitration circuit arbitrates the data operation signals sent by the plurality of machine learning units, and obtains the input data required by the machine learning device from the shared memory according to the data operation signal that wins arbitration.

The read/write processing circuit determines a target machine learning unit or a target computing unit from among the plurality of machine learning units according to the address information carried in the data operation signal that wins arbitration, or according to the type of that data operation signal, and returns the input data to the target machine learning unit or target computing unit.

In one embodiment, the arbitration circuit determines the priorities of the data operation signals sent by the plurality of read/write processing circuits and takes the data operation signal with the highest priority as the data operation signal that wins arbitration.

In one embodiment, when the data operation signals sent by the plurality of read/write processing circuits have the same priority, the arbitration circuit determines the data operation signal that wins arbitration according to the types of the data operation signals and preset execution conditions.

In one embodiment, if the data operation signal is a unicast-type signal, the execution condition includes: the channel of the machine learning unit that sends the unicast-type signal is idle; or the channel of the computing unit, within the machine learning unit, that sends the unicast-type signal is idle.

In one embodiment, if the data operation signal is a multicast-type signal, the execution condition includes: the channel of the machine learning unit that sends the multicast-type signal is idle and the channels of the target machine learning units designated by the multicast-type signal are idle; or the channel of the computing unit, within the machine learning unit, that sends the multicast-type signal is idle and the channels of the target computing units designated by the multicast-type signal are idle.

In one embodiment, if the data operation signal is a broadcast-type signal, the execution condition includes: the channel of the machine learning unit that sends the broadcast-type signal is idle and the channels of all other machine learning units are idle; or the channel of the computing unit, within the machine learning unit, that sends the broadcast-type signal is idle and the channels of the computing units in all other machine learning units are idle.
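The priority rule and the type-dependent execution conditions can be combined into one arbitration routine. The following is a minimal sketch at machine-learning-unit granularity, not the circuit itself. The field names (`kind`, `source`, `targets`, `priority`), the `idle` map, and the tie-break order (the first tied signal whose condition holds wins) are assumptions; the specification does not say how to choose among several tied signals that all satisfy their conditions.

```python
def execution_condition_met(signal: dict, idle: dict) -> bool:
    """`idle` maps a machine learning unit id to True when its channel is idle."""
    kind, src = signal["kind"], signal["source"]
    if kind == "unicast":
        return idle[src]
    if kind == "multicast":
        return idle[src] and all(idle[t] for t in signal["targets"])
    if kind == "broadcast":
        # Sender idle and every other unit idle, i.e. all channels idle.
        return all(idle.values())
    raise ValueError(f"unsupported signal kind: {kind}")


def arbitrate(signals: list, idle: dict):
    """Return the signal that wins arbitration, or None if none can proceed."""
    if not signals:
        return None
    top = max(s["priority"] for s in signals)
    candidates = [s for s in signals if s["priority"] == top]
    if len(candidates) == 1:
        # A unique highest priority wins outright.
        return candidates[0]
    # Equal priorities: fall back to the type-specific execution condition.
    for s in candidates:  # assumed tie-break: arrival order
        if execution_condition_met(s, idle):
            return s
    return None
```

With two tied signals, a unicast from an idle unit can proceed while a broadcast is held back as long as any unit's channel is busy.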

In one embodiment, the transmission circuit further includes a second transmission interface. Each interface of the second transmission interface is connected one-to-one to a corresponding interface of the first transmission interface, and each machine learning unit is connected to one corresponding read/write processing circuit.

In one embodiment, the plurality of computing units in one machine learning unit share one transmission interface in the first transmission interface, and each computing unit corresponds to its own data reception interface.

In one embodiment, each of the plurality of computing units in one machine learning unit corresponds to its own transmission interface and its own data reception interface in the first transmission interface.

In one embodiment, the transmission circuit further includes a second transmission interface, and the plurality of machine learning units share one signal reception interface and one data return interface in the second transmission interface.

In one embodiment, the read/write processing circuit includes a signal queue that stores the data operation signals sent by each machine learning unit.

The read/write processing circuit is further configured to determine, on receiving a data operation signal, whether the request queue has remaining space; if so, it caches the data operation signal in the request queue, and if not, it blocks the data operation signal.
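The space check on the queue amounts to a bounded buffer with back-pressure. In the sketch below, the class name `SignalQueue`, the `capacity` parameter, and the choice to model "blocking" as a rejected offer that the sender must retry are all assumptions made for illustration.

```python
from collections import deque


class SignalQueue:
    """Bounded queue for data operation signals (illustrative model)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = deque()

    def offer(self, signal) -> bool:
        """Cache the signal if space remains; otherwise block it (return False)."""
        if len(self.items) < self.capacity:
            self.items.append(signal)
            return True
        return False

    def take(self):
        """Remove and return the oldest cached signal, freeing one slot."""
        return self.items.popleft()
```

A sender whose offer is refused would hold the signal and present it again once `take` has freed a slot.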

In one embodiment, if the read/write processing circuit is a broadcast processing circuit, the signal queue includes a command queue and a request queue.

The command queue caches the command-type signals received by the broadcast processing circuit.

The request queue caches the request-type signals obtained by parsing the command-type signals.
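The two-queue arrangement can be modeled as a small pipeline: commands wait in one queue, are parsed, and the resulting requests wait in a second queue for arbitration. How a command is parsed is not described in the text, so the rule used here (one command covering `length` consecutive addresses expands into one request per address) is purely an assumption, as are the class and field names.

```python
from collections import deque


def parse_command(command: dict) -> list:
    """Assumed parsing rule: expand a command into one request per address."""
    return [
        {"kind": command["kind"],
         "addr": command["addr"] + i,
         "targets": command.get("targets", ())}
        for i in range(command["length"])
    ]


class BroadcastProcessingCircuit:
    """Illustrative model with a command queue feeding a request queue."""

    def __init__(self):
        self.command_queue = deque()   # command-type signals as received
        self.request_queue = deque()   # request-type signals awaiting arbitration

    def receive_command(self, command: dict) -> None:
        self.command_queue.append(command)

    def parse_next(self) -> None:
        """Parse the oldest command and cache the resulting requests."""
        if self.command_queue:
            command = self.command_queue.popleft()
            self.request_queue.extend(parse_command(command))
```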

In one embodiment, the machine learning unit further includes a controller unit connected to the computing unit. The computing unit includes one master processing circuit and a plurality of slave processing circuits, and is connected to the transmission circuit through the first transmission interface.

The controller unit sends the data operation signal and the output data to the transmission circuit through the sending interface in the first transmission interface, receives through the receiving interface in the first transmission interface the input neuron data and the weights that the transmission circuit obtained from the shared memory, and sends the input neuron data and the weights to the master processing circuit and/or the slave processing circuits.

The master processing circuit distributes the input data to the plurality of slave processing circuits. The plurality of slave processing circuits perform intermediate operations on the neuron data and weights transmitted by the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit.

The master processing circuit further performs subsequent processing on the plurality of intermediate results to obtain a calculation result.
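The division of work between the master and slave processing circuits can be sketched with a dot product. This is an illustration under stated assumptions only: the application does not fix the operation or the way data is split, so the interleaved split, the partial dot product on each slave, and the final summation are all choices made for the example, as are the function names.

```python
def slave_compute(neurons, weights):
    """Intermediate operation on one slave: a partial dot product."""
    return sum(n * w for n, w in zip(neurons, weights))


def master_compute(neurons, weights, num_slaves):
    """Distribute the inputs, collect intermediate results, then combine them."""
    # Distribution: slave i receives every num_slaves-th element (assumed split).
    partials = [
        slave_compute(neurons[i::num_slaves], weights[i::num_slaves])
        for i in range(num_slaves)
    ]
    # Subsequent processing: here simply the sum of the intermediate results.
    return sum(partials), partials


result, partials = master_compute([1, 2, 3, 4], [10, 20, 30, 40], num_slaves=2)
```

With two slaves, the intermediate results are 100 and 200, and the master combines them into 300, which equals the full dot product.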

In one embodiment, the input data includes input data, and the output data includes output data.

It should be understood that the foregoing general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.

BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings are incorporated into and constitute part of the specification, illustrate embodiments of the present application, and are used together with the specification to explain the principles of the present application.
FIG. 1 is a block diagram of a network on-chip processing system 1100 according to an embodiment.
FIG. 2 is a block diagram of a network on-chip processing system 1200 according to an embodiment.
FIG. 3 is a block diagram of a network on-chip processing system 1300 according to an embodiment.
FIG. 4 is a block diagram of a network on-chip processing system 1400 according to an embodiment.
FIG. 5A is a block diagram of a network on-chip processing system 1500 according to an embodiment.
FIG. 5B is a block diagram of a network on-chip processing system 15000 according to an embodiment.
FIG. 6 is a block diagram of a network on-chip processing system 1600 according to an embodiment.
FIG. 7 is a block diagram of a network on-chip processing system 1700 according to an embodiment.
FIG. 8 is a block diagram of a network on-chip processing system 1800 according to an embodiment.
FIG. 9 is a block diagram of a network on-chip processing system 1900 according to an embodiment.
FIG. 10A is a block diagram of a network on-chip processing system 1910 according to an embodiment.
FIG. 10B is a block diagram of a network on-chip processing system 19100 according to an embodiment.
FIG. 11 is a block diagram of a network on-chip processing system 1920 according to an embodiment.
FIG. 12 is a block diagram of a network on-chip processing system 1930 according to an embodiment.
FIG. 13 is a configuration diagram of a computing device according to an embodiment.
FIG. 14 is a configuration diagram of a computing device according to another embodiment.
FIG. 15 is a configuration diagram of a master processing circuit according to an embodiment.
FIG. 16 is a configuration diagram of a computing device according to another embodiment.
FIG. 17 is a configuration diagram of a computing device according to another embodiment.
FIG. 18 is a configuration diagram of a tree module according to an embodiment.
FIG. 19 is a configuration diagram of a computing device according to another embodiment.
FIG. 20 is a configuration diagram of a computing device according to another embodiment.
FIG. 21 is a configuration diagram of a computing device according to another embodiment.
FIG. 22 is a configuration diagram of an integrated processing device according to an embodiment.
FIG. 23 is a configuration diagram of an integrated processing device according to another embodiment.
FIG. 24 is a configuration diagram of a board card according to an embodiment.
FIG. 25 is a flowchart of a network on-chip data processing method according to an embodiment.
FIG. 26 is a flowchart of a network on-chip data processing method according to another embodiment.
FIG. 27 is a flowchart of a network on-chip data processing method according to another embodiment.
FIG. 28 is a flowchart of a network on-chip data processing method according to another embodiment.
FIG. 29 is a flowchart of a network on-chip data processing method according to another embodiment.
FIG. 30 is a flowchart of a network on-chip data processing method according to another embodiment.
FIG. 31 is a schematic diagram of an application environment of a data processing method according to an embodiment.
FIG. 32 is a flowchart of a data processing method according to an embodiment.
FIG. 33 is a flowchart of a data processing method according to an embodiment.
FIG. 34 is a flowchart of a data processing method according to an embodiment.
FIG. 35 is a flowchart of a data processing method according to an embodiment.
FIG. 36 is a flowchart of a data processing method according to an embodiment.
FIG. 37 is a flowchart of a data processing method according to an embodiment.
FIG. 38 is a flowchart of a data processing method according to an embodiment.
FIG. 39 is a flowchart of a data processing method according to an embodiment.
FIG. 40 is a flowchart of a data processing method according to an embodiment.
FIG. 41 is a flowchart of a data processing method according to an embodiment.
FIG. 42 is a flowchart of a data processing method according to an embodiment.
FIG. 43 is a configuration diagram of a data processing device according to an embodiment.
FIG. 44 is a configuration diagram of a machine learning unit according to an embodiment.
FIG. 45 is a configuration diagram of a data processing device according to an embodiment.
FIG. 46 is a configuration diagram of a data processing device according to an embodiment.
FIG. 47 is a configuration diagram of a data processing device according to an embodiment.
FIG. 48 is a configuration diagram of a data processing device according to an embodiment.
FIG. 49 is a configuration diagram of a machine learning unit according to an embodiment.
FIG. 50 is a configuration diagram of a data processing device according to an embodiment.
FIG. 51 is a configuration diagram of a data processing device according to an embodiment.
FIG. 52 is a configuration diagram of a data processing device according to an embodiment.
FIG. 53 is a configuration diagram of a data processing device according to an embodiment.
FIG. 54 is a configuration diagram of a data processing device according to an embodiment.
FIG. 55 is a configuration diagram of a data processing device according to an embodiment.
FIG. 56 is a configuration diagram of a data processing device according to an embodiment.
FIG. 56A is a configuration diagram of a machine learning device according to an embodiment.
FIG. 57 is a configuration diagram of a transmission circuit according to an embodiment.
FIG. 57A is a configuration diagram of a transmission circuit according to an embodiment.
FIG. 57B is a configuration diagram of a transmission circuit according to an embodiment.
FIG. 58 is a configuration diagram of a transmission circuit in a cluster according to an embodiment.
FIG. 59 is a configuration diagram of a transmission circuit in another cluster according to an embodiment.
FIG. 60 is a configuration diagram of another transmission circuit according to an embodiment.
FIG. 61 is a configuration diagram of a data processing device according to an embodiment.
FIG. 62 is a configuration diagram of a machine learning unit according to an embodiment.
FIG. 63 is a configuration diagram of a data processing device according to an embodiment.
FIG. 64 is a configuration diagram of a data processing device according to an embodiment.
FIG. 65 is a configuration diagram of a data processing device according to an embodiment.
FIG. 66 is a configuration diagram of a data processing device according to an embodiment.

Exemplary embodiments are described in detail below, and examples of them are shown in the accompanying drawings. In the following description of the drawings, unless otherwise indicated, the same reference numerals in different drawings denote the same or similar elements. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present application. Rather, they are merely examples of devices and methods consistent with aspects of the present application as set forth in the appended claims.

In one embodiment, a network on-chip processing system according to the present application includes a storage device and a plurality of computing devices. The storage device and the plurality of computing devices are arranged on the same chip, at least one computing device is connected to the storage device, and at least two computing devices are connected to each other.

Here, a network on chip (NoC) refers to an on-chip communication network that integrates a large amount of computing resources on a single chip and connects those resources. Optionally, each computing device on the chip may access the network on chip through its own interface and use the shared network resources to communicate with the target module it wishes to reach. Specifically, arranging the storage device and the plurality of computing devices on the same chip means integrating the storage device and the plurality of computing devices on one chip. The processor cores in a computing device are connected to off-chip storage devices through the NoC, and the NoC also supports communication among the multiple cores of a processor.

The network on-chip processing systems according to the embodiments of the present application all implement on-chip communication based on an NoC. In addition, these systems support off-chip storage as well as on-chip storage. In other words, while the neural network processor is processing, operation data may be stored in an on-chip storage device or in an off-chip storage device. Because the on-chip storage capacity of a network on-chip processing system is limited, operation data and intermediate results generated during operation may be temporarily stored in an off-chip storage device and read from it into the NoC when needed. In the embodiments of the present application, every storage device in the network on-chip processing system refers to an on-chip storage device, and the computing devices in the network on-chip processing system include neural network processors.

In one embodiment, the present application further provides a network on-chip processing system. The system includes a storage device and a plurality of computing devices, the plurality of computing devices include a first computing device and a plurality of second computing devices, and the storage device and the plurality of computing devices are arranged on the same chip. The first computing device is connected to the storage device, and at least one of the plurality of second computing devices is connected to the first computing device.

In one embodiment, a neural network chip is provided. The chip includes a storage device, a plurality of computing devices, a first interconnect device, and a second interconnect device. At least one computing device is connected to the storage device through the first interconnect device, and the plurality of computing devices are connected to one another through the second interconnect device. Furthermore, a computing device may perform read/write operations on the storage device through the first interconnect device, and the plurality of computing devices may transfer data among themselves through the second interconnect device.
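The separation between the two interconnect devices can be modeled in a few lines. This is a hedged sketch rather than the chip's actual design: the class `NeuralNetworkChip`, its method names, the dictionary-backed storage, and the per-device mailboxes are hypothetical stand-ins for the first interconnect device (storage access) and the second interconnect device (device-to-device transfer).

```python
class NeuralNetworkChip:
    """Toy model: storage access via one interconnect, peer transfer via another."""

    def __init__(self, devices, storage_attached):
        self.storage = {}
        # First interconnect: only these devices can reach the storage device.
        self.storage_attached = set(storage_attached)
        # Second interconnect: every device has an inbox for peer transfers.
        self.mailboxes = {device: [] for device in devices}

    def _check_attached(self, device):
        if device not in self.storage_attached:
            raise PermissionError(f"{device} is not connected to the storage device")

    def write(self, device, address, value):
        self._check_attached(device)
        self.storage[address] = value

    def read(self, device, address):
        self._check_attached(device)
        return self.storage[address]

    def send(self, source, target, value):
        self.mailboxes[target].append((source, value))


chip = NeuralNetworkChip(devices=["dev0", "dev1"], storage_attached=["dev0"])
chip.write("dev0", 0, 42)
chip.send("dev0", "dev1", chip.read("dev0", 0))
```

Here `dev1` never touches the storage device; it receives the value from `dev0` over the modeled second interconnect.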

The network on-chip processing system and the neural network chip are each described below.

As shown in FIG. 1, a network on-chip processing system 1100 according to an embodiment includes a storage device 1101, a first computing device 1102, a second computing device 1103, and a second computing device 1104, all of which are arranged on the same chip of the network on-chip processing system 1100. The first computing device 1102 is connected to the storage device 1101, the second computing device 1103 is connected to the first computing device 1102, and the second computing device 1103 is also connected to the second computing device 1104. Only the first computing device 1102 can access the storage device 1101; in other words, only the first computing device 1102 can read data from and write data to the storage device 1101. The first computing device 1102, the second computing device 1103, and the second computing device 1104 can transfer data among one another.

Specifically, when the second computing device 1104 needs to read data, the first computing device 1102 accesses the storage device 1101, reads the data required by the second computing device 1104, and sends the data to the second computing device 1103, which in turn sends the data to the second computing device 1104. Optionally, in addition to the first computing device 1102, the second computing device 1103 and the second computing device 1104 may also be connected to the storage device 1101; it is sufficient that at least one of the first computing device 1102, the second computing device 1103, and the second computing device 1104 is connected to the storage device 1101, and no particular limitation is imposed here. Optionally, the second computing device 1103 may be connected to the second computing device 1104 or to the first computing device 1102; it is sufficient that at least two of the first computing device 1102, the second computing device 1103, and the second computing device 1104 are connected to each other, and no particular limitation is imposed here.
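The read path described for FIG. 1 can be reproduced with a breadth-first search over the links between devices. The following is a minimal illustration and not part of the application: the function name `read_route` and the use of reference numerals as node identifiers are assumptions made for the example.

```python
from collections import deque


def read_route(links, storage, requester):
    """Shortest hop sequence from the storage device to the requester (BFS)."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    queue = deque([storage])
    parent = {storage: None}
    while queue:
        node = queue.popleft()
        if node == requester:
            # Walk the parent pointers back to the storage device.
            route = []
            while node is not None:
                route.append(node)
                node = parent[node]
            return route[::-1]
        for nxt in neighbors.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # the requester cannot be reached from the storage device


# Links of FIG. 1: 1101-1102, 1102-1103, 1103-1104.
fig1_links = [(1101, 1102), (1102, 1103), (1103, 1104)]
route = read_route(fig1_links, storage=1101, requester=1104)
```

The resulting route passes through 1102 and then 1103 before reaching 1104, matching the forwarding order in the text.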

As shown in FIG. 2, a network on-chip processing system 1200 according to an embodiment includes a storage device 1201, a first computing device 1202, a second computing device 1203, and a second computing device 1204, all of which are arranged on the same chip of the network on-chip processing system 1200. The first computing device 1202 is connected to the storage device 1201, and the second computing device 1203 and the second computing device 1204 are each directly connected to the first computing device 1202. That is, the second computing device 1204 is connected not only to the second computing device 1203 but also to the first computing device 1202, so it does not need to reach the first computing device 1202 through the second computing device 1203. Only the first computing device 1202 can access the storage device 1201; in other words, only the first computing device 1202 can read data from and write data to the storage device 1201. The first computing device 1202, the second computing device 1203, and the second computing device 1204 can transfer data among one another.

Specifically, when the second computing device 1204 needs to read data, the first computing device 1202 accesses the storage device 1201, reads the data required by the second computing device 1204, and sends the data directly to the second computing device 1204, so the data does not need to be forwarded through the second computing device 1203. Optionally, the first computing device 1202, the second computing device 1203, and the second computing device 1204 may all be connected to the storage device 1201; it is sufficient that at least one of them is connected to the storage device 1201, and no particular limitation is imposed here. Optionally, the second computing device 1203 may be connected to the second computing device 1204 or to the first computing device 1202; it is sufficient that at least two of the first computing device 1202, the second computing device 1203, and the second computing device 1204 are connected to each other, and no particular limitation is imposed here.

In the above network on-chip processing system, the plurality of computing devices arranged on the same chip are connected to one another, so they can transfer data among themselves. This avoids the excessive connection bandwidth overhead that arises when every computing device reads data from the storage device, and it also improves data read/write efficiency.
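The bandwidth argument can be made concrete by counting storage accesses. The sketch below assumes, for illustration only, that every computing device needs the same block of data; the `Storage` class, its read counter, and the two fetch functions are hypothetical.

```python
class Storage:
    """Storage device model that counts how often it is read."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.data[key]


def fetch_all_direct(storage, key, num_devices):
    """Every computing device reads the storage device itself."""
    return [storage.read(key) for _ in range(num_devices)]


def fetch_with_forwarding(storage, key, num_devices):
    """One computing device reads once; the others receive forwarded copies."""
    value = storage.read(key)
    return [value] * num_devices


direct = Storage({"weights": 7})
shared = Storage({"weights": 7})
direct_values = fetch_all_direct(direct, "weights", num_devices=3)
shared_values = fetch_with_forwarding(shared, "weights", num_devices=3)
```

Both approaches deliver the same values, but the forwarding variant touches the storage device once instead of three times. The cost moves to device-to-device links, which the sketch does not model.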

In one embodiment, the present application provides a network on-chip processing system. The system includes a storage device and a plurality of computing devices, and the storage device and the plurality of computing devices are arranged on the same chip. Each of the plurality of computing devices is connected to the storage device, and at least two computing devices are connected to each other.

As shown in FIG. 3, a network on-chip processing system 1300 according to an embodiment includes a storage device 1301, a computing device 1302, a computing device 1303, and a computing device 1304, all of which are arranged on the same chip of the network on-chip processing system 1300. The computing device 1302, the computing device 1303, and the computing device 1304 are all connected to the storage device 1301; the computing device 1302 and the computing device 1303 are connected to each other, and the computing device 1303 and the computing device 1304 are connected to each other. The computing device 1302, the computing device 1303, and the computing device 1304 can all access the storage device 1301. The computing device 1302 and the computing device 1303 can transfer data to each other, as can the computing device 1303 and the computing device 1304.

Specifically, when the computing device 1304 needs to read data, it may access the storage device 1301 directly. Alternatively, the computing device 1303 may access the storage device 1301, read the data required by the computing device 1304, and send the data to the computing device 1304. As a further alternative, the computing device 1302 may access the storage device 1301, read the data required by the computing device 1304, and send the data to the computing device 1303, which then forwards it to the computing device 1304. Optionally, it is sufficient that at least one of the computing device 1302, the computing device 1303, and the computing device 1304 is connected to the storage device 1301, and no particular limitation is imposed here. Optionally, it is sufficient that at least two of the computing device 1302, the computing device 1303, and the computing device 1304 are connected to each other, and no particular limitation is imposed here.
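The three read routes described for FIG. 3 are exactly the loop-free paths from the storage device 1301 to the computing device 1304 in the link graph. The depth-first enumeration below is a sketch for illustration; the function name `simple_paths` and the encoding of devices as integers are assumptions.

```python
def simple_paths(links, start, goal):
    """List every loop-free path from start to goal, shortest first."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    def walk(node, path):
        if node == goal:
            yield path
            return
        for nxt in neighbors.get(node, []):
            if nxt not in path:  # never revisit a device on the same path
                yield from walk(nxt, path + [nxt])

    return sorted(walk(start, [start]), key=len)


# Links of FIG. 3: every computing device reaches the storage device 1301,
# and 1302-1303 and 1303-1304 are connected to each other.
fig3_links = [(1301, 1302), (1301, 1303), (1301, 1304), (1302, 1303), (1303, 1304)]
paths = simple_paths(fig3_links, start=1301, goal=1304)
```

The enumeration yields the direct access, the route through 1303, and the route through 1302 and then 1303, and no others.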

In the above network on-chip processing system, the plurality of computing devices arranged on the same chip are connected to one another, so the data required by any one computing device can be transferred among the computing devices. The system can therefore reduce the number of computing devices that read the storage device interface at the same time, which reduces bandwidth congestion.

As shown in FIG. 4, a network on-chip processing system 1400 according to an embodiment includes a storage device 1401, a computing device 1402, a computing device 1403, and a computing device 1404, all of which are arranged on the same chip of the network on-chip processing system 1400. The computing device 1402, the computing device 1403, and the computing device 1404 are all connected to the storage device 1401, and the three computing devices are connected to one another. The computing device 1402, the computing device 1403, and the computing device 1404 can all access the storage device 1401 and can transfer data among one another.

Specifically, when the computing device 1404 needs to read data, it may access the storage device 1401 directly. Alternatively, the computing device 1403 may access the storage device 1401, read the data required by the computing device 1404, and send the data to the computing device 1404. As a further alternative, the computing device 1402 may access the storage device 1401, read the data required by the computing device 1404, and send the data directly to the computing device 1404, so the data does not need to be forwarded through the computing device 1403. Optionally, it is sufficient that at least one of the computing device 1402, the computing device 1403, and the computing device 1404 is connected to the storage device 1401, and no particular limitation is imposed here. Optionally, it is sufficient that at least two of the computing device 1402, the computing device 1403, and the computing device 1404 are connected to each other, and no particular limitation is imposed here.

In the above network on-chip processing system, the plurality of computing devices arranged on the same chip are connected to one another directly rather than through the storage device, so any two computing devices can transfer data to each other directly, which improves data read/write efficiency.

일 실시예에서, 본 출원은 네트워크 온칩 처리 시스템을 더 제공한다. 상기 시스템은 저장 장치 및 복수의 계산 장치 그룹을 포함하고, 상기 저장 장치 및 복수의 상기 계산 장치 그룹은 동일한 칩 위에 설치되고, 각각의 계산 장치 그룹마다 복수의 계산 장치를 포함한다. 여기서, 상기 복수의 계산 장치 그룹에서 적어도 하나 이상의 계산 장치 그룹은 상기 저장 장치에 연결되고, 적어도 두개 이상의 계산 장치 그룹 사이는 서로 연결되어 있다.In one embodiment, the present application further provides a network on-chip processing system. The system includes a storage device and a plurality of groups of computing devices, the storage device and a plurality of groups of computing devices are installed on the same chip, and each group of computing devices includes a plurality of computing devices. Here, among the plurality of computing device groups, at least one computing device group is connected to the storage device, and at least two or more computing device groups are connected to each other.

In one embodiment, the present application further provides a neural network chip. The chip includes a storage device, a plurality of computing device groups, a first interconnect device, and a second interconnect device. At least one of the plurality of computing device groups is connected to the storage device through the first interconnect device, and the plurality of computing device groups are connected to one another through the second interconnect device. Further, a computing device group may perform read/write operations on the storage device through the first interconnect device, and data may be transferred between the computing device groups through the second interconnect device. The computing devices may be divided into a plurality of groups, and the number of computing devices in each group is not specifically limited. As an example, one group includes four computing devices.

As shown in FIG. 5A, one embodiment provides a network on-chip processing system 1500. The network on-chip processing system 1500 includes a storage device 1501 and six computing devices (computing devices 1502 to 1507), all of which are arranged on the same chip of the network on-chip processing system 1500. For example, the six computing devices may be divided into three groups of two computing devices each: the computing devices 1502 and 1503 belong to a first computing device group (cluster1), the computing devices 1504 and 1505 belong to a second computing device group (cluster2), and the computing devices 1506 and 1507 belong to a third computing device group (cluster3). cluster1 is the main computing device group, and cluster2 and cluster3 are sub computing device groups. Only cluster1 is connected to the storage device 1501, and cluster1, cluster2, and cluster3 are connected to one another. At the device level, the computing device 1502 in cluster1 is connected to the storage device 1501, the computing device 1503 in cluster1 is connected to the computing device 1504 in cluster2, and the computing device 1505 in cluster2 is connected to the computing device 1507 in cluster3.

Specifically, when cluster3 needs to read data, cluster1 accesses the storage device 1501, reads the data required by cluster3, and sends it to cluster2, which then forwards it to cluster3. The computing devices may be divided into a plurality of groups, and the number of computing devices in each group is not specifically limited; preferably, one group includes four computing devices.
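The forwarding order described for FIG. 5A (storage device, then cluster1, then cluster2, then cluster3) is what a shortest-path search yields on a group-level view of the topology. The sketch below is illustrative only; the node names and the `shortest_path` helper are hypothetical, and each cluster is modelled as a single node, which assumes data can move freely between the devices inside a cluster.

```python
from collections import deque

# Group-level adjacency assumed from the FIG. 5A description:
# only cluster1 touches storage, and the clusters form a chain.
FIG_5A = {
    "storage": ["cluster1"],
    "cluster1": ["storage", "cluster2"],
    "cluster2": ["cluster1", "cluster3"],
    "cluster3": ["cluster2"],
}

def shortest_path(graph, src, dst):
    """Breadth-first search returning the first (fewest-hop) path found."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # dst is unreachable from src

path_5a = shortest_path(FIG_5A, "storage", "cluster3")
print(path_5a)
```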

Optionally, not every computing device needs to be connected to the storage device 1501; it suffices that at least one of the computing device groups is connected to the storage device 1501. This is not specifically limited here. Optionally, cluster1 may be connected to cluster2 or to cluster3; it suffices that at least two of the three computing device groups are connected to each other. This is not specifically limited here. Optionally, at least one computing device in each computing device group is connected to at least one computing device in another computing device group. In other words, every computing device of cluster1 may be connected to cluster2, but it suffices that at least one computing device of cluster1 is connected to at least one computing device of cluster2. This is not specifically limited here. Optionally, the plurality of computing device groups are connected to one another through any one computing device within each group. In other words, any computing device of cluster1 may be connected to any computing device of cluster2. This is not specifically limited here.

FIG. 5B shows a network on-chip processing system 15000 according to one embodiment. The network on-chip processing system 15000 includes a storage device 15010 and six computing devices (computing devices 15020 to 15070), all of which are arranged on the same chip of the network on-chip processing system 15000. The six computing devices are again divided into three groups: the computing devices 15020 and 15030 belong to a first computing device group (cluster1), the computing devices 15040 and 15050 belong to a second computing device group (cluster2), and the computing devices 15060 and 15070 belong to a third computing device group (cluster3). cluster1 is the main computing device group, and cluster2 and cluster3 are sub computing device groups. Only cluster1 is connected to the storage device 15010, and cluster1, cluster2, and cluster3 are connected to one another. At the device level, the computing device 15020 of cluster1 is connected to the storage device 15010, the computing device 15030 of cluster1 is connected to the computing device 15040 of cluster2, the computing device 15050 of cluster2 is connected to the computing device 15070 of cluster3, and the computing device 15060 of cluster3 is connected to the computing device 15020 of cluster1.

Specifically, when cluster3 needs to read data, cluster1 may access the storage device 15010, read the data required by cluster3, and send it directly to cluster3. The computing devices may be divided into a plurality of groups, and the number of computing devices in each group is not specifically limited; preferably, one group includes four computing devices.

Optionally, not every computing device needs to be connected to the storage device 15010; it suffices that at least one of the computing device groups is connected to the storage device 15010. This is not specifically limited here. Optionally, cluster1 may be connected to cluster2 or to cluster3; it suffices that at least two of the three computing device groups are connected to each other. This is not specifically limited here. Optionally, at least one computing device in each computing device group is connected to at least one computing device in another computing device group. In other words, every computing device of cluster1 may be connected to cluster2, but it suffices that at least one computing device of cluster1 is connected to at least one computing device of cluster2. This is not specifically limited here. Optionally, the plurality of computing device groups are connected to one another through any one computing device within each group. That is, any computing device of cluster1 may be connected to any computing device of cluster2. This is not specifically limited here.

In the above network on-chip processing system, the plurality of computing device groups arranged on the same chip are connected to one another, which enables inter-group communication. Through inter-group data transfer, the system can reduce the number of computing devices that read the storage device interface at the same time and reduce the energy overhead of memory access. In addition, because the computing device groups on the same chip can communicate through a variety of connection schemes, multiple communication channels are established among the computing devices, and the most suitable channel can be selected for data transfer according to current network congestion, which reduces energy consumption and improves data processing efficiency.
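The source does not say how the most suitable channel is chosen. One possible reading, offered purely as an assumption, is to attach a congestion cost to each inter-group link and pick the cheapest path. The sketch below applies Dijkstra's algorithm to the ring of FIG. 5B; the cost values and the `cheapest_path` helper are hypothetical.

```python
import heapq

def cheapest_path(costs, src, dst):
    """Dijkstra over undirected links; costs maps (a, b) -> congestion cost."""
    graph = {}
    for (a, b), c in costs.items():
        graph.setdefault(a, []).append((b, c))
        graph.setdefault(b, []).append((a, c))
    heap = [(0, src, [src])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for nxt, c in graph.get(node, []):
            if nxt not in done:
                heapq.heappush(heap, (cost + c, nxt, path + [nxt]))
    return None  # no channel available

# Hypothetical congestion costs on the three inter-group links of FIG. 5B.
idle = {
    ("cluster1", "cluster2"): 1,
    ("cluster2", "cluster3"): 1,
    ("cluster3", "cluster1"): 1,
}
busy = dict(idle)
busy[("cluster3", "cluster1")] = 5  # the direct link is congested

print(cheapest_path(idle, "cluster1", "cluster3"))
print(cheapest_path(busy, "cluster1", "cluster3"))
```

With uniform costs the direct cluster1 to cluster3 link wins; once that link is marked congested, the detour through cluster2 becomes cheaper.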

In one embodiment, a network on-chip processing system is provided. The system includes a storage device and a plurality of computing device groups. The storage device and the plurality of computing device groups are arranged on the same chip, and each computing device group includes a plurality of computing devices. At least one of the plurality of computing device groups is connected to the storage device, and the plurality of computing device groups are connected to one another.

FIG. 6 shows a network on-chip processing system 1600 according to one embodiment. The network on-chip processing system 1600 includes a storage device 1601 and six computing devices (computing devices 1602 to 1607), all of which are arranged on the same chip of the network on-chip processing system 1600. For example, the six computing devices may be divided into three groups: the computing devices 1602 and 1603 belong to a first computing device group (cluster1), the computing devices 1604 and 1605 belong to a second computing device group (cluster2), and the computing devices 1606 and 1607 belong to a third computing device group (cluster3). cluster1, cluster2, and cluster3 are all connected to the storage device 1601, cluster1 and cluster2 are connected to each other, and cluster2 and cluster3 are connected to each other. At the device level, the computing devices 1602 to 1607 are all connected to the storage device 1601, the computing device 1603 of cluster1 is connected to the computing device 1604 of cluster2, and the computing device 1604 of cluster2 is connected to the computing device 1607 of cluster3.

Specifically, when cluster3 needs to read data, cluster2 may access the storage device 1601, read the data required by cluster3, and send it to cluster3. Alternatively, cluster1 may access the storage device 1601, read the data required by cluster3, and send it to cluster2, which then forwards it to cluster3. The computing devices may be divided into a plurality of groups, and the number of computing devices in each group is not specifically limited; for example, one group includes four computing devices.

Optionally, not every computing device needs to be connected to the storage device 1601; it suffices that at least one of the computing device groups is connected to the storage device 1601. This is not specifically limited here. Optionally, every computing device of cluster1 may be connected to cluster2 and/or cluster3, but it suffices that at least one computing device of cluster1 is connected to at least one computing device of cluster2 and/or cluster3. This is not specifically limited here. Optionally, any computing device of cluster1 may be connected to any computing device of cluster2 and/or cluster3. This is not specifically limited here.

In the above network on-chip processing system, the plurality of computing device groups arranged on the same chip are connected to one another, so that any data required by a computing device group can be transferred between the computing device groups. The system can therefore reduce the number of computing devices that read the storage device interface at the same time and reduce bandwidth congestion.
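To make the claimed reduction in simultaneous storage readers concrete, the following sketch compares two hypothetical strategies when all three groups of FIG. 6 need the same data: every group reading the storage interface itself, versus one group reading once and passing the data along the inter-group links. The function names and the choice of cluster2 as the reader are assumptions for illustration.

```python
def storage_reads_without_forwarding(requesters):
    """Every requesting group reads the storage interface itself."""
    return len(requesters)

def forward_from_reader(links, reader, requesters):
    """One group reads storage once; the data then spreads over inter-group links."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    reached = {reader}
    frontier = [reader]
    transfers = []
    while frontier:
        next_frontier = []
        for src in frontier:
            for dst in sorted(neighbours.get(src, ())):
                if dst not in reached:
                    reached.add(dst)
                    transfers.append((src, dst))
                    next_frontier.append(dst)
        frontier = next_frontier
    return {
        "storage_reads": 1,
        "transfers": transfers,
        "unreached": set(requesters) - reached,
    }

# Inter-group links as described for FIG. 6.
FIG_6_LINKS = [("cluster1", "cluster2"), ("cluster2", "cluster3")]
groups = ["cluster1", "cluster2", "cluster3"]

plan = forward_from_reader(FIG_6_LINKS, "cluster2", groups)
print(storage_reads_without_forwarding(groups), plan)
```

Under these assumptions the storage interface sees one read instead of three, at the price of two inter-group transfers.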

In one embodiment, a network on-chip processing system is provided. The system includes a storage device and a plurality of computing device groups. The storage device and the plurality of computing device groups are arranged on the same chip, and each computing device group includes a plurality of computing devices. At least one of the plurality of computing device groups is connected to the storage device, and any two of the plurality of computing device groups are directly connected to each other.

FIG. 7 shows a network on-chip processing system 1700 according to one embodiment. The network on-chip processing system 1700 includes a storage device 1701 and six computing devices (computing devices 1702 to 1707), all of which are arranged on the same chip of the network on-chip processing system 1700. The six computing devices are divided into three groups: the computing devices 1702 and 1703 belong to a first computing device group (cluster1), the computing devices 1704 and 1705 belong to a second computing device group (cluster2), and the computing devices 1706 and 1707 belong to a third computing device group (cluster3). cluster1, cluster2, and cluster3 are all connected to the storage device 1701, and the three computing device groups are connected to one another. At the device level, the computing devices 1702 to 1707 are all connected to the storage device 1701, the computing device 1703 of cluster1 is connected to the computing device 1704 of cluster2, the computing device 1704 of cluster2 is connected to the computing device 1707 of cluster3, and the computing device 1702 of cluster1 is connected to the computing device 1706 of cluster3.

Specifically, when cluster3 needs to read data, cluster2 may access the storage device 1701, read the data required by cluster3, and send it to cluster3. Alternatively, cluster1 may access the storage device 1701, read the data required by cluster3, and send it directly from cluster1 to cluster3. The computing devices may be divided into a plurality of groups, and the number of computing devices in each group is not specifically limited; preferably, one group includes four computing devices.
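The condition that any two computing device groups are directly connected amounts to the group-level graph being complete. A minimal check, using hypothetical names and the inter-group links described for FIG. 6 and FIG. 7, might look like this:

```python
from itertools import combinations

def all_pairs_directly_connected(groups, links):
    """True when every pair of groups shares a direct (undirected) link."""
    linked = {frozenset(pair) for pair in links}
    return all(frozenset(pair) in linked for pair in combinations(groups, 2))

GROUPS = ["cluster1", "cluster2", "cluster3"]

# FIG. 6 as described: a chain of groups.
FIG_6_LINKS = [("cluster1", "cluster2"), ("cluster2", "cluster3")]
# FIG. 7 as described: the chain plus a direct cluster1-cluster3 link.
FIG_7_LINKS = FIG_6_LINKS + [("cluster1", "cluster3")]

print(all_pairs_directly_connected(GROUPS, FIG_6_LINKS))
print(all_pairs_directly_connected(GROUPS, FIG_7_LINKS))
```

Only the FIG. 7 arrangement satisfies the check, which is why cluster1 can send data to cluster3 without relaying it through cluster2.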

Optionally, not every computing device needs to be connected to the storage device 1701; it suffices that at least one of the computing device groups is connected to the storage device 1701. This is not specifically limited here. Optionally, every computing device of cluster1 may be connected to cluster2 and/or cluster3, but it suffices that at least one computing device of cluster1 is connected to at least one computing device of cluster2 and/or cluster3. This is not specifically limited here. Optionally, any computing device of cluster1 may be connected to any computing device of cluster2 and/or cluster3. This is not specifically limited here.

In the above network on-chip processing system, the plurality of computing device groups arranged on the same chip are directly connected to one another, which improves data read/write efficiency.

In one embodiment, the present application provides a network on-chip processing system. The system includes a storage device and a plurality of computing device groups. The storage device and the plurality of computing device groups are arranged on the same chip, and each computing device group includes a plurality of computing devices. At least one of the plurality of computing device groups is connected to the storage device, at least two of the computing device groups are connected to each other, and the plurality of computing devices within each computing device group are connected to one another.

FIG. 8 shows a network on-chip processing system 1800 according to one embodiment. The network on-chip processing system 1800 includes a storage device 1801 and six computing devices (computing devices 1802 to 1807), all of which are arranged on the same chip of the network on-chip processing system 1800. The six computing devices are divided into two groups: the computing devices 1802, 1803, and 1804 belong to a first computing device group (cluster1), and the computing devices 1805, 1806, and 1807 belong to a second computing device group (cluster2). cluster1 and cluster2 are both connected to the storage device 1801, cluster1 and cluster2 are connected to each other, the three computing devices within cluster1 are connected to one another, and the three computing devices within cluster2 are connected to one another. At the device level, the computing devices 1802 to 1807 are all connected to the storage device 1801, the computing device 1802 of cluster1 is connected to the computing device 1805 of cluster2, the computing device 1803 is connected to the computing devices 1802 and 1804, and the computing device 1806 is connected to the computing devices 1805 and 1807. For the connection scheme among the computing devices within each computing device group, reference may be made to the connection schemes of the network on-chip processing systems 1100 to 1400, which are not repeated here.

Specifically, when cluster2 needs to read data, it may access the storage device 1801 directly. Alternatively, cluster1 may access the storage device 1801, read the data required by cluster2, and send it to cluster2. At the same time, the second computing device group can also transfer data within the group: when cluster2 needs to read data, the computing devices 1805, 1806, and 1807 of cluster2 may access the storage device 1801 simultaneously, each reading a part of the data required by cluster2, and these parts can then be transferred within cluster2. The computing devices may be divided into a plurality of groups, and the number of computing devices in each group is not specifically limited; preferably, one group includes four computing devices.
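The split read followed by intra-group transfer can be illustrated as follows. The source does not specify how the data is partitioned or exchanged, so the contiguous chunking and the all-to-all exchange below are assumptions, as are the function and variable names.

```python
def partitioned_read(storage, devices):
    """Each device reads one contiguous slice, then slices are shared in-group."""
    n = len(devices)
    chunk = -(-len(storage) // n)  # ceiling division so every element is covered
    # Step 1: simultaneous partial reads from the storage device.
    local = {
        dev: storage[i * chunk:(i + 1) * chunk]
        for i, dev in enumerate(devices)
    }
    # Step 2: intra-group exchange so every device ends up with the full data.
    gathered = []
    for dev in devices:
        gathered.extend(local[dev])
    full = {dev: list(gathered) for dev in devices}
    return local, full

data_in_1801 = list(range(6))          # stand-in for the data cluster2 needs
cluster2 = [1805, 1806, 1807]

local, full = partitioned_read(data_in_1801, cluster2)
print(local)
print(full[1806])
```

Each device touches the storage interface for only a third of the data here, and the remainder arrives over the intra-group links.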

Optionally, not every computing device needs to be connected to the storage device 1801; it suffices that at least one of the two computing device groups is connected to the storage device 1801. This is not specifically limited here. Optionally, every computing device of cluster1 may be connected to cluster2, but it suffices that at least one computing device of cluster1 is connected to at least one computing device of cluster2. This is not specifically limited here. Optionally, any computing device of cluster1 may be connected to any computing device of cluster2. This is not specifically limited here.

In the above network on-chip processing system, the plurality of computing device groups arranged on the same chip are connected to one another and, at the same time, the plurality of computing devices within each computing device group are connected to one another, so that both intra-group and inter-group communication are possible among the computing devices. The system can therefore reduce the energy overhead of memory access and improve data read efficiency.

In one embodiment, the present application further provides a network on-chip processing system. The system includes a plurality of network on-chip processing modules that are connected to one another and arranged on the same chip. Each network on-chip processing module includes at least one storage device and a plurality of computing devices. In each network on-chip processing module, at least one computing device is connected to at least one storage device inside that network on-chip processing module, and at least two of the plurality of computing devices are connected to each other.

In one embodiment, a neural network chip is provided. The chip includes a plurality of network on-chip processing modules connected to one another, and each network on-chip processing module includes at least one storage device, a plurality of computing devices, a first interconnect device, and a second interconnect device. In each network on-chip processing module, at least one computing device is connected through the first interconnect device to at least one storage device inside that network on-chip processing module, and the plurality of computing devices are connected to one another through the second interconnect device. Further, a computing device may perform read/write operations, through the first interconnect device, on the storage device inside the network on-chip processing module to which it belongs, and the plurality of computing devices may also transfer data through the second interconnect device.

As shown in FIG. 9, a network on-chip processing system 1900 according to one embodiment includes four interconnected network on-chip processing modules disposed on the same chip of the network on-chip processing system 1900. Each network on-chip processing module includes one storage device 1901 and four computing devices (computing devices 1902 to 1905). In each network on-chip processing module, the computing device 1902 is connected to the storage device 1901 inside that module, and the four computing devices inside each module are connected to each other.

Specifically, all data to be processed by each network on-chip processing module is stored in the storage device inside that module. In other words, the plurality of computing devices in each network on-chip processing module can access only the storage device inside the module to which they belong, and can read and write data only from and to that storage device.
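
The module-local access rule above can be illustrated with a minimal Python sketch. The names `Module`, `NocSystem`, `read`, and `write` are hypothetical and do not appear in the application; the sketch only assumes four modules, each with one storage device and four computing devices, as in FIG. 9.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """One network on-chip processing module (hypothetical model)."""
    module_id: int
    storage: dict = field(default_factory=dict)  # address -> value
    num_computing_devices: int = 4


class NocSystem:
    """Several modules on one chip; storage is private to each module."""

    def __init__(self, num_modules: int = 4):
        self.modules = [Module(i) for i in range(num_modules)]

    def _check(self, owner_module: int, target_module: int) -> None:
        # A computing device may only touch storage inside its own module.
        if owner_module != target_module:
            raise PermissionError(
                f"device in module {owner_module} cannot access "
                f"storage of module {target_module}"
            )

    def write(self, owner_module, target_module, address, value):
        self._check(owner_module, target_module)
        self.modules[target_module].storage[address] = value

    def read(self, owner_module, target_module, address):
        self._check(owner_module, target_module)
        return self.modules[target_module].storage[address]
```

In this model, data held by another module would have to be obtained through the connections between computing devices rather than by a direct storage access.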

Alternatively, the number of storage devices in each network on-chip processing module is not limited to one and may be two, three, or more; no specific limitation is imposed here, although four is preferred. Alternatively, the plurality of computing devices in each network on-chip processing module are connected to each other to form a computing device network; for the connection scheme among them, reference may be made to the connection schemes of network on-chip processing systems 1100 to 1400, which are not described again here. Alternatively, not all of the computing devices in each network on-chip processing module need to be connected to the storage device 1901; it is sufficient that at least one computing device in each module is connected to the storage device 1901. No specific limitation is imposed here.

Alternatively, every computing device in each network on-chip processing module may be connected to another network on-chip processing module; it is sufficient that at least one computing device in each module is connected to at least one computing device in another module. No specific limitation is imposed here. Alternatively, the plurality of network on-chip processing modules are connected to each other through any one computing device in each module. In other words, any one computing device in each network on-chip processing module may be connected to any one computing device in another network on-chip processing module. No specific limitation is imposed here.

In the above network on-chip processing system, the plurality of network on-chip processing modules disposed on the same chip are connected to each other, and the plurality of computing devices in each module are also connected to each other. The computing devices can therefore communicate both within a module and between modules, which reduces the energy overhead of memory access and improves data read efficiency. In addition, because the modules on the same chip communicate through a variety of connection schemes, multiple communication channels are established among the computing devices, and the most suitable channel can be selected for data transfer according to the current network congestion, thereby saving energy and improving data processing efficiency.
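
Congestion-based channel selection of this kind might be sketched as follows. The application does not state how congestion is measured or how channels are compared, so the per-link congestion values, the additive cost, and the names `path_cost` and `select_channel` are all assumptions made for illustration.

```python
def path_cost(path, congestion):
    """Sum the congestion of every link (pair of consecutive nodes) on a path."""
    return sum(congestion[frozenset(link)] for link in zip(path, path[1:]))


def select_channel(candidate_paths, congestion):
    """Return the candidate path with the lowest total congestion."""
    return min(candidate_paths, key=lambda p: path_cost(p, congestion))


# Hypothetical example: two routes from computing device "A" to "D".
congestion = {
    frozenset(("A", "B")): 5,
    frozenset(("B", "D")): 1,
    frozenset(("A", "C")): 2,
    frozenset(("C", "D")): 2,
}
paths = [["A", "B", "D"], ["A", "C", "D"]]
best = select_channel(paths, congestion)  # the route through "C" costs less
```

A real design could weigh hop count or energy per link instead of, or in addition to, congestion; the additive cost is only the simplest choice.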

In one embodiment, the present application provides a network on-chip processing system. The system includes a plurality of interconnected network on-chip processing modules disposed on the same chip. Each network on-chip processing module includes a plurality of storage devices. In each network on-chip processing module, at least one computing device is connected to the plurality of storage devices inside that module, and at least two of the plurality of computing devices are connected to each other.

As shown in FIG. 10A, a network on-chip processing system 1910 according to one embodiment includes four interconnected network on-chip processing modules disposed on the same chip of the network on-chip processing system 1910. Each network on-chip processing module includes a storage device 1911, a storage device 1916, and four computing devices (computing devices 1912 to 1915). In each network on-chip processing module, the computing device 1912 is connected to the storage device 1911 and the storage device 1916 inside that module, and the four computing devices inside each module are connected to each other.

Specifically, all data to be processed by each network on-chip processing module is stored in the storage devices inside that module. In other words, the plurality of computing devices in each network on-chip processing module can access only the storage devices inside the module to which they belong, and can read and write data only from and to those storage devices. At least one computing device in each network on-chip processing module is connected to all of the storage devices in that module; that is, a computing device in each module can access all of the storage devices in that module. The number of storage devices in each network on-chip processing module is not limited to two and may be three, four, or more; no specific limitation is imposed here, although four is preferred. Data can thus be processed efficiently while the occupied area is kept small.

Specifically, a computing device in each network on-chip processing module preferentially accesses its adjacent storage device. Here, the adjacent storage device is the storage device with the shortest communication distance among the plurality of storage devices connected to the computing device. In other words, the storage device with the shortest communication distance has a higher access priority than the other storage devices.
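
The priority rule above can be sketched as a nearest-first lookup. The distance values and the fallback to a farther storage device when the nearest one does not hold the requested address are assumptions; the application only states that the nearest storage device has the highest access priority. The function names are hypothetical.

```python
def access_order(distances):
    """Return storage ids sorted by communication distance, nearest first."""
    return sorted(distances, key=distances.get)


def read_with_priority(address, distances, storages):
    """Try storage devices nearest-first and return (storage_id, value)."""
    for storage_id in access_order(distances):
        if address in storages[storage_id]:
            return storage_id, storages[storage_id][address]
    raise KeyError(address)


# Hypothetical distances from computing device 1912 to its two storage devices.
distances = {1916: 3, 1911: 1}
storages = {1911: {0: "a"}, 1916: {0: "b", 1: "c"}}
```

With these values, address 0 is served by storage device 1911 because it is nearer, while address 1 is served by storage device 1916 because only that device holds it.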

Alternatively, the plurality of computing devices in each network on-chip processing module are connected to each other to form a computing device network; for the connection scheme among them, reference may be made to the connection schemes of network on-chip processing systems 1100 to 1400, which are not described again here. Alternatively, not all of the computing devices in each network on-chip processing module need to be connected to the storage device 1911; it is sufficient that at least one computing device in each module is connected to the storage device 1911. No specific limitation is imposed here.

Alternatively, every computing device in each network on-chip processing module may be connected to another network on-chip processing module; it is sufficient that at least one computing device in each module is connected to at least one computing device in another module. No specific limitation is imposed here. Alternatively, the plurality of network on-chip processing modules are connected to each other through any one computing device in each module. In other words, any one computing device in each network on-chip processing module may be connected to any one computing device in another network on-chip processing module. No specific limitation is imposed here.

In the above network on-chip processing system, each computing device can access all of the storage devices in the network on-chip processing module to which it belongs, so multiple communication channels are available for data transfer, which improves data read/write efficiency. In addition, because each computing device in the system preferentially accesses its adjacent storage device, memory access overhead is reduced while a certain degree of flexibility is retained.

In one embodiment, in the network on-chip processing system 19100 shown in FIG. 10B, all data to be processed by each network on-chip processing module is stored in the storage devices inside that module. In other words, the plurality of computing devices in each network on-chip processing module can access only the storage devices inside the module to which they belong, and can read and write data only from and to those storage devices. At least one computing device in each network on-chip processing module is configured to be connected to all of the storage devices in that module; that is, a computing device in each module can access all of the storage devices in that module. The number of storage devices in each network on-chip processing module is not limited to two and may be three, four, or more; no specific limitation is imposed here, although four is preferred.

Specifically, in each network on-chip processing module, each computing device is connected to the storage device at a first communication distance, where the first communication distance is the shortest communication distance. In other words, a computing device in each network on-chip processing module can access only its adjacent storage device, that is, the storage device with the shortest communication distance from it. For example, the computing device 19120 can access only its adjacent storage device 19110 and cannot access the storage device 19160. The computing device 19130 can access only its adjacent storage device 19160 and cannot access the storage device 19110. When the data that the computing device 19120 needs to read is stored in the storage device 19160, the computing device 19130 first reads the data from the storage device 19160 and then transfers the data to the computing device 19120.
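
The relayed read in the example above can be sketched as follows. The reference numerals come from the text; the dictionary layout, the keys `"x"` and `"y"`, and the function name `read` are hypothetical, and the lookup of which peer holds the data is an assumption, since the application does not describe how the peer is located.

```python
# Hypothetical wiring for one module of FIG. 10B: each computing device
# is attached only to its nearest storage device.
ATTACHED = {19120: 19110, 19130: 19160}
STORAGE = {19110: {"x": 1}, 19160: {"y": 2}}


def read(requester, key):
    """Return (value, route); route lists the devices the data passes through."""
    own = ATTACHED[requester]
    if key in STORAGE[own]:
        # The data is in the adjacent storage device: read it directly.
        return STORAGE[own][key], [own, requester]
    # Otherwise a peer whose adjacent storage holds the data reads it
    # and forwards it to the requester.
    for peer, storage_id in ATTACHED.items():
        if peer != requester and key in STORAGE[storage_id]:
            value = STORAGE[storage_id][key]
            return value, [storage_id, peer, requester]
    raise KeyError(key)
```

Calling `read(19120, "y")` in this sketch routes the data from storage device 19160 through computing device 19130 to computing device 19120, matching the example in the text.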

Alternatively, the plurality of computing devices in each network on-chip processing module are connected to each other to form a computing device network; for the connection scheme among them, reference may be made to the connection schemes of network on-chip processing systems 1100 to 1400, which are not described again here. Alternatively, not all of the computing devices in each network on-chip processing module need to be connected to the storage device 19110; it is sufficient that at least one computing device in each module is connected to the storage device 19110. No specific limitation is imposed here.

Alternatively, every computing device in each network on-chip processing module may be configured to be connected to another network on-chip processing module; it is sufficient that at least one computing device in each module is connected to at least one computing device in another module. No specific limitation is imposed here. Alternatively, the plurality of network on-chip processing modules are connected to each other through any one computing device in each module. In other words, any one computing device in each network on-chip processing module may be connected to any one computing device in another network on-chip processing module. No specific limitation is imposed here.

In the above network on-chip processing system, each computing device can access all of the storage devices in the network on-chip processing module to which it belongs, so multiple communication channels are available for data transfer, which improves data read/write efficiency. Because each computing device in the system can access only its adjacent storage device, memory access overhead is reduced to the greatest extent.

In one embodiment, the present application provides a network on-chip processing system. The system includes network on-chip processing modules, any two of which are directly connected to each other and are disposed on the same chip. Each network on-chip processing module includes at least one storage device and a plurality of computing devices. In each network on-chip processing module, at least one computing device is connected to at least one storage device inside that module, and at least two of the plurality of computing devices are connected to each other.

As shown in FIG. 11, a network on-chip processing system 1920 according to one embodiment includes four interconnected network on-chip processing modules disposed on the same chip of the network on-chip processing system 1920, and any two of the four modules are directly connected to each other. Each network on-chip processing module includes one storage device 1921 and four computing devices (computing devices 1922 to 1925). In each network on-chip processing module, the computing device 1922 is connected to the storage device 1921 inside that module, and the four computing devices inside each module are connected to each other.

Specifically, all data to be processed by each network on-chip processing module is stored in the storage device inside that module. In other words, the plurality of computing devices in each network on-chip processing module can access only the storage device inside the module to which they belong, and can read and write data only from and to that storage device.

Alternatively, the number of storage devices in each network on-chip processing module is not limited to one and may be two, three, or more; no specific limitation is imposed here, although four is preferred. Alternatively, the plurality of computing devices in each network on-chip processing module are connected to each other to form a computing device network; for the connection scheme among them, reference may be made to the connection schemes of network on-chip processing systems 1100 to 1400, which are not described again here. Alternatively, not all of the computing devices in each network on-chip processing module need to be connected to the storage device 1921; it is sufficient that at least one computing device in each module is connected to the storage device 1921. No specific limitation is imposed here.

Alternatively, every computing device in each network on-chip processing module may be configured to be connected to another network on-chip processing module; it is sufficient that at least one computing device in each module is connected to at least one computing device in another module. No specific limitation is imposed here. Alternatively, the plurality of network on-chip processing modules are connected to each other through any one computing device in each module. That is, any one computing device in each network on-chip processing module may be connected to any one computing device in another network on-chip processing module. No specific limitation is imposed here.

In the above network on-chip processing system, the plurality of network on-chip processing modules disposed on the same chip are connected to each other, and the plurality of computing devices in each module are also connected to each other. The computing devices can therefore communicate within a module, and any two network on-chip processing modules can communicate with each other directly. As a result, the system reduces the number of computing devices that read the storage device interface at the same time, relieves bandwidth congestion, and improves data read/write efficiency through data transfer between modules.
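
The "any two modules are directly connected" topology corresponds to a full mesh over the modules. A minimal sketch, assuming four modules numbered 0 to 3 and hypothetical helper names, enumerates the required links and checks direct reachability:

```python
from itertools import combinations


def full_mesh_links(module_ids):
    """Return one direct link for every unordered pair of modules."""
    return {frozenset(pair) for pair in combinations(module_ids, 2)}


def directly_connected(a, b, links):
    """True when modules a and b share a direct link."""
    return frozenset((a, b)) in links


links = full_mesh_links(range(4))  # four modules need six links
```

The number of links grows as n(n-1)/2 with the number of modules n, which is the cost paid for single-hop communication between every pair of modules.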

In one embodiment, the present application provides a network on-chip processing system. The system includes network on-chip processing modules, any two of which are directly connected to each other and are disposed on the same chip. Each network on-chip processing module includes a plurality of storage devices. In each network on-chip processing module, at least one computing device is connected to the plurality of storage devices inside that module, and at least two of the plurality of computing devices are connected to each other.

As shown in FIG. 12, a network on-chip processing system 1930 according to one embodiment includes four interconnected network on-chip processing modules disposed on the same chip of the network on-chip processing system 1930, and any two of the four modules are directly connected to each other. Each network on-chip processing module includes a storage device 1931, a storage device 1936, and four computing devices (computing devices 1932 to 1935). In each network on-chip processing module, the computing device 1932 is connected to the storage device 1931 and the storage device 1936 inside that module, and the four computing devices inside each module are connected to each other.

Specifically, all data to be processed by each network on-chip processing module is stored in the storage devices inside that module. In other words, the plurality of computing devices in each network on-chip processing module can access only the storage devices inside the module to which they belong, and can read and write data only from and to those storage devices. A computing device in each network on-chip processing module preferentially accesses its adjacent storage device.

Alternatively, the number of storage devices in each network on-chip processing module is not limited to two and may be three, four, or more; no specific limitation is imposed here, although four is preferred. Specifically, at least one computing device in each network on-chip processing module is configured to be connected to all of the storage devices in that module. In other words, a computing device in each network on-chip processing module can access all of the storage devices in that module.

Alternatively, the plurality of computing devices in each network on-chip processing module are connected to each other to form a computing device network; for the connection scheme among them, reference may be made to the connection schemes of network on-chip processing systems 1100 to 1400, which are not described again here. Alternatively, not all of the computing devices in each network on-chip processing module need to be connected to the storage device 1931; it is sufficient that at least one computing device in each module is connected to the storage device 1931. No specific limitation is imposed here.

Alternatively, every computing device in each network on-chip processing module may be connected to another network on-chip processing module; it is sufficient that at least one computing device in each module is connected to at least one computing device in another module. No specific limitation is imposed here. Alternatively, the plurality of network on-chip processing modules are connected to each other through any one computing device in each module. In other words, any one computing device in each network on-chip processing module may be connected to any one computing device in another network on-chip processing module. No specific limitation is imposed here.

In the above network on-chip processing system, each computing device can access all of the storage devices in the network on-chip processing module to which it belongs, and any two network on-chip processing modules can communicate with each other directly. The system therefore provides multiple communication channels for data transfer, which improves data read/write efficiency. Because each computing device in the system preferentially accesses its adjacent storage device, memory access overhead is reduced while a certain degree of flexibility is retained.

In one embodiment, as shown in FIG. 13, the computing device of the network-on-chip processing system is configured to perform machine learning calculations. The computing device includes a controller unit 11 and an operation unit 12, the controller unit 11 is connected to the operation unit 12, and the operation unit 12 includes one master processing circuit and a plurality of slave processing circuits.

The controller unit 11 is used to obtain input data and a calculation instruction. In an optional solution, the input data and the calculation instruction may be obtained through a data input/output unit, which may specifically be one or more data I/O interfaces or I/O pins.

The calculation instruction includes, but is not limited to, a forward operation instruction, a backward training instruction, or another neural network operation instruction such as a convolution operation instruction. The specific embodiments of the present application do not limit the specific form of the calculation instruction.

Specifically, the controller unit 11 is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and to send the plurality of operation instructions and the input data to the master processing circuit.

The operation unit 12 includes one master processing circuit 101 and a plurality of slave processing circuits 102. Here, the master processing circuit 101 preprocesses the input data and exchanges data and operation instructions with the plurality of slave processing circuits;

the plurality of slave processing circuits perform intermediate operations in parallel according to the data and operation instructions transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;

the master processing circuit 101 performs subsequent processing on the plurality of intermediate results to obtain the calculation result of the calculation instruction.

The technical solution provided in the present application arranges the operation unit in a one-master-multiple-slave structure and splits the data according to the calculation instruction of a forward operation. The computation-heavy part can then be processed in parallel by the plurality of slave processing circuits, which increases operation speed, saves operation time, and in turn reduces power consumption.

Alternatively, the computing device further includes a storage unit 10 and a direct memory access unit 50. The storage unit 10 may include one of a register and a cache, or any combination thereof. Specifically, the cache stores the calculation instruction, the register stores the input data and scalars, and the cache is a high-speed scratchpad cache. The direct memory access unit 50 reads data from or stores data into the storage unit 10.

Alternatively, the controller unit includes an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113.

The instruction storage unit 110 stores calculation instructions associated with the artificial neural network operation;

the instruction processing unit 111 parses the calculation instruction to obtain a plurality of operation instructions;

the storage queue unit 113 stores an instruction queue, and the instruction queue includes a plurality of operation instructions and/or calculation instructions to be executed in the order of the queue.

For example, in an optional technical solution, the master processing circuit may further include a controller unit, and this controller unit may include a master instruction processing unit that is specifically used to decode instructions into micro-instructions. In another optional technical solution, the slave processing circuit may further include another controller unit, and this other controller unit includes a slave instruction processing unit that is specifically used to receive and process micro-instructions. A micro-instruction may be the next-level instruction of an instruction. A micro-instruction can be obtained by splitting or decoding an instruction, and can be further decoded into control signals for each component, each unit, or each processing circuit.

In an optional solution, the structure of the calculation instruction may be as shown in Table 1.

Table 1

The ellipsis in the table above indicates that a plurality of registers or immediate values may be included.

In another optional solution, the calculation instruction may include one or more operation fields and one opcode. The calculation instruction may include a neural network operation instruction. Taking the neural network operation instruction as an example, as shown in Table 2, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation fields. Each of register number 0, register number 1, register number 2, register number 3, and register number 4 may be the number of one or more registers.

Table 2

The register may be an off-chip memory and, in practical applications, may also be an on-chip memory. It is used to store data, which may specifically be n-dimensional data, where n is an integer greater than or equal to 1. For example, when n = 1 the data is one-dimensional, that is, a vector; when n = 2 the data is two-dimensional, that is, a matrix; and when n is 3 or more the data is a multidimensional tensor.

Alternatively, the controller unit further includes a dependency processing unit 112. When there are a plurality of operation instructions, the dependency processing unit 112 determines whether a first operation instruction is associated with a zeroth operation instruction that precedes it. If the first operation instruction is associated with the zeroth operation instruction, the first operation instruction is cached in the instruction storage unit and, after the zeroth operation instruction has been executed, the first operation instruction is extracted from the instruction storage unit and transmitted to the operation unit.

Determining whether the first operation instruction is associated with the zeroth operation instruction preceding it includes:

extracting, according to the first operation instruction, a first storage address interval of the data (for example, a matrix) required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the matrix required by the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval have an overlapping region, determining that the first operation instruction is associated with the zeroth operation instruction; and if the first storage address interval and the zeroth storage address interval have no overlapping region, determining that the first operation instruction is not associated with the zeroth operation instruction.
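The association test described above reduces to an interval-overlap check. The following is a minimal Python sketch of that check, not the patented circuit: the `AddrRange` type, the half-open interval convention, and the function names are assumptions introduced for illustration only.

```python
from typing import NamedTuple


class AddrRange(NamedTuple):
    """Hypothetical storage address interval: [start, end), end exclusive."""
    start: int
    end: int


def ranges_overlap(a: AddrRange, b: AddrRange) -> bool:
    """Return True when the two half-open intervals share at least one address."""
    return a.start < b.end and b.start < a.end


def is_associated(first: AddrRange, zeroth: AddrRange) -> bool:
    """The first instruction is associated with the zeroth one when the
    address intervals of their operands overlap, as the text describes."""
    return ranges_overlap(first, zeroth)


def must_wait(first: AddrRange, zeroth: AddrRange, zeroth_done: bool) -> bool:
    """The first instruction stays cached until an associated zeroth
    instruction has finished executing."""
    return is_associated(first, zeroth) and not zeroth_done
```

With this convention, two intervals that merely touch (one ends exactly where the other begins) are treated as not overlapping; a hardware implementation could equally well choose inclusive bounds.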

In another embodiment, as shown in FIG. 14, the operation unit 12 may include one master processing circuit 101 and a plurality of slave processing circuits 102. In one embodiment, as shown in FIG. 14, the plurality of slave processing circuits are arranged in an array, each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to K slave processing circuits among the plurality of slave processing circuits. The K slave processing circuits are the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column. It should be noted that, as shown in FIG. 14, the K slave processing circuits include only the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column. That is, the K slave processing circuits are those slave processing circuits that are directly connected to the master processing circuit.

The K slave processing circuits forward data and instructions between the master processing circuit and the plurality of slave processing circuits.
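To make the selection of the K slave processing circuits concrete, the sketch below enumerates the array positions that are wired directly to the master processing circuit in an m-row by n-column array. The 1-indexed `(row, column)` coordinates and the function name are assumptions made for illustration; the text itself does not say whether the two corner circuits shared by a row and the first column are counted once or twice, so this sketch counts each physical circuit once.

```python
from typing import Set, Tuple


def directly_connected(m: int, n: int) -> Set[Tuple[int, int]]:
    """Return the 1-indexed (row, column) positions of the slave circuits
    connected directly to the master: first row, m-th row, first column."""
    first_row = {(1, col) for col in range(1, n + 1)}
    last_row = {(m, col) for col in range(1, n + 1)}
    first_col = {(row, 1) for row in range(1, m + 1)}
    return first_row | last_row | first_col


# For a 4 x 4 array, 10 distinct circuits are directly connected;
# interior circuits such as (2, 3) are reached only through forwarding.
k_positions = directly_connected(4, 4)
```

Under this counting, an array with m >= 2 has 2n + m - 2 distinct directly connected circuits, and every other slave processing circuit receives its data through one of them.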

Alternatively, as shown in FIG. 15, the master processing circuit includes one of a conversion processing circuit 110, an activation processing circuit 111, and an addition processing circuit 112, or any combination thereof.

The conversion processing circuit 110 converts a data block or intermediate result received by the master processing circuit between a first data structure and a second data structure (for example, between continuous data and discrete data), or converts a data block or intermediate result received by the master processing circuit between a first data type and a second data type (for example, between a fixed-point type and a floating-point type);

the activation processing circuit 111 performs the activation operation on data in the master processing circuit;

the addition processing circuit 112 performs addition or accumulation operations.
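As an illustration of the kind of data type conversion the conversion processing circuit may perform, the following sketch converts between floating-point values and a signed fixed-point representation. The 16-bit width, the 8 fractional bits, the saturation behavior, and both function names are assumptions chosen for the example; the text does not specify a fixed-point format.

```python
def float_to_fixed(x: float, frac_bits: int = 8, total_bits: int = 16) -> int:
    """Quantize x to a signed fixed-point integer (assumed format),
    saturating at the representable range instead of wrapping."""
    scaled = round(x * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def fixed_to_float(q: int, frac_bits: int = 8) -> float:
    """Recover the floating-point value represented by fixed-point integer q."""
    return q / (1 << frac_bits)
```

Values that are exact multiples of 2^-8 survive a round trip unchanged, while values outside the representable range are clamped.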

The master processing circuit determines the input neurons as broadcast data and the weights as distribution data, divides the distribution data into a plurality of data blocks, and sends at least one of the plurality of data blocks and at least one of the plurality of operation instructions to the slave processing circuits;

the plurality of slave processing circuits perform operations on the received data blocks according to the operation instructions to obtain intermediate results, and transmit the operation results to the master processing circuit;

the master processing circuit processes the intermediate results sent by the plurality of slave processing circuits to obtain the result of the calculation instruction, and sends the result of the calculation instruction to the controller unit.

The slave processing circuit includes a multiplication processing circuit.

The multiplication processing circuit performs a multiplication operation on the received data block to obtain a multiplication result.

Alternatively, the slave processing circuit further includes a forwarding processing circuit, which forwards the received data block or the multiplication result.

Alternatively, the slave processing circuit further includes an accumulation processing circuit, which performs an accumulation operation on the multiplication result to obtain the intermediate result.
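The division of labor between the master and slave processing circuits can be sketched for a matrix-vector product: the input neurons are broadcast to every slave, the weight rows are distributed in blocks, each slave multiplies and accumulates, and the master collects the intermediate results. This is an illustrative software model under assumed names (`split_rows`, `slave_compute`, `master_compute`) and an assumed contiguous row partition; the slaves are evaluated one after another here, whereas the circuits described above operate in parallel.

```python
def split_rows(weights, num_slaves):
    """Divide the distribution data (weight rows) into contiguous blocks."""
    size = -(-len(weights) // num_slaves)  # ceiling division
    return [weights[i:i + size] for i in range(0, len(weights), size)]


def slave_compute(block, inputs):
    """Multiply each weight by its input, then accumulate per row."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in block]


def master_compute(weights, inputs, num_slaves):
    """Distribute weight blocks, broadcast inputs, and gather the results."""
    blocks = split_rows(weights, num_slaves)
    partials = [slave_compute(block, inputs) for block in blocks]
    # The master's subsequent processing is plain concatenation in this sketch.
    return [value for partial in partials for value in partial]
```

For example, `master_compute([[1, 2], [3, 4], [5, 6]], [1, 1], 2)` assigns two rows to the first slave and one row to the second, and returns `[3, 7, 11]`.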

In another embodiment, the operation instruction is a calculation instruction such as a matrix-multiply-matrix instruction, an accumulation instruction, or an activation instruction.

The following describes a specific calculation method of the computing device shown in FIG. 1 using a neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed may be:

$$ s = s\left(\sum w x_i + b\right) $$

that is, the weight $w$ is multiplied by the input data $x_i$ and the products are summed, the offset $b$ is then added, and the activation operation $s(h)$ is applied to obtain the final output result $s$.
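The computation just described (multiply the weights by the input data, sum the products, add the offset b, then apply the activation operation) can be written out in a few lines. This is a minimal sketch for a single output; the function name `neuron_output` and the default choice of `math.tanh` as the activation are assumptions, since the text does not name a particular activation function.

```python
import math


def neuron_output(weights, inputs, offset, activation=math.tanh):
    """Weighted sum of the inputs plus an offset, passed through an activation."""
    h = sum(w * x for w, x in zip(weights, inputs)) + offset
    return activation(h)


def relu(h):
    """One possible activation, used here only as an example."""
    return max(0.0, h)


# 0.5 * 2.0 + (-1.0) * 1.0 + 0.25 = 0.25, and relu(0.25) = 0.25
s = neuron_output([0.5, -1.0], [2.0, 1.0], 0.25, activation=relu)
```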

In an optional solution, as shown in FIG. 16, the operation unit includes a tree module 40. The tree module includes one root port 401 and a plurality of branch ports 404. The root port of the tree module is connected to the master processing circuit, and each of the plurality of branch ports of the tree module is connected to one of the plurality of slave processing circuits.

The tree module has both sending and receiving functions. For example, the tree module shown in FIG. 16 performs the sending function, and the tree module shown in FIG. 17 performs the receiving function.

The tree module forwards data blocks, weights, and operation instructions between the master processing circuit and the plurality of slave processing circuits.

Alternatively, the tree module is an optional structure of the computing device and may include at least one layer of nodes. The nodes are linear structures with a forwarding function, and the nodes themselves need not have a computing function. If the tree module has zero layers of nodes, the tree module is not needed.

Alternatively, the tree module may be an n-ary tree structure, for example the binary tree structure shown in FIG. 18, although a ternary tree structure is also possible. Here n is an integer greater than or equal to 2, and the specific embodiments of the present application do not limit the specific value of n. The number of layers may also be 2, and the slave processing circuits may be connected to nodes of layers other than the second-to-last layer, for example to nodes of the last layer shown in FIG. 18.
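The forwarding behavior of an n-ary tree module can be modeled in a few lines. The sketch below assumes a full tree in which every forwarding node has exactly n children and the branch ports sit below the last layer of nodes; that regular shape, along with the function names, is an assumption for illustration, since the text allows slave processing circuits to attach at different layers.

```python
def branch_port_count(n: int, layers: int) -> int:
    """Number of branch ports of a full n-ary tree with `layers` node layers."""
    return n ** layers


def tree_broadcast(payload, n: int, layers: int):
    """Sending direction: copy the payload from the root port to every branch port.
    Nodes only forward; they perform no computation."""
    frontier = [payload]
    for _layer in range(layers):
        frontier = [item for item in frontier for _child in range(n)]
    return frontier


def tree_gather(leaf_results, n: int):
    """Receiving direction: merge groups of n results per node, layer by layer,
    until a single ordered list arrives at the root port."""
    level = [[result] for result in leaf_results]
    while len(level) > 1:
        level = [sum(level[i:i + n], []) for i in range(0, len(level), n)]
    return level[0]
```

With n = 2 and two layers, the module fans one payload out to four branch ports, and the four slave results come back to the root in port order.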

Alternatively, the operation unit may have a separate cache. As shown in FIG. 19, it may include a neuron cache unit 63, which caches the input neuron vector data and the output neuron value data of the slave processing circuits.

As shown in FIG. 20, the operation unit may further include a weight cache unit 64, which caches the weight data required by the slave processing circuits during the calculation.

In an optional embodiment, as shown in FIG. 21, the operation unit 12 may include branch processing circuits 103, connected as shown in FIG. 21.

Here, the master processing circuit 101 is connected to one or more branch processing circuits 103, and each branch processing circuit 103 is connected to one or more slave processing circuits 102;

the branch processing circuit 103 forwards data or instructions between the master processing circuit 101 and the slave processing circuits 102.

The present application further discloses a neural network operation device that includes one or more of the computing devices described in the present application. The neural network operation device obtains the data to be operated on and control information from other processing devices, performs the specified machine learning operation, and passes the execution result to peripheral devices through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one computing device is included, the computing devices may be connected through a specific structure to transmit data, for example interconnected through a PCIe bus, so as to support larger-scale machine learning operations. In this case, the computing devices may share one control system or have independent control systems, and may share memory or each accelerator may have its own memory. In addition, the interconnection may use any interconnect topology.

The neural network operation device has high compatibility and can be connected to various types of servers through a PCIe interface.

The present application further discloses an integrated processing device that includes the neural network operation device, a universal interconnect interface, and other processing devices. The neural network operation device interacts with the other processing devices to jointly complete operations specified by a user. FIG. 22 is a schematic diagram of the integrated processing device.

The other processing devices include one or more types of general-purpose or dedicated processors, such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the neural network operation device and external data and control, which includes data transfer, and perform basic control of the neural network operation device such as starting and stopping it. The other processing devices may also cooperate with the neural network operation device to jointly complete operation tasks.

The universal interconnect interface transmits data and control instructions between the neural network operation device and the other processing devices. The neural network operation device obtains the required input data from the other processing devices and writes it to the on-chip storage device of the neural network operation device; it may obtain control instructions from the other processing devices and write them to the on-chip control cache of the neural network operation device; and it may also read data in the storage module of the neural network operation device and transmit the data to the other processing devices.

Alternatively, as shown in FIG. 23, the structure may further include a storage device connected to both the neural network operation device and the other processing devices. The storage device stores data of the neural network operation device and the other processing devices, and is particularly suited to data to be operated on that cannot be held entirely in the internal storage of the neural network operation device or the other processing devices.

The integrated processing device can be used as the system on chip (SoC) of devices such as mobile phones, robots, drones, and video surveillance equipment, which effectively reduces the die area of the control part, increases processing speed, and lowers overall power consumption. In this case, the universal interconnect interface of the integrated processing device is connected to certain components of the device, such as a camera, display, mouse, keyboard, network card, or Wi-Fi interface.

In some embodiments, the present application further provides a chip that includes the neural network operation device or the integrated processing device.

In some embodiments, the present application further provides a chip package structure that includes the chip.

In some embodiments, the present application further provides a board card that includes the chip package structure. FIG. 24 shows such a board card. In addition to the chip 389, the board card may include other supporting components, which include, but are not limited to, a memory device 390, an interface device 391, and a control device 392.

The memory device 390 is connected through a bus to the chip in the chip package structure and is used to store data. The memory device may include a plurality of groups of memory units 393. Each group of memory units is connected to the chip through a bus. It can be understood that each group of memory units may be DDR SDRAM (double data rate synchronous dynamic random-access memory).

DDR can double the speed of SDRAM without raising the clock frequency, because it allows data to be read on both the rising and falling edges of the clock pulse. DDR is therefore twice as fast as standard SDRAM. In one embodiment, the memory device may include four groups of memory units, and each group of memory units may include a plurality of DDR4 chips. In one embodiment, the chip may contain four 72-bit DDR4 controllers; in each 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of memory units, the theoretical data transmission bandwidth can reach 25600 MB/s.
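The 25600 MB/s figure follows from the transfer rate and the data width: DDR4-3200 performs 3200 million transfers per second, and each transfer carries 64 data bits (8 bytes); the 8 ECC bits carry no payload. The helper below reproduces that arithmetic; its name and parameters are chosen for this example only.

```python
def ddr_bandwidth_mb_s(mega_transfers_per_s: int, data_bits: int) -> int:
    """Theoretical peak bandwidth in MB/s: transfers per second times bytes per transfer."""
    return mega_transfers_per_s * data_bits // 8


# DDR4-3200 on a 64-bit data path (ECC bits excluded)
per_controller = ddr_bandwidth_mb_s(3200, 64)
```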

In one embodiment, each group of memory units includes a plurality of DDR SDRAMs arranged in parallel. DDR can transmit data twice within one clock cycle. A controller for the DDR is provided in the chip to control the data transmission and data storage of each memory unit.

The interface device is electrically connected to the chip in the chip package structure and implements data transmission between the chip and an external device (for example, a server or a computer). For example, in one embodiment the interface device may be a standard PCIe interface, in which case the data to be processed is transferred from the server to the chip through the standard PCIe interface. Optionally, when a PCIe 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may be another type of interface. The present application does not limit the specific form of such other interfaces, as long as the interface unit can implement the transfer function. The calculation result of the chip is likewise transmitted back to the external device (for example, a server) through the interface device.
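The 16000 MB/s figure corresponds to the raw signaling rate of PCIe 3.0, which runs at 8 GT/s per lane: 16 lanes times 8 gigabits per second, divided by 8 bits per byte. PCIe 3.0 also uses 128b/130b line encoding, so the usable rate is slightly lower. The sketch below reproduces both numbers; the function name and parameters are illustrative assumptions.

```python
def pcie_bandwidth_mb_s(gt_per_s: float, lanes: int,
                        payload_bits: int = 1, encoded_bits: int = 1) -> float:
    """Per-direction bandwidth in MB/s; pass payload_bits/encoded_bits to
    account for line encoding (128/130 for PCIe 3.0)."""
    raw_mb_s = gt_per_s * 1000 * lanes / 8
    return raw_mb_s * payload_bits / encoded_bits


raw = pcie_bandwidth_mb_s(8, 16)                 # raw signaling rate
effective = pcie_bandwidth_mb_s(8, 16, 128, 130)  # after 128b/130b encoding
```

The raw value matches the figure quoted in the text, and the encoded value comes to roughly 15754 MB/s.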

The control device is electrically connected to the chip and monitors the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a micro controller unit (MCU). If the chip includes a plurality of processing chips, processing cores, or processing circuits, it can drive a plurality of loads, so the chip can be in different working states such as heavy load and light load. The control device can regulate the working states of the plurality of processing chips, processing cores, or processing circuits in the chip.

In some embodiments, the present application provides an electronic device that includes the board card.

The electronic device includes a data processing device, robot, computer, printer, scanner, tablet, smart terminal, mobile phone, driving recorder, navigator, sensor, webcam, server, cloud server, camera, video camera, projector, watch, headphones, portable storage device, wearable device, vehicle, household appliance, and/or medical equipment.

The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric light, gas stove, and range hood; and the medical equipment includes a nuclear magnetic resonance instrument, a B-mode ultrasound instrument, and/or an electrocardiograph.

In one embodiment, as shown in FIG. 25, a network-on-chip data processing method is provided. The method includes the following steps.

Step 202: access a storage device through a first computing device to obtain first operation data.

Here, the first computing device includes an operation unit and a controller unit, and the operation unit includes one master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in the first computing device obtains the first operation data and a computation instruction from the storage device.

Step 204: perform an operation on the first operation data through the first computing device to obtain a first operation result.

Here, the first computing device operates on the first operation data read from the storage device according to the corresponding computation instruction to obtain the first operation result.

Step 206: send the first operation result to a second computing device.

Here, the controller unit in the first computing device sends the first operation result to the second computing device through a communication channel established between the first computing device and the second computing device. Optionally, the first operation result may be sent to the storage device in addition to the second computing device.

Furthermore, the network-on-chip data processing method according to this embodiment may be applied to any of the network-on-chip processing systems shown in FIGS. 1 to 5.

By sending the first operation result in the first computing device to the second computing device, the above network-on-chip data processing method implements data transfer between a plurality of computing devices. At the same time, by reusing operation data, it avoids the bandwidth overhead that would arise if computing devices accessed the storage device repeatedly. The method makes reasonable use of operation data and intermediate operation results, which improves data processing efficiency.
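As an illustration only, steps 202 to 206 can be sketched in Python as below. The class names `StorageDevice` and `ComputingDevice`, the `inbox` list standing in for the communication channel, and the use of a sum as the operation are all assumptions made for this sketch; the specification does not prescribe them. The read counter shows that the second computing device receives the result without any additional storage access.

```python
from dataclasses import dataclass, field


@dataclass
class StorageDevice:
    """Hypothetical shared storage that counts how often it is read."""
    data: dict
    reads: int = 0

    def read(self, key):
        self.reads += 1
        return self.data[key]


@dataclass
class ComputingDevice:
    """Hypothetical computing device; `inbox` models its incoming channel."""
    name: str
    inbox: list = field(default_factory=list)

    def fetch(self, storage, key):
        # Step 202: obtain first operation data from the storage device.
        return storage.read(key)

    def compute(self, operands):
        # Step 204: stand-in operation (a sum). The real operation is
        # whatever the computation instruction specifies.
        return sum(operands)

    def send(self, result, peer):
        # Step 206: device-to-device transfer over an established channel.
        peer.inbox.append(result)


storage = StorageDevice({"x": [1, 2, 3]})
dev1 = ComputingDevice("dev1")
dev2 = ComputingDevice("dev2")

first_data = dev1.fetch(storage, "x")
first_result = dev1.compute(first_data)
dev1.send(first_result, dev2)
```

In this sketch the storage device is read once, and `dev2` obtains the first operation result directly from `dev1`.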

In one embodiment, as shown in FIG. 26, a network-on-chip data processing method is provided. The method includes the following steps.

Step 302: access a storage device through a first computing device to obtain first operation data.

Step 304: perform an operation on the first operation data through the first computing device to obtain a first operation result.

Step 306: send the first operation result to a second computing device.

Here, the controller unit in the first computing device sends the first operation result to the second computing device through a communication channel established between the first computing device and the second computing device.

Step 308: access the storage device through the second computing device to obtain second operation data.

Here, the second computing device includes an operation unit and a controller unit, and the operation unit includes one master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in the second computing device obtains the second operation data and a computation instruction from the storage device.

Step 310: perform an operation on the second operation data and the first operation result through the second computing device to obtain a second operation result.

Here, the second computing device operates on the second operation data read from the storage device and the first operation result received from the first computing device according to the corresponding computation instruction to obtain the second operation result.

In the above network-on-chip data processing method, the first operation result in the first computing device is sent to the second computing device, and the second computing device then uses the first operation result in a further operation, so that operation data is reused. The method makes reasonable use of operation data and intermediate operation results, which improves data processing efficiency.

In one embodiment, the network-on-chip data processing method shown in FIG. 26 is applied to the network-on-chip processing system 1900 shown in FIG. 9. Here, computing devices 1902 to 1905 are all connected to storage device 1901 in the network-on-chip processing module to which they belong, and any two of computing devices 1902 to 1905 are directly connected to each other.

For example, consider computing one matrix multiplication C = A * B for a matrix A and a matrix B, where each of four elements of C is given as the sum of two terms. [Formula images for matrix A, matrix B, matrix C, and the four element equations are not reproduced.]

First, time is divided into three time intervals.

Next, in the first time interval, computing devices 1902 to 1905 simultaneously access storage device 1901 in the network-on-chip processing module to which they belong.

Specifically, computing device 1902 reads two items of first operation data from storage device 1901, and computing devices 1903, 1904, and 1905 likewise each read two items of first operation data from storage device 1901. [Formula images naming the items read by each device are not reproduced.]

Furthermore, computing device 1902 performs an operation on the two items of first operation data it has read to obtain a first operation result, and computing devices 1903, 1904, and 1905 likewise each perform an operation on the two items of first operation data they have read to obtain their own first operation results. [Formula images naming the operands and results are not reproduced.]

Next, in the second time interval, computing device 1902 reads one item of first operation data from computing device 1903 and one item of first operation data from computing device 1904, and obtains a second operation result through an operation. Computing device 1903 reads one item of first operation data from computing device 1902 and one from computing device 1905, and obtains a second operation result through an operation. Computing device 1904 reads one item of first operation data from computing device 1905 and one from computing device 1902, and obtains a second operation result through an operation. Computing device 1905 reads one item of first operation data from computing device 1904 and one from computing device 1903, and obtains a second operation result through an operation. [Formula images naming the operands and results are not reproduced.]

Next, in the third time interval, computing device 1902 performs an operation on its first operation result and its second operation result to obtain a third operation result, which is the sum of the two, and then sends the third operation result to storage device 1901. Computing devices 1903, 1904, and 1905 likewise each add their own first operation result and second operation result to obtain a third operation result and then send it to storage device 1901. [Formula images naming the results are not reproduced.]
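The three time intervals can be illustrated with the following Python sketch. The matrix size (2x2), the element names `a00` to `b11` and `c00` to `c11`, the sample values, and the assignment of operands to devices are all assumptions chosen for this sketch; the assignment is merely one that is consistent with the device-to-device read pattern described above (1902 reads from 1903 and 1904, and so on), since the original formula images are unavailable.

```python
# Hypothetical 2x2 operands; names and values are assumptions.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
storage = {f"a{i}{j}": A[i][j] for i in range(2) for j in range(2)}
storage.update({f"b{i}{j}": B[i][j] for i in range(2) for j in range(2)})
storage_reads = 0

# Assumed plan: device -> (output element, A operand, B operand read from storage).
plan = {
    1902: ("c00", "a00", "b00"),
    1903: ("c01", "a01", "b11"),
    1904: ("c10", "a11", "b10"),
    1905: ("c11", "a10", "b01"),
}
# Interval 2: device -> (peer holding the needed A operand, peer holding the B operand).
peers = {
    1902: (1903, 1904),
    1903: (1902, 1905),
    1904: (1905, 1902),
    1905: (1904, 1903),
}

# Interval 1: every device reads two operands from storage and multiplies them.
local, first = {}, {}
for dev, (_, a_key, b_key) in plan.items():
    a, b = storage[a_key], storage[b_key]
    storage_reads += 2
    local[dev] = {"a": a, "b": b}
    first[dev] = a * b

# Interval 2: operands come from peer devices, not from storage.
second = {}
for dev, (a_peer, b_peer) in peers.items():
    second[dev] = local[a_peer]["a"] * local[b_peer]["b"]

# Interval 3: add the two partial products and write the result back to storage.
C = {}
for dev, (c_key, _, _) in plan.items():
    C[c_key] = first[dev] + second[dev]
    storage[c_key] = C[c_key]
```

Under these assumptions each of the eight input elements is read from storage once (8 reads in total); if every device fetched all four of its operands from storage, 16 reads would be needed.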

In one embodiment, as shown in FIG. 27, a network-on-chip data processing method is provided. The method includes the following steps.

Step 402: access a storage device through a first computing device group to obtain first operation data, where the first computing device group includes a plurality of first computing devices.

Here, each first computing device in the first computing device group (cluster1) includes an operation unit and a controller unit, and the operation unit includes one master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in cluster1 obtains the first operation data and a computation instruction from the storage device.

Optionally, a plurality of first computing devices in cluster1 access the storage device simultaneously, and each first computing device reads part of the data required by cluster1 from the storage device; this data is then transferred within cluster1. Optionally, one or more first computing devices in cluster1 are designated as able to access the storage device, and the remaining first computing devices can only communicate within the group.

Step 404: perform operations on the plurality of first operation data through the first computing device group to obtain a first operation result.

Here, the plurality of first operation data is operated on and forwarded among the plurality of first computing devices according to the corresponding computation instructions to obtain the first operation result.

Step 406: send the first operation result to a second computing device group.

Here, the controller unit in cluster1 sends the first operation result to cluster2 through a communication channel established between cluster1 and the second computing device group (cluster2).

Optionally, the first operation result may be sent to the storage device in addition to cluster2. Optionally, the first operation result is sent to cluster2 through any first computing device in cluster1 that has an established communication channel with cluster2. Optionally, cluster1 may send the first operation result to any second computing device in cluster2 that has an established communication channel with cluster1.

Furthermore, the network-on-chip data processing method according to this embodiment may be applied to any of the network-on-chip processing systems shown in FIGS. 6 to 8.

The above network-on-chip data processing method implements both intra-group communication and inter-group data transfer among a plurality of computing device groups. It makes reasonable use of operation data and intermediate operation results, which improves data processing efficiency.
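The option in which only designated devices of a group access storage can be sketched as follows. This is a minimal illustration under stated assumptions: the `Storage` class, the function `cluster_first_result`, the round-robin split of reads and of work, and the sum-of-squares operation are hypothetical and are not taken from the specification.

```python
class Storage:
    """Hypothetical storage device that counts read accesses."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def read(self, index):
        self.reads += 1
        return self.data[index]


def cluster_first_result(storage, n_devices, readers):
    """Compute a group's first operation result.

    Only the device ids listed in `readers` touch storage; every other
    device receives its share through intra-group transfer.
    """
    fetched = []
    for i in range(len(storage.data)):
        reader = readers[i % len(readers)]  # designated device doing this read
        fetched.append((reader, storage.read(i)))
    values = [v for _, v in fetched]
    # Intra-group transfer: each device gets a strided share of the data.
    shares = [values[d::n_devices] for d in range(n_devices)]
    # Each device computes a partial result; partials are then combined.
    partials = [sum(x * x for x in share) for share in shares]
    return sum(partials)


storage = Storage([1, 2, 3, 4, 5, 6, 7, 8])
cluster2_inbox = []
result = cluster_first_result(storage, n_devices=4, readers=[0])
cluster2_inbox.append(result)  # step 406: inter-group transfer
```

With a single designated reader, four devices share the work while storage is still read only once per data item.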

In one embodiment, as shown in FIG. 28, a network-on-chip data processing method is provided. The method includes the following steps.

Step 502: access a storage device through a first computing device group to obtain first operation data, where the first computing device group includes a plurality of first computing devices.

Optionally, a plurality of first computing devices in cluster1 access the storage device simultaneously, and each first computing device reads part of the data required by cluster1 from the storage device; this data is then transferred within cluster1. Optionally, one or more first computing devices in cluster1 are designated as able to access the storage device, and the remaining first computing devices can only communicate within the group.

Step 504: perform operations on the plurality of first operation data through the first computing device group to obtain a first operation result.

Step 506: send the first operation result to a second computing device group.

Here, the controller unit in cluster1 sends the first operation result to cluster2 through a communication channel established between cluster1 and the second computing device group (cluster2).

Optionally, the first operation result is sent to cluster2 through any first computing device in cluster1 that has an established communication channel with cluster2. Optionally, cluster1 may send the first operation result to any second computing device in cluster2 that has an established communication channel with cluster1.

Step 508: access the storage device through the second computing device group to obtain second operation data, where the second computing device group includes a plurality of second computing devices.

Here, each second computing device in cluster2 includes an operation unit and a controller unit, and the operation unit includes one master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in cluster2 obtains the second operation data and a computation instruction from the storage device.

Optionally, a plurality of second computing devices in cluster2 access the storage device simultaneously, and each second computing device reads part of the data required by cluster2 from the storage device; this data is then transferred within cluster2. Optionally, one or more second computing devices in cluster2 are designated as able to access the storage device, and the remaining second computing devices can only communicate within the group.

Step 510: perform operations on the second operation data and the first operation result through the second computing device group to obtain a second operation result.

Here, the second operation data read from the storage device and the first operation result received from the first computing device group are operated on and forwarded among the plurality of second computing devices according to the corresponding computation instructions to obtain the second operation result.

In the above network-on-chip data processing method, the first operation result in the first computing device group is sent to the second computing device group, and the second computing device group then uses the first operation result in a further operation, so that operation data is reused. The method makes reasonable use of operation data and intermediate operation results, which improves data processing efficiency.

In one embodiment, as shown in FIG. 29, a network-on-chip data processing method is provided. The method includes the following steps.

Step 602: obtain first operation data through a first network-on-chip processing module, where the first network-on-chip processing module includes a first storage device and a plurality of first computing devices, and the first operation data is stored in the first storage device.

Here, each first computing device in the first network-on-chip processing module includes an operation unit and a controller unit, and the operation unit includes one master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in the first network-on-chip processing module obtains the first operation data and a computation instruction from the first storage device.

Optionally, a plurality of first computing devices in the first network-on-chip processing module access the first storage device simultaneously, and each first computing device reads part of the data required by the first network-on-chip processing module from the first storage device; this data is then transferred within the first network-on-chip processing module.

Optionally, one or more first computing devices in the first network-on-chip processing module are designated as able to access the first storage device, and the remaining first computing devices can only communicate within the group. Specifically, all operation data to be processed by the first network-on-chip processing module is stored in the first storage device.

Step 604: perform operations on the first operation data through the plurality of first computing devices in the first network-on-chip processing module to obtain a first operation result.

Step 606: send the first operation result to a second network-on-chip processing module.

Here, the controller unit in the first network-on-chip processing module sends the first operation result to the second network-on-chip processing module through a communication channel established between the first network-on-chip processing module and the second network-on-chip processing module.

Optionally, the first operation result may be sent to the first storage device in addition to the second network-on-chip processing module. Optionally, the first operation result is sent to the second network-on-chip processing module through any first computing device in the first network-on-chip processing module that has an established communication channel with the second network-on-chip processing module. Optionally, the first network-on-chip processing module sends the first operation result to any second computing device in the second network-on-chip processing module that has an established communication channel with the first network-on-chip processing module.

Furthermore, the network-on-chip data processing method according to this embodiment may be applied to any of the network-on-chip processing systems shown in FIGS. 9 to 12.

The above network-on-chip data processing method implements both intra-group communication and inter-group data transfer among a plurality of computing device groups. It makes reasonable use of operation data and intermediate operation results, which improves data processing efficiency.

In one embodiment, as shown in FIG. 30, a network-on-chip data processing method is provided. The method includes the following steps.

Step 702: obtain first operation data through a first network-on-chip processing module, where the first network-on-chip processing module includes a first storage device and a plurality of first computing devices, and the first operation data is stored in the first storage device.

Here, each first computing device in the first network-on-chip processing module includes an operation unit and a controller unit, and the operation unit includes one master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in the first network-on-chip processing module obtains the first operation data and a computation instruction from the first storage device.

Optionally, one or more first computing devices in the first network-on-chip processing module are designated as able to access the first storage device, and the remaining first computing devices can only communicate within the group. Specifically, all operation data to be processed by the first network-on-chip processing module is stored in the first storage device.

Step 704: perform operations on the first operation data through the plurality of first computing devices in the first network-on-chip processing module to obtain a first operation result.

Step 706: send the first operation result to a second network-on-chip processing module.

Optionally, the first operation result is sent to the second network-on-chip processing module through any first computing device in the first network-on-chip processing module that has an established communication channel with the second network-on-chip processing module. Optionally, the first network-on-chip processing module sends the first operation result to any second computing device in the second network-on-chip processing module that has an established communication channel with the first network-on-chip processing module.

Step 708: obtain second operation data through the second network-on-chip processing module, where the second network-on-chip processing module includes a second storage device and a plurality of second computing devices, and the second operation data is stored in the second storage device.

Here, each second computing device in the second network-on-chip processing module includes an operation unit and a controller unit, and the operation unit includes one master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in the second network-on-chip processing module obtains the second operation data and a computation instruction from the second storage device.

Optionally, a plurality of second computing devices in the second network-on-chip processing module access the second storage device simultaneously, and each second computing device obtains part of the data required by the second network-on-chip processing module from the second storage device; this data is then transferred within the second network-on-chip processing module.

Optionally, one or more second computing devices in the second network-on-chip processing module are designated as able to access the second storage device, and the remaining second computing devices can only communicate within the group. Specifically, all operation data to be processed by the second network-on-chip processing module is stored in the second storage device.

Step 710: perform operations on the second operation data and the first operation result through the plurality of second computing devices in the second network-on-chip processing module to obtain a second operation result.

Here, step 710 specifically includes the following steps.

Step 7102: perform operations on the second operation data and the first operation result among the plurality of second computing devices to obtain the second operation result.

Specifically, each second computing device operates on the second operation data and the first operation result according to the corresponding computation instruction, yielding a plurality of intermediate results. The plurality of intermediate results is then operated on according to the corresponding computation instruction to obtain the second operation result.

Step 7104: store the second operation result in the second storage device.

In the above network-on-chip data processing method, the first operation result in the first network-on-chip processing module is sent to the second network-on-chip processing module, and the second network-on-chip processing module then uses the first operation result in a further operation, so that operation data is reused. The method makes reasonable use of operation data and intermediate operation results, which improves data processing efficiency.
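Steps 7102 and 7104 can be illustrated with the short Python sketch below. It assumes, purely for illustration, that the operation is a dot product between the received first operation result and the second operation data, that the work is split across two second computing devices by index stride, and that the second storage device can be modeled as a dictionary; the function name `module2_compute` is hypothetical.

```python
# First operation result received from the first module (sample values).
first_result = [1.0, 2.0, 3.0, 4.0]
# Hypothetical second storage device holding the second operation data.
second_storage = {"second_data": [0.5, 0.5, 2.0, 1.0]}


def module2_compute(first_result, storage, n_devices=2):
    second_data = storage["second_data"]
    # Step 7102, part 1: each second computing device produces one
    # intermediate result from its share of the operands.
    intermediates = []
    for d in range(n_devices):
        idx = range(d, len(second_data), n_devices)
        intermediates.append(sum(first_result[i] * second_data[i] for i in idx))
    # Step 7102, part 2: the intermediate results are combined.
    second_result = sum(intermediates)
    # Step 7104: store the second operation result in the second storage device.
    storage["second_result"] = second_result
    return intermediates, second_result


intermediates, second_result = module2_compute(first_result, second_storage)
```

Here the first device contributes 6.5 and the second contributes 5.0, so the combined second operation result is 11.5.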

본 출원의 실시예에 따른 네트워크 온칩 처리 방법은 기계 학습 계산에 사용되며, 구체적으로 인공 신경망 연산에 사용될 수 있다. 여기서, 네트워크 온칩 처리 시스템 내의 연산 데이터는 구체적으로 입력 뉴런 데이터 및 가중치를 포함할 수 있으며, 네트워크 온칩 처리 시스템 내의 연산 결과는 구체적으로 인공 신경망 연산의 결과, 즉 출력 뉴런 데이터일 수 있다.The network on-chip processing method according to the embodiment of the present application is used for machine learning calculation, and can be specifically used for artificial neural network calculation. Here, the operation data in the network-on-chip processing system may specifically include input neuron data and weights, and the operation result in the network-on-chip processing system may specifically be an artificial neural network operation result, that is, output neuron data.

An operation in a neural network may be the operation of one layer of the network. For a multi-layer neural network, the process is as follows. In the forward operation, after the previous layer of the artificial neural network has finished executing, the operation command of the next layer takes the output neurons computed in the operation unit as the input neurons of the next layer (or performs certain operations on those output neurons and then uses them as the input neurons of the next layer), and at the same time replaces the weights with the weights of the next layer. In the reverse operation, after the reverse operation of the previous layer has finished, the operation command of the next layer takes the input neuron gradients computed in the operation unit as the output neuron gradients of the next layer (or performs certain operations on those input neuron gradients and then uses them as the output neuron gradients of the next layer), and at the same time replaces the weights with the weights of the next layer.
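The forward hand-off between layers described above can be sketched as follows. This is a minimal illustration only: the function names, the use of plain Python lists, and ReLU as the optional operation applied to the output neurons are assumptions, not details given in this application.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU, used here only as an example post-operation."""
    return [max(0.0, x) for x in v]


def layer_forward(weights: Matrix, inputs: Vector) -> Vector:
    """One layer: multiply the weight matrix by the input neuron vector."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def multilayer_forward(layer_weights: Sequence[Matrix], inputs: Vector,
                       post_op: Callable[[Vector], Vector] = relu) -> Vector:
    """Run the layers in order, feeding each layer's output neurons
    (after the post-operation) to the next layer as its input neurons."""
    neurons = inputs
    for weights in layer_weights:       # the weights are swapped per layer
        outputs = layer_forward(weights, neurons)
        neurons = post_op(outputs)      # outputs become the next layer's inputs
    return neurons
```

The reverse operation would follow the same pattern with gradients in place of neurons; it is omitted here to keep the sketch short.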

The machine learning calculation may further include a support vector machine operation, a k-nearest neighbor (k-nn) operation, a k-means operation, a principal component analysis operation, and the like. For convenience of description, the specific scheme of machine learning calculation is described below using an artificial neural network operation as an example.

For an artificial neural network operation with multiple layers, the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and the output layer of the whole neural network. Rather, for any two adjacent layers of the network, the neurons in the lower layer of the forward operation (the layer computed first) are the input neurons, and the neurons in the upper layer of the forward operation (the layer computed next) are the output neurons. Taking a convolutional neural network as an example, suppose the network has L layers and K = 1, 2, ..., L-1. For layer K and layer K+1, layer K is called the input layer and its neurons are the input neurons, and layer K+1 is called the output layer and its neurons are the output neurons. That is, every layer except the top layer can serve as an input layer, and the layer after it is the corresponding output layer.

In one optional embodiment, taking the fully connected operation among neural network operations as an example, the process may be y=f(wx+b), where x is the input neuron matrix, w is the weight matrix, b is the offset scalar, and f is the activation function, which may specifically be any one of the sigmoid, tanh, relu, and softmax functions. Assuming a binary tree structure with 8 slave processing circuits, the implementation is as follows.

The controller unit obtains the input neuron matrix x, the weight matrix w, and the fully connected operation command from the storage unit, and transmits the input neuron matrix x, the weight matrix w, and the fully connected operation command to the master processing circuit;

the master processing circuit determines the input neuron matrix x as broadcast data and the weight matrix w as distribution data, divides the weight matrix w into 8 sub-matrices, distributes the 8 sub-matrices to the 8 slave processing circuits through the tree module, and broadcasts the input neuron matrix x to the 8 slave processing circuits;

the slave processing circuits perform, in parallel, the multiplication and accumulation operations of the 8 sub-matrices with the input neuron matrix x to obtain 8 intermediate results, and send the 8 intermediate results to the master processing circuit;

the master processing circuit arranges the 8 intermediate results to obtain the operation result of wx, applies the offset b to that operation result, and then performs the activation operation to obtain the final result y; it sends the final result y to the controller unit, and the controller unit outputs the final result y to, or stores it in, the storage unit.
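The four steps above can be sketched in software as follows. This is a hedged illustration, not the circuit itself: it assumes that w is split into contiguous row blocks, that x is a single input vector rather than a matrix, and that sigmoid is the chosen activation. The function names and `NUM_SLAVES` are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]

NUM_SLAVES = 8  # number of slave processing circuits in the example above


def split_rows(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into `parts` contiguous row blocks (trailing blocks may be empty)."""
    size = math.ceil(len(w) / parts)
    return [w[i * size:(i + 1) * size] for i in range(parts)]


def slave_multiply(sub_w: Matrix, x: Vector) -> Vector:
    """Work of one slave circuit: multiply-accumulate its sub-matrix with x."""
    return [sum(a * b for a, b in zip(row, x)) for row in sub_w]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def fully_connected(w: Matrix, x: Vector, b: float,
                    f: Callable[[float], float] = sigmoid) -> Vector:
    """Compute y = f(wx + b) by distributing w and broadcasting x."""
    sub_matrices = split_rows(w, NUM_SLAVES)                  # distribution data
    intermediates = [slave_multiply(s, x) for s in sub_matrices]  # x is broadcast
    wx = [v for part in intermediates for v in part]          # master arranges results
    return [f(v + b) for v in wx]                             # offset, then activation
```

Splitting by rows means the master only has to concatenate the intermediate results in order; no further accumulation across slaves is needed in this variant.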

도 1에 도시된 계산 장치가 신경망 정방향 연산 명령을 수행하는 방법은 구체적으로 다음과 같다.A method for the computing device shown in FIG. 1 to perform a neural network forward computation command is as follows.

컨트롤러 유닛이 명령 저장 유닛으로부터 신경망 정방향 연산 명령, 신경망 연산 명령에 대응하는 동작 도메인 및 적어도 하나 이상의 동작 코드를 추출하고, 컨트롤러 유닛이 상기 동작 도메인을 데이터 액세스 유닛으로 전송하고 상기 적어도 하나 이상의 동작 코드를 연산 유닛으로 송신한다.The controller unit extracts a neural network forward operation command, an operation domain corresponding to the neural network operation command, and at least one or more operation codes from the instruction storage unit, and the controller unit transmits the operation domain to the data access unit and extracts the at least one or more operation codes. sent to the calculation unit.

컨트롤러 유닛은 저장 유닛으로부터 상기 동작 도메인에 대응하는 가중치w 및 오프셋b(b가 0인 경우, 오프셋b를 추출할 필요없음)을 추출하고, 가중치w 및 오프셋b을 연산 유닛의 마스트 처리회로로 전송하고, 컨트롤러 유닛은 저장 유닛으로부터 입력 데이터Xi을 추출하고 상기 입력 데이터Xi를 마스트 처리회로로 송신한다.The controller unit extracts the weight w and the offset b corresponding to the operation domain from the storage unit (if b is 0, there is no need to extract the offset b), and transmits the weight w and the offset b to the master processing circuit of the arithmetic unit. and the controller unit extracts the input data Xi from the storage unit and transmits the input data Xi to the master processing circuit.

The master processing circuit determines a multiplication operation according to the at least one operation code, determines the input data Xi as broadcast data and the weight w as distribution data, and divides the weight w into n data blocks.

The command processing unit of the controller unit determines a multiplication command, an offset command, and an accumulation command according to the at least one operation code, and sends the multiplication command, the offset command, and the accumulation command to the master processing circuit. The master processing circuit broadcasts the multiplication command and the input data Xi to the plurality of slave processing circuits, and distributes the n data blocks to the plurality of slave processing circuits (for example, with n slave processing circuits, each slave processing circuit receives one data block). According to the multiplication command, the slave processing circuits multiply the input data Xi by the received data block to obtain intermediate results, and send the intermediate results to the master processing circuit. According to the accumulation command, the master processing circuit accumulates the intermediate results sent by the slave processing circuits to obtain an accumulation result; according to the offset command, it adds the offset b to the accumulation result to obtain the final result, and then transmits the final result to the controller unit.
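The multiply, accumulate, and offset sequence can be illustrated with the sketch below. How the weight w is partitioned into n data blocks is not stated in this passage, so the sketch assumes column-wise blocks, which is one partition under which accumulating the slaves' intermediate results yields the full product. All names are hypothetical, and w is assumed to be non-empty.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def split_columns(w: Matrix, n: int) -> List[Tuple[int, Matrix]]:
    """Split w into at most n column blocks; return (column offset, block) pairs."""
    cols = len(w[0])
    size = -(-cols // n)  # ceiling division
    return [(start, [row[start:start + size] for row in w])
            for start in range(0, cols, size)]


def slave_multiply(offset: int, block: Matrix, xi: Vector) -> Vector:
    """One slave: multiply its block by the matching part of the broadcast input."""
    return [sum(a * xi[offset + j] for j, a in enumerate(row)) for row in block]


def forward(w: Matrix, xi: Vector, b: float, n: int) -> Vector:
    """Multiply on the slaves, then accumulate and add the offset on the master."""
    partials = [slave_multiply(off, blk, xi) for off, blk in split_columns(w, n)]
    accumulated = [sum(vals) for vals in zip(*partials)]  # accumulation command
    return [v + b for v in accumulated]                   # offset command
```

With this partition each slave holds only part of every weight row, so the master must sum the intermediate results element by element before adding the offset b.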

한편, 덧셈 연산과 곱셈 연산의 순서는 바뀔 수 있다.Meanwhile, the order of the addition operation and the multiplication operation may be changed.

본 출원에 제공되는 기술적 방안은 하나의 명령, 즉 신경망 연산 명령에 의해 신경망의 곱셈 연산과 오프셋 연산을 구현하고, 신경망에서 계산한 중간 결과를 저장하거나 추출할 필요없으므로, 중간 데이터의 저장 및 추출 작업을 생략할 수 있어 대응하는 동작 단계를 생략하고 신경망의 계산 효과를 개선하는 이점이 있다.The technical solution provided in this application implements the multiplication operation and offset operation of the neural network by one command, that is, the neural network operation command, and does not need to store or extract the intermediate result calculated by the neural network, so that the intermediate data is stored and extracted. can be omitted, which has the advantage of omitting the corresponding operation step and improving the computational effect of the neural network.

정보 기술이 끊임없이 발전하고 날로 향상됨에 따라 사람들이 데이터 액세스 및 데이터 처리에 대한 요구가 점점 높아져 일부 데이터를 처리하고 액세스하는 프로세서에 대한 요구도 점점 엄격해지고 있다. 범용 프로세서를 예로 들면, 복수의 범용 프로세서 코어(예를 들어, CPU 코어)로 구성된 멀티 코어 프로세서는 강력한 병행 계산 능력으로 주된 추세가 되었다.With the constant development and improvement of information technology, people's demands for data access and data processing are getting higher and higher, and the requirements for processors that process and access some data are also getting stricter. Taking a general-purpose processor as an example, a multi-core processor composed of a plurality of general-purpose processor cores (eg, CPU cores) has become a major trend due to its powerful concurrent computing capability.

However, as artificial neural networks continue to develop, machine learning chips with more and more architectures are emerging. During operation, such chips need to access or process data in shared storage according to commands. When a large amount of data in shared storage has to be accessed or processed, the commands of the machine learning chip gradually become more complex, which slows the reading of shared-storage data by command and therefore lowers the processing efficiency of neuron data.

따라서, 데이터 액세스시 기계 학습 칩이 액세스하는 속도를 향상하는 것은 현재 당업자가 시급히 해결해야 하는 기술문제이다.Therefore, improving the access speed of machine learning chips when accessing data is currently a technical problem that those skilled in the art urgently need to solve.

상기 문제를 해결하기 위하여 본 출원은 아래 기술적 방안을 제공한다.In order to solve the above problem, the present application provides the following technical solution.

The data processing method according to the present application can be applied to the hardware circuit shown in FIG. 31. The circuit includes a machine learning device 11, a transmission circuit 12, and a shared memory 13. The machine learning device 11 and the transmission circuit 12, and the transmission circuit 12 and the shared memory 13, are connected through interfaces, and the machine learning device 11, the transmission circuit 12, the shared memory 13, and the interfaces may all be implemented as hardware circuits. For example, the machine learning device may be a device with an arithmetic function composed of a plurality of machine learning units (Machine Learning Unit, abbreviated as MLU), the transmission circuit may be a broadcast bus, and the shared memory may be non-volatile and/or volatile memory, including but not limited to random access memory (RAM), cache memory, and the like. This embodiment does not specifically limit the hardware form. The transmission circuit 12 is used to obtain, from the shared memory 13, the input data required by the machine learning device 11 according to the data operation signal sent by the machine learning device 11, and to return the input data to the machine learning device 11. The machine learning device 11 performs a machine learning operation on the input data to obtain output data, and transmits the output data, as new input data, through the transmission circuit 12 to the shared memory 13 for storage.

In one embodiment, FIG. 32 provides a data processing method. This embodiment concerns the specific process in which the transmission circuit determines the type of a data operation signal according to the first-type flag bit and the second-type flag bit of the signal, and then obtains the required data from the memory through the operation corresponding to the determined type, thereby improving access speed. As shown in FIG. 32, the method includes the following steps.

S2101: 내부 또는 외부 장치에서 송신된 데이터 동작 신호를 수신한다. 상기 데이터 동작 신호는 동작 도메인 및 동작 코드를 포함하고, 상기 동작 코드는 제1 유형 플래그 비트를 포함하고, 상기 동작 도메인은 제2 유형 플래그 비트를 포함하며, 상기 제1 유형 플래그 비트는 상기 데이터 동작 신호가 I/O 명령인지 여부를 나타내며, 상기 제2 유형 플래그 비트는 상기 데이터 동작 신호가 상기 I/O 명령 중의 브로드캐스트 또는 멀티캐스트 명령인지 여부를 나타낸다.S2101: Receive a data operation signal sent from an internal or external device. The data operation signal includes an operation domain and an operation code, the operation code includes a first type flag bit, the operation domain includes a second type flag bit, and the first type flag bit comprises the data operation signal. indicates whether the signal is an I/O command, and the second type flag bit indicates whether the data operation signal is a broadcast or multicast command among the I/O commands.

In this embodiment, the transmission circuit receives the data operation signal sent by the internal or external device. The signal carries a first-type flag bit and a second-type flag bit. The internal or external device may be a machine learning device connected to the transmission circuit through an interface, and the machine learning device may take any hardware form, for example a device with an arithmetic function composed of a plurality of MLUs. The transmission circuit determines from the first-type flag bit whether the data operation signal is an I/O command, and determines from the second-type flag bit whether it is a specific type of I/O command. For example, if the value of the first-type flag bit is I/O and the value of the second-type flag bit is 1, the data operation signal is a broadcast or multicast command among the I/O commands.

S2102: According to the data operation signal, perform the corresponding operation on the data to be operated on in the memory to obtain the required input data.

After the transmission circuit receives the data operation signal sent by the internal or external device in step S2101, it performs, according to the type flag bits of the data operation signal, the corresponding operation on the data to be operated on in the memory to obtain the required input data, for example neuron data and weights. The neuron data and weights are the data required by the internal or external device. For example, when the internal or external device is a machine learning device, the neuron data and weights are the data that the machine learning device needs as input for a machine learning operation. The above data may be data stored in the memory in advance, or data output by the machine learning device after performing a machine learning operation. This embodiment is not limited in this regard.

In the data processing method according to this embodiment, the transmission circuit performs the corresponding operation on the data to be operated on in the memory according to a data operation signal that is sent by an internal or external device and carries a first-type flag bit and a second-type flag bit, and thereby obtains the required input data. Because the data operation signal carries the first-type flag bit and the second-type flag bit, the transmission circuit can, after receiving the signal, determine its specific type from these two flag bits and then perform the corresponding operation on the data to be operated on in the memory. Classifying the signal by its type flag bits first allows it to be mapped quickly to the corresponding operation, which simplifies the data access logic, improves data access efficiency, and greatly increases the speed at which the machine learning chip accesses data.
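The two-level classification by type flag bits can be sketched as follows. The field names, the string "I/O" as the first-type flag value, and the enum members are illustrative assumptions; the actual bit-level encoding of the signal is not given in this passage.

```python
from dataclasses import dataclass
from enum import Enum


class SignalKind(Enum):
    NOT_IO = "not an I/O command"
    IO_BROADCAST_OR_MULTICAST = "broadcast or multicast I/O command"
    IO_OTHER = "other I/O command"


@dataclass(frozen=True)
class DataOperationSignal:
    first_type_flag: str   # carried in the operation code, e.g. "I/O"
    second_type_flag: int  # carried in the operation domain, e.g. 1 or 0


def classify(signal: DataOperationSignal) -> SignalKind:
    """Check the first-type flag first, then the second-type flag."""
    if signal.first_type_flag != "I/O":
        return SignalKind.NOT_IO
    if signal.second_type_flag == 1:
        return SignalKind.IO_BROADCAST_OR_MULTICAST
    return SignalKind.IO_OTHER
```

Checking the operation code before the operation domain mirrors the order described above: the first-type flag bit narrows the signal to an I/O command, and only then is the second-type flag bit consulted.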

The following embodiments describe the operation code and the operation domain, and their relationship to the type flag bits of the data operation signal, the information of the data to be operated on, and the data reception flag bits.

일 실시예에서, 상기 동작 도메인은 데이터 수신 플래그 비트를 더 포함하고, 상기 데이터 수신 플래그 비트는 상기 입력 데이터를 수신하는 장치 또는 처리회로를 나타낸다. 대안적으로, 상기 데이터 수신 플래그 비트의 개수는 상기 메모리와 상호 작용가능한 장치의 개수 또는 처리회로의 개수를 나타낸다. 대안적으로, 상기 제1 유형 플래그 비트의 값이 I/O이면, 상기 데이터 동작 신호를 I/O 명령으로 확정하고, 상기 제2 유형 플래그 비트의 값이 1이면, 상기 데이터 동작 신호를 상기 I/O 명령 내의 브로드캐스트 또는 멀티캐스트 명령으로 확정한다.In one embodiment, the operation domain further includes a data reception flag bit, wherein the data reception flag bit indicates a device or processing circuit receiving the input data. Alternatively, the number of data reception flag bits indicates the number of devices or processing circuits capable of interacting with the memory. Alternatively, if the value of the first type flag bit is I/O, the data operation signal is determined as an I/O command, and if the value of the second type flag bit is 1, the data operation signal is determined as the I/O command. It is determined by the broadcast or multicast command in the /O command.

In this embodiment, the operation code of the data operation signal indicates the operation type of the signal and includes the first-type flag bit; the operation domain stores the data information that the signal needs during execution and includes the second-type flag bit. For example, if the value of the first-type flag bit in the operation code is I/O, the data operation signal is an I/O command, and if the value of the second-type flag bit in the operation domain is 1, the signal is a broadcast or multicast command among the I/O commands. Note that treating a second-type flag bit of 1 as marking a broadcast or multicast command is only one implementation; according to the user's actual needs, a value of 0 or another identifier may mark such a command instead. This embodiment is not limited in this regard. The data reception flag bit indicates which device or processing circuit among the internal or external devices can receive the input data (for example, input neuron data and weights). The device may be a machine learning device or an MLU, and the processing circuit may be an arithmetic unit, or a master processing circuit or slave processing circuit of an arithmetic unit; this embodiment is not limited in this regard. The number of data reception flag bits represents the number of devices or processing circuits that can interact with the memory. For example, if the flag bits of three MLUs (machine learning units) among the data reception flag bits in the operation domain are 1, those three MLUs can receive data, and if the flag bit of one MLU is 0, that MLU cannot receive data. Marking an MLU that can receive data with 1 is only an example; the user may mark such an MLU with 0 or another identifier according to actual needs. This embodiment is not limited in this regard.
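The role of the data reception flag bits can be shown with a short sketch. It assumes one flag per MLU, stored in a list indexed by MLU number, and makes the "receiving" value configurable because the text allows 0 or another identifier to be used instead of 1. The function name and layout are hypothetical.

```python
from typing import List, Sequence


def receiving_units(flags: Sequence[int], receive_value: int = 1) -> List[int]:
    """Return the indices of the MLUs whose reception flag marks them as receivers.

    flags[i] is the data reception flag bit of MLU i (an assumed layout).
    """
    return [index for index, bit in enumerate(flags) if bit == receive_value]
```

For instance, with flags `[1, 1, 0, 1]` for MLU0 to MLU3, the call `receiving_units([1, 1, 0, 1])` selects MLU0, MLU1, and MLU3, while MLU2 is excluded.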

본 실시예에서, 전송 회로는 데이터 신호의 제1 유형 플래그 비트 및 제2 유형 플래그 비트에 따라, 상기 데이터 동작 신호의 구체적인 유형을 확정한 후 대응하는 동작에 매핑하고 데이터 수신 플래그 비트에 따라 동작 수행한 후의 데이터를 송신하는 목표 장치를 확정함으로써 데이터 액세스 로직을 간소화하고 데이터의 액세스 효율을 높이므로 기계 학습 칩이 데이터 액세스시의 액세스 속도를 크게 향상시켰다.In this embodiment, the transmission circuit determines the specific type of the data operation signal according to the first type flag bit and the second type flag bit of the data signal, maps it to a corresponding operation, and performs an operation according to the data reception flag bit. By determining the target device to transmit the data after processing, the data access logic is simplified and the data access efficiency is improved, so the machine learning chip greatly improves the access speed when accessing data.

In another embodiment, the operation domain further includes information of the data to be operated on. This information includes the source address of the data to be operated on in the memory, the length of the data to be operated on, and the data return address to be used after the data has been operated on. As shown in FIG. 33, a data processing method is provided. This embodiment concerns the specific process in which the transmission circuit reads data from the memory according to the data information carried by the data operation signal, and then returns the read data to the device or processing circuit according to that information. Step S2102 includes the following steps.

S2201: 상기 소스 주소로부터 상기 메모리를 판독하기 시작하여 상기 데이터 길이를 만족하는 입력 데이터를 획득한다.S2201: Start reading the memory from the source address to obtain input data satisfying the data length.

In this embodiment, because the information of the data to be operated on in the data operation signal includes the source address of that data in the memory, its length, and the data return address to be used after the operation, the transmission circuit starts reading from the source address in the memory and, following a preset rule, reads until the data length is satisfied. The data length is set by the user according to the actual situation, and this embodiment is not limited in this regard. Obtaining input data that satisfies the data length means that the transmission circuit reads data of that length from the memory according to the preset rule. The preset rule is likewise defined by the user according to the actual situation, and this embodiment is not limited in this regard. For example, the data may be read one item at a time starting from the source address until the length of the data read satisfies the data length.
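The example rule mentioned above, reading one item at a time from the source address until the data length is reached, can be sketched as follows. The memory is modeled as a byte string and the unit of reading as one byte; both are assumptions made for illustration, as is the function name.

```python
def read_input_data(memory: bytes, source_address: int, data_length: int) -> bytes:
    """Read `data_length` bytes one at a time, starting at `source_address`."""
    if source_address < 0 or data_length < 0:
        raise ValueError("address and length must be non-negative")
    if source_address + data_length > len(memory):
        raise ValueError("read would run past the end of memory")
    data = bytearray()
    address = source_address
    while len(data) < data_length:   # stop once the data length is satisfied
        data.append(memory[address])
        address += 1
    return bytes(data)
```

A slice such as `memory[source_address:source_address + data_length]` would give the same result; the explicit loop is kept only to mirror the one-by-one rule in the text.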

S2202: 상기 데이터 수신 플래그 비트에 따라 입력 데이터를 수신하는 장치 또는 처리회로를 확정한다.S2202: According to the data reception flag bit, a device or processing circuit for receiving input data is determined.

After the transmission circuit has obtained, in step S2201, input data that satisfies the data length, it determines, according to the data reception flag bit in the data operation signal, the device or processing circuit to which the data is to be returned. For example, when the device is a machine learning device, the transmission circuit determines according to the data reception flag bit that the data is to be returned to one or more target machine learning units in the machine learning device.

S2203: 상기 데이터 리턴 주소에 따라, 상기 입력 데이터를 상기 장치 또는 처리회로 내의 상기 데이터 리턴 주소에 대응하는 저장 공간으로 리턴한다.S2203: According to the data return address, return the input data to the storage space corresponding to the data return address in the device or processing circuit.

In this step, for the device or processing circuit determined in the previous step as the destination of the data, the transmission circuit returns the input data to the storage space in that device or processing circuit corresponding to the data return address, according to the data return address in the information of the data to be operated on carried by the data operation signal. The data return address in the information of the data to be operated on may be an address in each of a plurality of target machine learning units of the machine learning device.

As an example based on the above embodiments, as shown in Table 3: if the value of the first-type flag bit in the operation code is I/O, the data operation signal is an I/O command, and if the value of the second-type flag bit in the operation domain is 1, the data operation signal is a broadcast or multicast command among the I/O commands. Correspondingly, if the value of the second-type flag bit is 0, the data operation signal is not a broadcast or multicast command. The information of the data to be operated on in the operation domain includes the source address 0x110011, the target address 0x000100, and the data length 0x0100. The data length is set by the user, who may set it to a single value or to a plurality of values; this embodiment is not limited in this regard. Among the data reception flag bits in the operation domain, the flags of three MLUs are 1, indicating that those three MLUs can receive data, and the flag of one MLU is 0, indicating that that MLU cannot receive data. Specifically, according to the data operation signal, the transmission circuit reads data of length 0x0100 starting from address 0x110011 in the shared memory and then writes it to address 0x000100 of MLU3, MLU1, and MLU0 of the machine learning device, respectively.

Table 3

In the data processing method of the present embodiment, the transmission circuit starts reading the memory from the source address according to the data operation signal to obtain input data satisfying the data length, determines the device or processing circuit that is to receive the input data according to the data reception flag bits, and then returns the input data to the storage space corresponding to the data return address in that device or processing circuit. In the present embodiment, when obtaining the input neuron data and weights satisfying the data length, the transmission circuit reads the data according to the reading rule indicated by the data operation information in the data operation signal. This simplifies the data reading logic of the transmission circuit and improves data access efficiency, thereby greatly increasing the access speed of the machine learning chip during data access.
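The multicast read described above can be sketched in software as follows. This is only an illustrative model, not the circuit of the embodiment: the `CastSignal` class, the `execute_cast` function, the use of four MLUs, and the choice of MLU2 as the unit whose flag is 0 are all assumptions made for the example (Table 3 is not reproduced in this text). Only the addresses, the data length, and the receiving units MLU3, MLU1, and MLU0 come from the example above.

```python
from dataclasses import dataclass


@dataclass
class CastSignal:
    """Hypothetical decoded form of a broadcast/multicast data operation signal."""
    source_addr: int     # where reading starts in the shared memory
    target_addr: int     # data return address inside each receiving MLU
    length: int          # data length to read
    receive_flags: list  # receive_flags[i] == 1 means MLU i can receive data


def execute_cast(signal, shared_memory, mlu_memories):
    """Read the data once from shared memory, then write it to every flagged MLU."""
    data = shared_memory[signal.source_addr:signal.source_addr + signal.length]
    receivers = [i for i, flag in enumerate(signal.receive_flags) if flag == 1]
    for i in receivers:
        end = signal.target_addr + len(data)
        mlu_memories[i][signal.target_addr:end] = data
    return receivers


# Values from the example: source 0x110011, target 0x000100, length 0x0100.
shared = bytearray(0x110011 + 0x0100)
shared[0x110011:0x110011 + 0x0100] = bytes(range(256))
mlus = [bytearray(0x000100 + 0x0100) for _ in range(4)]  # assumed: four MLUs
sig = CastSignal(0x110011, 0x000100, 0x0100, [1, 1, 0, 1])  # assumed: MLU2 flag is 0
receivers = execute_cast(sig, shared, mlus)
```

In this model the shared memory is read a single time regardless of how many MLUs are flagged, which is the property the broadcast or multicast instruction is meant to provide.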

Optionally, in the embodiment shown in Fig. 33, the device includes at least one machine learning unit, and each machine learning unit includes a master processing circuit and a plurality of slave processing circuits. The data signal operations performed by the at least one machine learning unit (that is, MLU) included in the machine learning device may share one data reception interface, and each machine learning unit may be connected to the transmission circuit through a sending interface or through the shared data reception interface. It should be noted that both the sending interface and the shared data reception interface may be implemented as hardware circuits; the present embodiment places no limitation on the types of the sending interface and the shared data reception interface. In each machine learning unit, the master processing circuit is configured to distribute the input data (neuron data and weights) to the plurality of slave processing circuits, and the plurality of slave processing circuits are configured to perform intermediate operations in parallel on the input data transmitted by the master processing circuit to obtain a plurality of intermediate results and to transmit the plurality of intermediate results to the master processing circuit. In this way, the device can assign each machine learning unit to process the neurons allotted to it and to output the corresponding output data, so that the neural network is computed in parallel layer by layer, which realizes parallel processing of the neural network computation and improves processing efficiency.

On the basis of the above embodiment, the operation domain further includes a jump sub-operation domain, and the jump sub-operation domain includes a jump stride and a jump data length operated on after each jump. As shown in Fig. 34, a data processing method is provided; the present embodiment relates to the specific process by which the transmission circuit reads data in the memory according to the jump sub-operation domain in the operation domain. The above S2201 includes the following steps.

S2301: Start reading the memory from the source address, and obtain first jump data according to the jump data length after the current jump.

In the present embodiment, the operation domain of the data operation signal includes a jump sub-operation domain, which instructs the transmission circuit to follow the rule of that sub-operation domain when reading the information of the data to be operated on according to the data operation signal. Optionally, the jump sub-operation domain includes a stride operation domain and/or a segment operation domain; the stride operation domain indicates the stride of each jump of the data operation signal, and the segment operation domain indicates the preset size of each segment of the data operation signal. It should be noted that the lengths and names of the stride operation domain and the segment operation domain in the embodiments of the present application are merely examples, and the embodiments of the present application are not limited thereto. Here, the jump sub-operation domain includes a jump stride and a jump data length operated on after each jump, and the jump data length may be a preset data length. Specifically, the transmission circuit starts reading the memory from the source address in the information of the data to be operated on and takes the data of the jump data length read after the current jump as the first jump data. The first jump data is the data obtained after the transmission circuit jumps over data of a preset length during reading; the preset length is set by the user according to actual conditions, and the present embodiment places no limitation on this.

S2302: Obtain the last address of the first jump data, and jump from the last address to a target jump address according to the jump stride.

Based on the first jump data read in step S2301, the transmission circuit obtains the last address of the first jump data and, according to the jump stride in the jump sub-operation domain (for example, the stride), jumps from that last address by the length of the jump stride to the target jump address. It should be understood that the distance between the last address of the first jump data and the target jump address is the jump stride in the jump sub-operation domain.

S2303: Starting from the target jump address, obtain second jump data according to the jump data length after the jump, until the length of the jump data obtained after each jump satisfies the data length.

In this step, when reading data, the transmission circuit jumps over data of the preset length starting from the target jump address determined in step S2302 and takes the data obtained after that jump as the second jump data. If the length between the address of the second jump data and the source address at which the jumps started satisfies the data length of the data required by the machine learning device, the data required by the machine learning device has been completely read. If that length does not satisfy the required data length, the transmission circuit continues jumping from the last address of the second jump data, following the jump sequence of steps S2301 to S2303, and reads the corresponding data until the length satisfies the required data length, that is, until the data required by the machine learning device has been completely read.

For example, as shown in Table 4, the transmission circuit reads data in the present embodiment as follows. If the operation domain further includes the stride operation domain of the jump sub-operation domain, the transmission circuit reads the data in the shared memory starting from the source address 0x110011 in the data information: it first reads data of a preset length (the preset length is smaller than the data length 0x0100 in the data information of the table below), then jumps over addresses of the stride length (0x0008), and then reads data of the preset length again. The data is read in this sequence until the total length of the data read reaches the data length 0x0100 in the data information of Table 4, at which point the reading is complete. If the operation domain further includes the segment operation domain of the jump sub-operation domain, the transmission circuit starts reading the data in the shared memory from the source address 0x110011 in the data information: it first reads data of the segment length (0x0010), then jumps over addresses of the stride length (0x0008), and then reads data of the segment length (0x0010) again. The data is read in this sequence until the total length of the data read reaches the data length 0x0100 in the data information of Table 4, at which point the reading is complete. It should be noted that if the jump sub-operation domain contains only the segment operation domain and no stride operation domain, the transmission circuit reads data of the segment length (0x0010) at a time starting from the source address 0x110011, until the total length of the data read reaches the data length 0x0100 in the data information of Table 4 below, at which point the reading is complete.

Table 4

In the data processing method of the present embodiment, the transmission circuit starts reading the shared memory from the source address and obtains the first jump data according to the jump data length after the current jump, jumps from the last address of the first jump data to the target jump address according to the jump stride, and then, starting from the target jump address, obtains the second jump data according to the jump data length after the jump, until the length of the jump data obtained after each jump satisfies the data length. Accordingly, when the operation domain includes a jump sub-operation domain, the transmission circuit reads data according to the jump rule of that sub-operation domain, which simplifies the data reading logic of the transmission circuit and improves data access efficiency, thereby greatly increasing the access speed of the machine learning chip during data access.
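Steps S2301 to S2303 can be modeled as a simple loop. The sketch below is an assumption-laden illustration rather than the embodiment itself: the function name `strided_read` is hypothetical, the stride is interpreted as the gap between the end of one segment and the start of the next, and termination is taken to occur when the total amount of data read reaches the data length, as in the Table 4 example. A small memory and a small source address are used instead of 0x110011 to keep the example compact.

```python
def strided_read(memory, source_addr, data_length, segment, stride=0):
    """Collect data_length bytes by reading `segment` bytes, then skipping `stride` bytes."""
    if segment <= 0:
        raise ValueError("segment must be positive")
    out = bytearray()
    addr = source_addr
    while len(out) < data_length:
        take = min(segment, data_length - len(out))
        chunk = memory[addr:addr + take]
        if len(chunk) < take:
            raise IndexError("read past the end of memory")
        out += chunk
        # Jump from the end of this segment to the next target jump address.
        addr += take + stride
    return bytes(out)


# Segment length 0x0010, stride 0x0008, data length 0x0100 (values from the example).
memory = bytes(i % 256 for i in range(0x400))
result = strided_read(memory, 0x10, 0x0100, segment=0x0010, stride=0x0008)

# Segment-only case: with no stride the read is contiguous.
contiguous = strided_read(memory, 0x10, 0x0100, segment=0x0010)
```

With these values the second segment starts at 0x28, that is, 0x08 past the end of the first segment at 0x20.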

When the transmission circuit operates according to a received data operation signal, the signal as initially received is a coded instruction and must first be decoded and parsed. The embodiments of the present application therefore provide a data processing method. As shown in Fig. 35, the step in which the transmission circuit in the data processing device receives the data operation signal sent by the machine learning device in the data processing device includes the following steps.

S2401: Parse the data operation signal to obtain the type flag bit of the data operation signal and the information of the data to be operated on.

It should be noted that the number of data operation signals in a data processing procedure is generally large, and while the transmission circuit processes one of them, the others must be stored. Specifically, the transmission circuit parses the data operation signal to obtain the data information it carries and its type flag bit. Here, the data operation information may include information such as the length of the data to be operated on, the target address, and the initial address; the present embodiment places no limitation on this.

S2402: Execute the parsed data operation signals according to an instruction queue, where the instruction queue indicates the execution order of the data operation signals.

The data operation signals must be executed sequentially in order. Based on the data operation information and the type flag bit obtained by parsing the data operation signal in step S2401, the transmission circuit executes the parsed data operation signal according to the instruction queue.

In the data processing method of the present embodiment, the transmission circuit parses the data operation signal to obtain the type flag bit of the data operation signal and the information of the data to be operated on, and then executes the parsed data operation signal according to the instruction queue. Because the data operation signals are parsed before being executed and are then executed in order, the speed at which the transmission circuit operates according to the data operation signals is greatly increased.
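The parse-then-queue flow of S2401 and S2402 can be sketched as follows. The encoding of a raw signal as a four-element tuple, the `DecodedSignal` type, and the `decode` and `run_in_order` helpers are all hypothetical; the text above does not specify the instruction encoding, only that the type flag bit and the data information are extracted and that execution follows the queue order.

```python
from collections import deque
from typing import NamedTuple


class DecodedSignal(NamedTuple):
    type_flag: str    # e.g. the type flag bit of the signal
    source_addr: int  # initial address of the data to be operated on
    target_addr: int  # target address
    length: int       # length of the data to be operated on


def decode(raw):
    """Split an assumed (type_flag, src, dst, length) tuple into named fields."""
    type_flag, source_addr, target_addr, length = raw
    return DecodedSignal(type_flag, source_addr, target_addr, length)


def run_in_order(raw_signals, execute):
    """Parse every signal first (S2401), then execute them in queue order (S2402)."""
    queue = deque(decode(raw) for raw in raw_signals)
    executed = []
    while queue:
        signal = queue.popleft()  # first in, first out
        execute(signal)
        executed.append(signal)
    return executed


seen = []
order = run_in_order(
    [("CAST", 0x110011, 0x000100, 0x0100), ("CAST", 0x120000, 0x000200, 0x0080)],
    execute=lambda s: seen.append(s.source_addr),
)
```

Signals that are waiting stay in the queue while one is being executed, which corresponds to the requirement that the other signals be stored during processing.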

Considering that, when the transmission circuit executes data operation signals in queue order, the signals to be executed may be related to one another, the embodiments of the present application provide another embodiment. As shown in Fig. 36, before the transmission circuit executes the parsed data operation signals according to the instruction queue, the method further includes the following steps.

S2501: Determine the dependency between adjacent parsed data operation signals to obtain a determination result. The dependency indicates whether there is an association between the s-th data operation signal and the (s-1)-th data operation signal preceding the s-th data operation signal.

Here, the transmission circuit must determine the dependency between adjacent parsed data operation signals and, according to the determination result, decide whether the two adjacent data operation signals to be processed are related. The s-th data operation signal denotes any one of the data operation signals rather than a specific signal, and the (s-1)-th data operation signal denotes the signal immediately preceding the s-th data operation signal.

Optionally, one way for the transmission circuit to determine the dependency between adjacent parsed data operation signals is as follows. According to the s-th data operation signal, a first storage address interval is obtained, from which the data required by the s-th data operation signal is fetched; according to the (s-1)-th data operation signal, a zeroth storage address interval is obtained, from which the data required by the (s-1)-th data operation signal is fetched. If the first storage address interval and the zeroth storage address interval have an overlapping region, it is determined that the s-th data operation signal and the (s-1)-th data operation signal have a dependency; if the first storage address interval and the zeroth storage address interval have no overlapping region, it is determined that the s-th data operation signal and the (s-1)-th data operation signal have no dependency. In other words, the transmission circuit determines the dependency between adjacent parsed data operation signals from the relationship between the first storage address interval of the s-th data operation signal and the zeroth storage address interval of the (s-1)-th data operation signal.

S2502: If the determination result is that there is a dependency between the s-th data operation signal and the (s-1)-th data operation signal, cache the s-th data operation signal and, after the (s-1)-th data operation signal has been executed, fetch the s-th data operation signal.

The transmission circuit executes the data operation signals in order on the basis of the dependency between two adjacent data operation signals determined in the above step. If the determination result is that there is a dependency between the s-th data operation signal and the (s-1)-th data operation signal, the transmission circuit first caches the s-th data operation signal and fetches it again after execution of the (s-1)-th data operation signal is complete.
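The interval-overlap test of S2501 and the resulting decision of S2502 can be expressed compactly. In the sketch below, the representation of a storage address interval as a half-open `(start, end)` pair and the function names `intervals_overlap` and `dependency_flags` are assumptions made for illustration; the addresses are arbitrary example values.

```python
def intervals_overlap(a, b):
    """Half-open intervals (start, end) overlap when each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def dependency_flags(intervals):
    """flags[s] is True when signal s depends on signal s-1.

    A True flag means signal s would be cached until signal s-1 has finished.
    """
    flags = [False] * len(intervals)
    for s in range(1, len(intervals)):
        flags[s] = intervals_overlap(intervals[s], intervals[s - 1])
    return flags


# Signal 1 touches addresses also used by signal 0; signal 2 is independent of signal 1.
intervals = [(0x1000, 0x1100), (0x10F0, 0x1200), (0x2000, 0x2100)]
flags = dependency_flags(intervals)
```

Only adjacent signals are compared, matching the description above, so a signal with a False flag does not need to wait for its predecessor.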

In the data processing method of the present embodiment, the transmission circuit first determines whether two adjacent data operation signals are related, which ensures the consistency of the data operation signals. This orderly preparation in the earlier stage guarantees the order in which the corresponding operations are subsequently performed according to the data operation signals, thereby improving data access efficiency and greatly increasing the access speed of the machine learning chip during data access.

In addition, the data read by the transmission circuit according to the data operation signal may not be in a format suited to the machine learning device, so the transmission circuit must process the read data in some way before transmitting it to the machine learning device. In view of this, optionally, the operation domain further includes a function flag bit indicating the processing to be performed on the read data. That is, the operation domain of the data operation signal includes a function flag bit, and the function flag bit instructs the transmission circuit to perform the processing corresponding to that function flag bit on the read data. The operation domain may include one function flag bit or a plurality of function flag bits; the present embodiment places no limitation on this. For example, if the function flag bit is a decompression flag bit and the flag is 1, then after reading the data the transmission circuit decompresses the data and transmits it to the designated MLU in the machine learning device. Alternatively, if the function flag bit is an encryption flag bit and the encryption flag bit is 1, then after reading the data the transmission circuit decrypts the data and transmits it to the designated MLU in the machine learning device. In the present embodiment, the transmission circuit first performs the corresponding processing on the read data according to the function flag bit in the operation domain of the data operation signal and then sends the data to the machine learning device, so that the machine learning device can identify the data and operate on it immediately upon receipt. This improves data processing efficiency, thereby greatly increasing the access speed of the machine learning chip during data access.
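The effect of a decompression flag bit can be illustrated with a short software model. Everything specific here is an assumption: the bit position `DECOMPRESS_FLAG`, the function name `postprocess`, and the use of zlib as the compression format are chosen only for the example, since the text above does not name a compression algorithm or a bit layout.

```python
import zlib

DECOMPRESS_FLAG = 0b01  # hypothetical bit position of the decompression flag


def postprocess(data, function_flags):
    """Apply the processing selected by the function flag bits to the data just read."""
    if function_flags & DECOMPRESS_FLAG:
        # zlib stands in for whatever compression format the stored data uses.
        data = zlib.decompress(data)
    return data


payload = b"neuron-data" * 8
stored = zlib.compress(payload)               # data as it sits in memory
delivered = postprocess(stored, DECOMPRESS_FLAG)  # flag set: decompress before sending
untouched = postprocess(payload, 0)           # flag clear: forward as read
```

Further function flag bits could be handled by additional branches of the same form, one per flag.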

In one embodiment, the embodiments of the present application further provide a data processing device. The data processing device includes a processor and a memory, the memory stores a computer program, and the processor, when executing the computer program, performs the following steps:

receiving a data operation signal sent by an internal or external device; and

performing a corresponding operation on the data to be operated on in the memory according to the data operation signal to obtain the required input data, where the data operation signal includes a type flag bit indicating that the data operation signal is a broadcast or multicast instruction.

The implementation principle and technical effects of the data processing device of the present embodiment are similar to those of the above embodiments of the data processing method and are not repeated here.

In one embodiment, Fig. 37 shows a data processing method. The present embodiment relates to the specific process by which the transmission circuit determines the type of the data operation signal according to the type flag bit of the data operation signal and obtains the required data from the memory through the operation corresponding to the determined type, thereby increasing the access speed. As shown in Fig. 37, the method includes the following steps.

S3101: Receive a data operation signal sent by an internal or external device, where the data operation signal includes an operation code, and the operation code includes the type flag bit indicating that the data operation signal is a broadcast or multicast instruction.

In the present embodiment, the transmission circuit receives a data operation signal sent by an internal or external device. The operation code of the data operation signal indicates the operation type of the data operation signal and includes the type flag bit of the data operation signal. Here, the internal or external device may be a machine learning device connected to the transmission circuit through an interface, and the machine learning device may take any hardware form, for example a device with a computing function composed of a plurality of MLUs. The transmission circuit can determine the type of the data operation signal according to the type flag bit carried in the data operation signal. For example, if the value of the type flag bit of the data operation signal is 1, the data operation signal is a broadcast or multicast instruction.

S3102: Perform a corresponding operation on the data to be operated on in the memory according to the data operation signal to obtain the required input data.

Based on the data operation signal received from the internal or external device in step S3101, the transmission circuit performs the corresponding operation on the data to be operated on in the memory according to the type flag bit of the data operation signal to obtain the required input data, for example neuron data and weights. Here, the neuron data and weights are the data required by the internal or external device. For example, when the internal or external device is a machine learning device, the neuron data and weights are the data that the machine learning device needs as input for a machine learning operation. The data may be data stored in the memory in advance or data output by the machine learning device after a machine learning operation; the present embodiment places no limitation on this.

In the data processing method of the present embodiment, the transmission circuit performs the corresponding operation on the data to be operated on in the memory according to a data operation signal that is sent by an internal or external device and carries a type flag bit, and thereby obtains the required input data. Because the data operation signal carries its type flag bit, the transmission circuit, after receiving the data operation signal, determines the specific type of the signal from that type flag bit and then performs the corresponding operation on the data to be operated on in the memory. Classifying the data operation signal by its type flag bit first allows it to be mapped quickly to the corresponding operation, which simplifies the data access logic and improves data access efficiency, thereby greatly increasing the access speed of the machine learning chip during data access.

In one embodiment, the data operation signal further includes an operation domain, and the operation domain includes a data reception flag bit that indicates the device or processing circuit that receives the input data. Optionally, the number of data reception flag bits represents the number of devices or processing circuits that can interact with the memory. Optionally, if the value of the type flag bit is CAST, the data operation signal is determined to be a broadcast or multicast instruction.

In this embodiment, the operation code of the data operation signal indicates the operation type of the signal and includes the type flag bit of the signal. For example, if the type flag bit in the operation code is CAST, the data operation signal is a broadcast or multicast instruction. The operation domain stores the data information that the data operation signal needs during execution, and may include the data reception flag bit. The data reception flag bit indicates which devices or processing circuits, among the internal or external devices, can receive the input data. The device may be a machine learning device or an MLU, and the processing circuit may be an operation unit, or a master processing circuit or slave processing circuit of an operation unit; this embodiment is not limited in this regard. The number of data reception flag bits represents the number of devices or processing circuits that can interact with the memory. For example, if the flag bits of three MLUs (machine learning units) among the data reception flag bits in the operation domain are 1, those three MLUs can receive data, and if the flag bit of one MLU is 0, that MLU cannot receive data. Note that marking an MLU that can receive data with 1 is only an example; the user may instead mark such an MLU with 0 or with another identifier according to actual needs. This embodiment is not limited in this regard.
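By way of illustration only, the decoding of the data reception flag bits can be sketched as follows. The sketch assumes that the flag bits are packed into an integer with bit i corresponding to MLU i; this layout, the function name `receiving_mlus`, and the parameter `active_value` are hypothetical and are not taken from the specification.

```python
def receiving_mlus(flag_bits: int, num_mlus: int, active_value: int = 1) -> list[int]:
    """Return the indices of the MLUs whose reception flag equals active_value.

    Assumption: bit i of flag_bits is the reception flag of MLU i.
    active_value is 1 by default, but may be 0 if the opposite marking is used.
    """
    return [i for i in range(num_mlus) if ((flag_bits >> i) & 1) == active_value]


# Four MLUs: MLU3, MLU1 and MLU0 are flagged 1, MLU2 is flagged 0.
targets = receiving_mlus(0b1011, num_mlus=4)
print(targets)
```

Passing `active_value=0` models the alternative marking in which 0 denotes an MLU that can receive data.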

In this embodiment, the transmission circuit determines the specific type of the data operation signal from its type flag bit and maps the signal to the corresponding operation, and determines from the data reception flag bit the target device to which the data is sent after the operation is performed. This simplifies the data access logic, improves data access efficiency, and greatly increases the access speed of the machine learning chip during data access.

For example, as shown in Table 5 and building on the embodiment above, if the type flag bit in the operation code is CAST, the data operation signal is a broadcast or multicast instruction, and the information on the data to be operated on in the operation domain includes a source address 0x110011, a target address 0x000100, and a data length 0x0100. The data length is set by the user, who may set it to a single value or to multiple values; this embodiment does not limit the specific values or the number of values. If the flags of three MLUs among the data reception flag bits in the operation domain are 1, those three MLUs can receive data, and if the flag of one MLU is 0, that MLU cannot receive data. Specifically, according to the data operation signal, the transmission circuit reads data of length 0x0100 starting at address 0x110011 in the shared memory and then writes it to address 0x000100 of MLU3, MLU1, and MLU0 in the machine learning device, respectively.

Table 5

In the data processing method of this embodiment, the transmission circuit starts reading the memory from the source address according to the data operation signal to obtain input data that satisfies the data length, determines from the data reception flag bit the device or processing circuit that is to receive the input data, and then returns the input data to the storage space corresponding to the data return address in that device or processing circuit. Because the transmission circuit reads the data according to the reading rule indicated by the data operation information in the data operation signal, the data reading logic of the transmission circuit is simplified, data access efficiency is improved, and the access speed of the machine learning chip during data access is greatly increased.
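The CAST example above (source address 0x110011, target address 0x000100, data length 0x0100, receivers MLU3, MLU1, and MLU0) can be modeled in software as a minimal sketch. The class `CastSignal`, the function `execute_cast`, the byte-array memory model, and the bit layout of `recv_flags` are assumptions made for illustration; the specification describes a hardware transmission circuit, not this code.

```python
from dataclasses import dataclass


@dataclass
class CastSignal:
    type_flag: str    # type flag bit, e.g. "CAST"
    source_addr: int  # source address in the shared memory
    target_addr: int  # data return address inside each receiving MLU
    length: int       # data length
    recv_flags: int   # assumed layout: bit i set means MLU i receives


def execute_cast(signal: CastSignal, shared_memory: bytearray,
                 mlu_memories: list[bytearray]) -> list[int]:
    """Read once from shared memory and write to every flagged MLU."""
    if signal.type_flag != "CAST":
        raise ValueError("not a broadcast or multicast signal")
    end = signal.source_addr + signal.length
    data = bytes(shared_memory[signal.source_addr:end])
    written = []
    for i, mem in enumerate(mlu_memories):
        if (signal.recv_flags >> i) & 1:
            mem[signal.target_addr:signal.target_addr + signal.length] = data
            written.append(i)
    return written


shared = bytearray(0x110011 + 0x0100)
shared[0x110011:0x110011 + 0x0100] = bytes(range(256))
mlus = [bytearray(0x200) for _ in range(4)]
sig = CastSignal("CAST", 0x110011, 0x000100, 0x0100, 0b1011)
written = execute_cast(sig, shared, mlus)
print(written)
```

In this model the shared memory is read a single time and the same data is delivered to each flagged MLU, while the unflagged MLU2 is left untouched.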

In one embodiment, as shown in FIG. 31, the data processing device according to an embodiment of the present application may be part or all of what is shown in FIG. 31, and may be implemented in software, hardware, or a combination of the two. The data processing device (10) is used to process machine learning data and includes a machine learning device (11), a transmission circuit (12), and a shared memory (13). The machine learning device (11) is connected to the transmission circuit (12), and the transmission circuit (12) is connected to the shared memory (13). According to a data operation signal issued by the machine learning device (11), the transmission circuit (12) obtains the input data required by the machine learning device (11) from the shared memory (13) and returns the input data to the machine learning device (11). The data operation signal carries a type flag bit of the data operation signal and information on the data to be operated on. Optionally, the machine learning device (11) is used to perform a machine learning operation on the input data to obtain output neuron data. Optionally, the machine learning device (11) is also used to take the output neuron data as new input neuron data and transmit it through the transmission circuit (12) to the shared memory (13) for storage.

Note that the machine learning device, the transmission circuit, and the shared memory may all be implemented as hardware circuits. For example, the machine learning device is a device with an operation function that consists of a plurality of machine learning units (MLUs), the transmission circuit may be a broadcast bus, and the shared memory may be non-volatile and/or volatile memory, including but not limited to random access memory (RAM) and cache memory. Data is transmitted among the machine learning device, the transmission circuit, and the shared memory through interfaces. For example, the machine learning device may send a data operation signal through an interface, and may also send or receive data through an interface. Accordingly, an interface may be a sending interface or a receiving interface: when it is a sending interface, the machine learning device can send a data operation signal or data to the transmission circuit, and when it is a receiving interface, the machine learning device can receive a data operation signal or data sent by the transmission circuit. The interface may be of various kinds, all of which may be implemented as hardware circuits. This embodiment does not limit the specific hardware form of these interfaces, provided that they enable data signal interaction among the machine learning device, the transmission circuit, and the shared memory. The input data is the data that the machine learning device needs as input when performing a machine learning operation, for example input neuron data and weights. The data may be stored in the shared memory in advance, or may be data output by the machine learning device after a machine learning operation. Optionally, the machine learning device may obtain the data by connecting directly to the shared memory through a plurality of data I/O interfaces or I/O pins. Optionally, the machine learning device may instead connect to the transmission circuit through a plurality of data I/O interfaces or I/O pins and then connect to the shared memory through the transmission circuit to obtain the data.

The data operation signal may indicate a read operation by the transmission circuit on data in the shared memory, or a write operation on data in the shared memory. When the data operation signal sent by the machine learning device is a read operation, the transmission circuit finds and reads the input data at the corresponding address in the shared memory and returns that data to the machine learning device that sent the signal. When the data operation signal sent by the machine learning device is a write operation, the transmission circuit writes the data output by the machine learning device into the shared memory. The data operation signal includes a type flag bit and information on the data to be operated on, and the type flag bit indicates the type of the data operation signal. For example, if the type flag bit is CAST, the data operation signal is a broadcast or multicast instruction. The information on the data to be operated on indicates the data that the transmission circuit needs when it performs the corresponding operation according to the data operation signal. This embodiment does not limit the specific form of the type flag bit or the specific data information contained in the information on the data to be operated on; both may be determined according to the actual situation.

Note that the data processing device of the present application is applied to machine learning operations, which include neural network operations, k-means operations, support vector machine operations, and the like. Taking neural network operations as an example, the operation performed by the machine learning device may be the operation of one layer of a neural network, and a multi-layer neural network is implemented as follows. In the forward operation, after the previous layer of the artificial neural network finishes executing, the operation instruction of the next layer takes the output neuron data computed by the operation unit as the input neuron data of the next layer (or performs certain operations on the output neuron data before using it as the input neuron data of the next layer), and at the same time replaces the weights with the weights of the next layer. In the backward operation, after the backward operation of the previous layer finishes, the operation instruction of the next layer takes the input neuron gradient computed by the operation unit (which may likewise be treated as input neuron data) as the output neuron gradient of the next layer (which may likewise be treated as output neuron data), or performs certain operations on the input neuron gradient before using it as the output neuron gradient of the next layer, and at the same time replaces the weights with the weights of the next layer. Optionally, the neural network in the embodiments of the present application may be an artificial neural network or a spiking neural network; this embodiment is not limited in this regard. The machine learning device of this embodiment performs machine learning operations on the input data. For example, for a multi-layer neural network, the machine learning device may compute the neuron data output by each layer, applying to the plurality of input data at the input of each layer the set of operations that make up a machine learning operation, such as multiplication, summation, and function operations. After obtaining the output neuron data of the current layer, the machine learning device uses it as the input neuron data of the next layer and performs the machine learning operation again. Before doing so, it may first write the output neuron data of the current layer to the shared memory through the transmission circuit, so that the machine learning device can read the data at any time for machine learning operations.
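The layer-by-layer forward operation described above can be sketched as follows, purely for illustration. The sigmoid activation, the dictionary used to model the shared memory, and the names `layer_forward` and `run_forward` are assumptions; the specification does not fix a particular function operation or storage layout.

```python
import math


def layer_forward(inputs, weights, bias):
    """One layer: multiplication, summation, then a function operation.

    The sigmoid used here is only an assumed example of a function operation.
    """
    outputs = []
    for row, b in zip(weights, bias):
        s = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(1.0 / (1.0 + math.exp(-s)))
    return outputs


def run_forward(inputs, layers, shared_memory):
    """Chain layers: each layer's output becomes the next layer's input."""
    neurons = inputs
    for idx, (weights, bias) in enumerate(layers):
        # The weights are replaced by those of the current layer on each step.
        neurons = layer_forward(neurons, weights, bias)
        # Model of writing the output neuron data back to shared memory.
        shared_memory[f"layer{idx}_out"] = list(neurons)
    return neurons


layers = [
    ([[1.0, -1.0]], [0.0]),  # layer 0: two inputs, one output
    ([[2.0]], [-1.0]),       # layer 1: one input, one output
]
memory = {}
result = run_forward([2.0, 2.0], layers, memory)
print(result, memory)
```

Storing each layer's output under its own key mirrors the write-back step, in which the output neuron data is saved so that it can be read again as input for the following layer.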

Specifically, in practice the transmission circuit obtains the input data required by the machine learning device from the shared memory according to the data operation signal sent by the machine learning device, and returns the input data to the machine learning device through the receiving interface. The machine learning device then performs a machine learning operation on the input data to obtain output data, and transmits the output data, as new input data, through the transmission circuit to the shared memory for storage. Because the data operation signal carries its type flag bit and the information on the data to be operated on, the transmission circuit, after receiving the signal, determines the type of the signal from the type flag bit and then performs the operation in combination with the information on the data to be operated on. Classifying signals by type flag bit first allows each signal to be mapped quickly to its corresponding operation, which simplifies the data access logic, improves data access efficiency, and greatly increases the access speed of the machine learning chip during data access.

In one embodiment, as shown in FIG. 38, in the data processing device according to an embodiment of the present application, the machine learning device (11) includes at least one machine learning unit (14), and the data operation signal further includes a data reception flag bit that indicates the target machine learning unit that is to receive the input data.

The data signal operations performed by the at least one machine learning unit (MLU) in the machine learning device may share a single data reception interface, and each machine learning unit may be connected to the transmission circuit through a sending interface or through the shared data reception interface. Note that both the sending interface and the shared data reception interface may be implemented as hardware circuits; this embodiment does not limit their types. The data operation signal includes a data reception flag bit that indicates the target machine learning units that can receive the input data. For example, a target machine learning unit that can receive the input data is marked 1, and one that cannot is marked 0. Marking a receiving unit with 1 is only one possible convention; in practice a unit that can receive data may be marked 0 and a unit that cannot may be marked 1. This embodiment does not limit the specific marking format of the data reception flag bit.

In this embodiment, the target machine learning units in the machine learning device that can receive the input data are determined from the marking of the data reception flag bits carried in the data operation signal. Because whether each machine learning unit receives data is decided by the data reception flag bits in the data operation signal, the access logic of the memory access process is simplified, data access efficiency is improved, and the access speed of the machine learning chip during data access is greatly increased.

The following embodiments describe the relationships among the type flag bit of the data operation signal, the information on the data to be operated on, and the data reception flag bit.

In one embodiment, when the value of the type flag bit of the data operation signal includes CAST, the data operation signal is a broadcast or multicast instruction. Optionally, the information on the data to be operated on includes the source address of the data in the shared memory, the length of the data to be operated on, and the data return address to which the data is returned after the operation.

In this embodiment, the type flag bit of the data operation signal indicates the operation type of the signal. For example, as shown in Table 6, if the type flag bit is CAST, the data operation signal is a broadcast or multicast instruction, and the information on the data to be operated on includes a source address 0x110011, a target address 0x000100, and a data length 0x0100. The data length is set by the user, who may set it to a single value or to multiple values; this embodiment does not limit the specific values or the number of values. If three MLU flags among the data reception flag bits are 1, those three MLUs can receive data, and if one MLU flag is 0, that MLU cannot receive data. Specifically, according to the data operation signal, the transmission circuit reads data of length 0x0100 starting at address 0x110011 in the shared memory and then writes it to address 0x000100 of MLU3, MLU1, and MLU0 in the machine learning device, respectively.

Table 6

In another embodiment, the type flag bit of the data operation signal may include a first type flag bit and a second type flag bit. Optionally, when the value of the first type flag bit includes I/O, the data operation signal is an I/O instruction, and the second type flag bit indicates that the data operation signal is a broadcast or multicast instruction among the I/O instructions.

In this embodiment, the data operation signal includes two type flag bits. The first type flag bit indicates the type of the data operation signal, and the second type flag bit is placed in the operation information of the data operation signal and indicates the specific subtype of the signal. The data reception flag bit, as in the embodiment described above, indicates the target machine learning units that can receive the input data. For example, as shown in Table 7, if the value of the first type flag bit is I/O, the data operation signal is an I/O instruction, and if the value of the second type flag bit is 1, the data operation signal is a broadcast or multicast instruction among the I/O instructions. Conversely, if the value of the second type flag bit is 0, the data operation signal is not a broadcast or multicast instruction. The information on the data to be operated on includes a source address 0x110011, a target address 0x000100, and a data length 0x0100. The data length is set by the user, who may set it to a single value or to multiple values; this embodiment is not limited in this regard. If three MLU flags among the data reception flag bits are 1, those three MLUs can receive data, and if one MLU flag is 0, that MLU cannot receive data. Specifically, according to the data operation signal, the transmission circuit reads data of length 0x0100 starting at address 0x110011 in the shared memory and then writes it to address 0x000100 of MLU3, MLU1, and MLU0 in the machine learning device, respectively.
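The two-level classification by first and second type flag bits can be illustrated with the short sketch below. The function name `classify_signal` and the returned labels are hypothetical and serve only to show the order in which the two flags are examined.

```python
def classify_signal(first_type_flag: str, second_type_flag: int) -> str:
    """Classify a data operation signal by its two type flag bits.

    The returned labels are hypothetical names used only in this sketch.
    """
    if first_type_flag != "I/O":
        return "not an I/O instruction"
    if second_type_flag == 1:
        return "I/O broadcast or multicast instruction"
    return "I/O instruction, not broadcast or multicast"


kind = classify_signal("I/O", 1)
print(kind)
```

The first flag selects the instruction family, and the second flag is consulted only within that family to select the subtype.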

Table 7

In another embodiment, on the basis of Table 1 or Table 2, the data operation signal may further include jump information. The jump information includes a jump width and the length of the data operated on after each jump. Optionally, the jump information includes stride jump information and/or segment jump information.

In this embodiment, the jump information carried in the data operation signal instructs the transmission circuit to read the data to be operated on according to the rule given by the jump information. The specific reading method is as follows. The transmission circuit starts reading data in the shared memory from the source address given in the information on the data to be operated on, and determines the data of the jump data length that it reads as the first jump data. The transmission circuit then obtains the last address of the first jump data and, according to the jump width in the jump information, jumps from that last address over data of the jump-width length to the target jump address. It should be understood that the length between the last address of the first jump data and the target jump address is exactly the jump width in the jump information. Next, starting from the target jump address, the transmission circuit takes data of the preset length and determines it as the second jump data. If the length between the address of the second jump data and the source address at which the jumping started satisfies the data length of the data required by the machine learning device, the reading of the data required by the machine learning device is complete. If it does not, the transmission circuit continues jumping from the last address of the second jump data in the same order and keeps reading until the length between the address of the most recent jump data and the source address satisfies the data length of the data required by the machine learning device, that is, until the reading of the required data is complete.
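One possible reading of the jump-based access described above is sketched below. It assumes that reading stops once the number of bytes collected reaches the requested length; the specification states the termination condition in terms of address distance from the source address, so this is only one interpretation. The function name `strided_read` and its parameters are hypothetical.

```python
def strided_read(memory: bytes, source_addr: int, total_length: int,
                 chunk_length: int, jump_width: int) -> bytes:
    """Read chunk_length bytes, skip jump_width bytes, and repeat.

    Assumption: stop once total_length bytes have been collected.
    """
    if chunk_length <= 0 or jump_width < 0:
        raise ValueError("chunk_length must be positive and jump_width non-negative")
    out = bytearray()
    addr = source_addr
    while len(out) < total_length:
        take = min(chunk_length, total_length - len(out))
        if addr + take > len(memory):
            raise IndexError("read past the end of memory")
        out += memory[addr:addr + take]
        # Jump from the end of this chunk over jump_width bytes.
        addr += chunk_length + jump_width
    return bytes(out)


memory = bytes(range(32))
data = strided_read(memory, source_addr=2, total_length=6,
                    chunk_length=2, jump_width=3)
print(list(data))
```

With a chunk length of 2 and a jump width of 3, the chunks are taken at addresses 2, 7, and 12, so consecutive chunk start addresses are 5 bytes apart.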

In general, the data processing device according to the embodiments of the present application must parse a data operation signal before performing the read/write operation that the signal specifies. Optionally, the transmission circuit includes: an instruction storage unit configured to store the data operation signals; an instruction processing unit configured to parse a data operation signal to obtain its type flag bit and the information of the to-be-operated data; and a storage queue unit configured to store an instruction queue. The instruction queue contains a plurality of data operation signals to be executed in the order of the queue. In a typical data processing procedure the number of data operation signals is relatively large, so while one data operation signal is being processed, the others need to be held in the instruction storage unit. While parsing a data operation signal, the instruction processing unit also parses the data information carried in that signal. In addition, because fetching, decoding, and issuing a data operation signal form a pipeline of steps that every data operation signal must pass through in order, the instruction queue is stored in the storage queue unit.

Furthermore, because the instruction processing unit can process the next data operation signal in the queue only after it has finished processing the current one, the data operation signal whose processing has just completed and the next data operation signal are necessarily related. Optionally, the transmission circuit further includes a dependency processing unit configured to determine whether an association exists between the s-th data operation signal and the (s-1)-th data operation signal that precedes it. If an association exists between the s-th data operation signal and the (s-1)-th data operation signal, the s-th data operation signal is cached in the instruction storage unit and, after the (s-1)-th data operation signal has been executed, is fetched from the instruction storage unit and transmitted to the instruction processing unit. Determining whether an association exists between the s-th data operation signal and the (s-1)-th data operation signal includes the following. A first storage address interval of the data required by the s-th data operation signal is extracted according to the s-th data operation signal, and a zeroth storage address interval of the data required by the (s-1)-th data operation signal is extracted according to the (s-1)-th data operation signal. If the first storage address interval and the zeroth storage address interval overlap, the s-th data operation signal and the (s-1)-th data operation signal are determined to be associated; if the two intervals do not overlap, the two signals are determined not to be associated.
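
The overlap test performed by the dependency processing unit can be illustrated with a short Python sketch. The type `AddressInterval` and the function `is_associated` are hypothetical names introduced here, and the intervals are assumed to be half-open (the end address is one past the last address used); the application does not specify either detail.

```python
from typing import NamedTuple


class AddressInterval(NamedTuple):
    start: int  # first address used by the data operation signal
    end: int    # one past the last address used (half-open interval, an assumption)


def is_associated(first: AddressInterval, zeroth: AddressInterval) -> bool:
    """Return True when the two storage address intervals share at least one address."""
    return max(first.start, zeroth.start) < min(first.end, zeroth.end)


# The s-th signal needs [8, 24) and the (s-1)-th signal needs [0, 16): they overlap,
# so the s-th signal would be held back until the (s-1)-th signal has been executed.
print(is_associated(AddressInterval(8, 24), AddressInterval(0, 16)))
# Adjacent intervals [16, 32) and [0, 16) share no address, so there is no association.
print(is_associated(AddressInterval(16, 32), AddressInterval(0, 16)))
```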

In this embodiment, before the data processing device operates according to the data operation signals, the data operation signals that have not yet been used are stored in order and, when used, are parsed and decoded in order. During parsing and decoding, the association between two adjacent data operation signals is checked to ensure consistency among the data operation signals. This orderly preparation at the front end guarantees the order in which the corresponding operations are subsequently performed according to the data operation signals, which improves data access efficiency and thus greatly increases the access speed of the machine learning chip during data access.

An embodiment of the present application further provides a data processing method. The method can be applied to the hardware circuit shown in FIG. 31. The circuit includes a machine learning device 11, a transmission circuit 12, and a shared memory 13. The machine learning device 11 and the transmission circuit 12, and the transmission circuit 12 and the shared memory 13, are connected through interfaces, which may be implemented as hardware circuits. This embodiment does not limit the specific hardware form of these interfaces. The transmission circuit 12 is configured to obtain, from the shared memory 13, the input data required by the machine learning device 11 according to a data operation signal sent by the machine learning device 11, and to return the input data to the machine learning device 11. The machine learning device 11 is configured to perform a machine learning operation on the input data to obtain output neuron data, and to transmit the output neuron data, as new input neuron data, through the transmission circuit 12 to the shared memory 13 for storage.

To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described here only explain the present application and do not limit it. The data processing method according to the embodiments of the present application aims to solve the technical problem of how to increase the access speed of a machine learning chip during data access when a large amount of data is accessed or held in shared storage. The technical solutions of the present application, and how they solve this technical problem, are described in detail below with reference to the embodiments and the accompanying drawings. The following embodiments may be combined with one another, and identical or similar concepts or processes may not be repeated in some embodiments. Note that in the data processing method according to the present application the executing entity is the transmission circuit; the executing entity may also be a data processing device, which may be implemented as part or all of a data analysis terminal in software, hardware, or a combination of the two.

In one embodiment, FIG. 38 shows a data processing method. This embodiment concerns the specific process in which the transmission circuit determines the type of a data operation signal according to its type flag bit, maps the signal to the corresponding operation, and then obtains the data required by the machine learning device from the shared memory according to that operation, thereby increasing the access speed. As shown in FIG. 38, the method includes the following steps.

S4101: The transmission circuit in the data processing device receives a data operation signal sent by the machine learning device in the data processing device. The data operation signal carries a type flag bit of the data operation signal and information of the to-be-operated data.

Here, the machine learning device may be a device with computing capability composed of a plurality of MLUs, the transmission circuit may be a broadcast bus, and the shared memory may be non-volatile and/or volatile memory, including but not limited to random access memory (RAM) and cache memory. In this embodiment, the transmission circuit in the data processing device receives the data operation signal sent by the machine learning device in the data processing device; the data operation signal carries a type flag bit and information of the to-be-operated data. The data operation signal may be passed between the transmission circuit and the machine learning device through an interface. From the type flag bit and the information of the to-be-operated data carried in the data operation signal, the transmission circuit can determine the type of the data operation signal and the data information needed for the operation.

S4102: The transmission circuit determines, according to the type flag bit of the data operation signal, the operation to perform on data in the shared memory, performs that operation on the to-be-operated data according to the information of the to-be-operated data to obtain the input data required by the machine learning device, and returns the input data to the machine learning device.

Based on the data operation signal received from the machine learning device in step S4101, the transmission circuit determines from the type flag bit of the data operation signal which operation to perform on the data in the shared memory, and determines from the information of the to-be-operated data in the data operation signal which data in the shared memory the operation is to be performed on (these data are the to-be-operated data). It then obtains the input data required by the machine learning device and returns the input data to the machine learning device. The input data is the data that the machine learning device needs in order to perform a machine learning operation. The data may be data stored in the shared memory in advance, or data output by the machine learning device after it has performed a machine learning operation.

S4103: The machine learning device performs a machine learning operation on the input data to obtain output data, and transmits the output data, as new input data, through the transmission circuit to the shared memory for storage.

In this step, the machine learning device performs a machine learning operation on the input data sent by the transmission circuit in step S4102 to obtain output data. It then transmits the output data, as new input data, through the transmission circuit to the shared memory for storage. The following takes a neural network operation as an example of the operation performed by the machine learning device. The neural network operation may be the operation of one layer of a neural network. For a multi-layer neural network, the process is as follows. In the forward operation, after the previous layer of the artificial neural network has finished executing, the operation instruction of the next layer takes the output neuron data computed by the operation unit as the input neuron data of the next layer (or performs certain operations on the output neuron data before using it as the input neuron data of the next layer), and the weights are likewise replaced with the weights of the next layer. In the backward operation, after the backward operation of the previous layer of the artificial neural network has finished, the operation instruction of the next layer takes the input neuron gradients computed by the operation unit (which may likewise be treated as input neuron data) as the output neuron gradients of the next layer (which may likewise be treated as output neuron data), or performs certain operations on the input neuron gradients before using them as the output neuron gradients of the next layer, and the weights are replaced with the weights of the next layer. Optionally, the neural network in the embodiments of the present application may be not only an artificial neural network but also a spiking neural network; this embodiment does not limit this. The machine learning device according to this embodiment can perform machine learning operations on the input data. For example, for a multi-layer neural network, the machine learning device can compute the neuron data output by each layer, and can perform on the plurality of input data at the input of each layer the set of operations that make up a machine learning operation, such as multiplication, summation, and function evaluation. After the machine learning device obtains the output neuron data of the current layer through a machine learning operation, it can use that output neuron data as the input neuron data of the next layer and perform the machine learning operation again. Before doing so, it may first write the output neuron data of the current layer to the shared memory through the transmission circuit for storage, so that the machine learning device can read the data at any time to perform machine learning operations.
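
The layer-by-layer forward operation described above, in which the output neuron data of one layer is written back and then read as the input neuron data of the next layer, can be sketched as follows. This is a simplified model under stated assumptions: the shared memory is a plain dictionary, each layer is a fully connected layer followed by ReLU, and the names `layer_forward`, `run_forward`, and the key `"neurons"` are hypothetical.

```python
def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def layer_forward(inputs: list, weights: list) -> list:
    """One layer: multiply, sum, then apply a function (ReLU here) per output neuron."""
    return [relu(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def run_forward(shared_memory: dict, layer_weights: list) -> list:
    """Run the layers in order, exchanging neuron data through the shared memory."""
    for weights in layer_weights:              # the weights are replaced for each layer
        inputs = shared_memory["neurons"]      # input data read back from shared memory
        outputs = layer_forward(inputs, weights)
        shared_memory["neurons"] = outputs     # output stored as the new input data
    return shared_memory["neurons"]


shared_memory = {"neurons": [1.0, 2.0]}
layer_weights = [
    [[1.0, 1.0], [1.0, -1.0]],  # layer 1: two output neurons
    [[2.0, 3.0]],               # layer 2: one output neuron
]
print(run_forward(shared_memory, layer_weights))
```

In this example the first layer produces [3.0, 0.0], which the second layer reads back as its input.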

In the data processing method according to this embodiment, the transmission circuit obtains the input data required by the machine learning device from the shared memory according to the data operation signal that the machine learning device sends through the sending interface and that carries the type flag bit and the information of the to-be-operated data, and returns the input data to the machine learning device through the receiving interface. The machine learning device then performs a machine learning operation on the input data to obtain output data, and transmits the output data, as new input data, through the transmission circuit to the shared memory for storage. Because the data operation signal carries a type flag bit and information of the to-be-operated data, the transmission circuit, after receiving the data operation signal, determines the type of the signal from its type flag bit and then performs the corresponding operation using the information of the to-be-operated data carried in the signal. Classifying the signal by its type flag bit first allows it to be mapped quickly to the corresponding operation, which simplifies the data access logic, improves data access efficiency, and greatly increases the access speed of the machine learning chip during data access.
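
Mapping a type flag bit directly to an operation can be pictured as a lookup table. The sketch below is only an illustration: the class `DataOperationSignal`, the flag values "READ" and "WRITE", and the handler names are all hypothetical and are not encodings defined by the application.

```python
from dataclasses import dataclass


@dataclass
class DataOperationSignal:
    type_flag: str        # hypothetical encoding: "READ" or "WRITE"
    src: int              # source address in the shared memory
    length: int           # length of the to-be-operated data
    payload: bytes = b""  # used only by the hypothetical WRITE operation


def do_read(memory: bytearray, sig: DataOperationSignal) -> bytes:
    return bytes(memory[sig.src:sig.src + sig.length])


def do_write(memory: bytearray, sig: DataOperationSignal) -> bytes:
    if len(sig.payload) != sig.length or sig.src + sig.length > len(memory):
        raise ValueError("payload does not fit the addressed region")
    memory[sig.src:sig.src + sig.length] = sig.payload
    return b""


# The type flag selects the operation in one table lookup.
OPERATIONS = {"READ": do_read, "WRITE": do_write}


def execute(memory: bytearray, sig: DataOperationSignal) -> bytes:
    operation = OPERATIONS.get(sig.type_flag)
    if operation is None:
        raise ValueError(f"unknown type flag: {sig.type_flag!r}")
    return operation(memory, sig)


memory = bytearray(range(8))
print(list(execute(memory, DataOperationSignal("READ", src=2, length=3))))
```

A table lookup keeps the classification step separate from the operations themselves, so a new signal type only needs a new table entry.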

In one embodiment, the machine learning device includes at least one machine learning unit, the data operation signal further includes a data reception flag bit, and returning the input data to the machine learning device includes: the transmission circuit determining, according to the value of the data reception flag bit, the target machine learning unit that is to receive the input data, and sending the input data to the target machine learning unit.

In this embodiment, the data signal operations performed by the at least one machine learning unit (that is, MLU) included in the machine learning device may share one data receiving interface. Each MLU exchanges signals or data with the transmission circuit through a sending interface or the shared data receiving interface. Note that both the sending interface and the shared data receiving interface may be implemented as hardware circuits; this embodiment does not limit their types. The data operation signal includes a data reception flag bit, which indicates the target machine learning unit that can receive the input data. As one example of marking, a target machine learning unit that can receive the input data is marked 1. It should be understood that marking such a unit as 1 is only one option; in practice a target machine learning unit that can receive data may instead be marked 0. This embodiment does not limit the specific marking format of the data reception flag bit. Specifically, the transmission circuit determines the target MLU that is to receive the input data according to the value of the data reception flag bit in the data operation signal and sends the input data to that target MLU. In this embodiment, the transmission circuit can determine, from how the data reception flag bit in the data operation signal is marked, which target machine learning unit in the machine learning device can receive the input data. Because the data reception flag bit in the data operation signal determines which machine learning units receive the data, the memory access logic during data access is simplified, data access efficiency is improved, and the access speed of the machine learning chip during data access is greatly increased.
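
Selecting the target machine learning units from the data reception flag bits can be sketched as below. The sketch assumes one flag per MLU, held in a list indexed by MLU number; the application does not fix this layout, and the function name `target_mlus` and the parameter `active_value` are hypothetical. The `active_value` parameter reflects the statement that the marking may be 1 or 0.

```python
def target_mlus(reception_flags: list, active_value: int = 1) -> list:
    """Return the indices of the MLUs whose reception flag marks them as targets."""
    return [index for index, flag in enumerate(reception_flags) if flag == active_value]


flags = [1, 0, 1, 0]  # assumed layout: one data reception flag per MLU
print(target_mlus(flags))                  # targets when 1 marks a receiving unit
print(target_mlus(flags, active_value=0))  # targets when 0 marks a receiving unit
```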

Optionally, when the value of the type flag bit of the data operation signal is CAST, the transmission circuit determines that the data operation signal is a broadcast or multicast instruction. In this optional approach, the type flag bit of the data operation signal indicates the operation type of the signal, and a type flag bit of CAST indicates that the signal is a broadcast or multicast instruction. It should be understood that using CAST to denote a broadcast or multicast instruction is only one embodiment; the user may redefine the type flag bit according to the actual situation. This embodiment does not limit this.

Optionally, the type flag bit of the data operation signal may include a first type flag bit and a second type flag bit. The first type flag bit indicates whether the data operation signal is an I/O instruction, and the second type flag bit indicates whether the data operation signal is a broadcast or multicast instruction among the I/O instructions. Accordingly, if the value of the first type flag bit is I/O, the transmission circuit determines that the data operation signal is an I/O instruction; if the value of the second type flag bit is 1, the transmission circuit determines that the data operation signal is a broadcast or multicast instruction among the I/O instructions.

In this optional approach, the data operation signal includes two type flag bits. The first type flag bit indicates the type of the data operation signal, and the second type flag bit is set in the operation information of the data operation signal and indicates the specific subtype of the signal. Specifically, when the value of the first type flag bit in the data operation signal is I/O, the transmission circuit determines that the signal is an input/output instruction; when the value of the second type flag bit in the data operation signal is 1, the transmission circuit determines that the signal is a broadcast or multicast instruction among the input/output instructions.
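
The two-level classification by the first and second type flag bits can be sketched as a small decision function. The flag values "I/O" and 1 come from the description above; the function name `classify_signal` and the returned labels are hypothetical and serve only to make the decision order visible.

```python
def classify_signal(first_type_flag: str, second_type_flag: int) -> str:
    """Classify a data operation signal from its two type flag bits."""
    if first_type_flag != "I/O":
        return "not an I/O instruction"
    if second_type_flag == 1:
        return "I/O instruction: broadcast or multicast"
    return "I/O instruction: other"


print(classify_signal("I/O", 1))
print(classify_signal("I/O", 0))
```

The first flag is checked before the second because the second flag only has a meaning within the I/O instructions.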

In one embodiment, FIG. 39 provides a data processing method. This embodiment concerns the specific process in which the transmission circuit reads data from the shared memory according to the data information carried in the data operation signal and returns the read data to the target machine learning unit according to the data operation information. As shown in FIG. 39, if the information of the to-be-operated data includes the source address of the to-be-operated data in the shared memory, the length of the to-be-operated data, and the data return address after the data has been operated on, S4103 includes the following steps.

S4201: The transmission circuit reads the shared memory starting from the source address to obtain input data that satisfies the data length.

In this embodiment, because the information of the to-be-operated data in the data operation signal includes the source address of the to-be-operated data in the shared memory, the length of the to-be-operated data, and the data return address after the data has been operated on, the transmission circuit reads data starting from the source address in the shared memory and, following a preset rule, reads until the length of the to-be-operated data is satisfied. The length of the to-be-operated data is set by the user according to the actual situation; this embodiment does not limit it. The transmission circuit obtains the input data that satisfies the data length by reading data of that length from the shared memory according to the preset rule, which is likewise a rule the user defines according to the actual situation and is not limited by this embodiment. For example, the data may be read one item at a time from the source address until the length of the data read satisfies the data length.
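
The example rule mentioned above, reading one item at a time from the source address until the data length is satisfied, can be sketched as follows. The shared memory is modeled as a Python list, and the function name `read_until_length` is hypothetical; other preset rules are equally possible.

```python
def read_until_length(memory: list, src: int, data_length: int) -> list:
    """Read items one by one from src until data_length items have been collected."""
    if src < 0 or data_length < 0:
        raise ValueError("source address and data length must be non-negative")
    collected = []
    addr = src
    while len(collected) < data_length:
        if addr >= len(memory):
            raise ValueError("read would fall outside the shared memory")
        collected.append(memory[addr])  # read a single item
        addr += 1                       # advance to the next address
    return collected


shared_memory = list(range(100, 110))
print(read_until_length(shared_memory, src=3, data_length=4))
```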

S4202: The transmission circuit returns the input data to the target machine learning unit according to the data return address and the data reception flag bit.

The transmission circuit returns the input data of the required data length obtained in step S4201 to the data return address given in the information of the to-be-operated data. The data return address in the information of the to-be-operated data may be an address within a plurality of target machine learning units of the machine learning device. The transmission circuit determines, according to the data reception flag bit carried in the data operation signal, which target machine learning unit in the machine learning device the data is returned to.
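
Returning the input data to the data return address of each target machine learning unit can be sketched as below. This is an assumed model only: each MLU is given a local `bytearray` memory, the same return address is used in every target unit, a flag value of 1 marks a target, and the function name `return_to_targets` is hypothetical.

```python
def return_to_targets(input_data: bytes, return_addr: int,
                      reception_flags: list, mlu_memories: list) -> list:
    """Write input_data at return_addr in every MLU whose reception flag is 1."""
    targets = [i for i, flag in enumerate(reception_flags) if flag == 1]
    end = return_addr + len(input_data)
    # Check every target first so that a failure leaves all memories unchanged.
    for i in targets:
        if return_addr < 0 or end > len(mlu_memories[i]):
            raise ValueError(f"return address out of range for MLU {i}")
    for i in targets:
        mlu_memories[i][return_addr:end] = input_data
    return targets


mlu_memories = [bytearray(4), bytearray(4), bytearray(4)]
print(return_to_targets(b"\x01\x02", 1, [1, 0, 1], mlu_memories))
print([bytes(m) for m in mlu_memories])
```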

In the data processing method according to this embodiment, the transmission circuit reads the shared memory starting from the source address to obtain input data that satisfies the data length, and returns the input data to the target machine learning unit according to the data return address and the data reception flag bit. Because the transmission circuit, when obtaining the input data that satisfies the data length, reads the data according to the reading rule indicated by the data operation information in the data operation signal, the data reading logic of the transmission circuit is simplified, data access efficiency is improved, and the access speed of the machine learning chip during data access is greatly increased.

In one embodiment, FIG. 40 provides a data processing method. Based on any of the above embodiments, the operation information may include jump information, and the jump information includes a jump width and a jump data length to be operated on after each jump. This embodiment concerns the specific process by which the transmission circuit reads data in the shared memory according to the jump information in the operation information. As shown in FIG. 40, step S4201 includes the following steps.

S4301: The transmission circuit reads the shared memory starting from the source address and obtains first jump data according to the jump data length after the current jump.

In this embodiment, the operation information of the data operation signal includes jump information. The jump information instructs the transmission circuit to follow the rule of the jump information when it reads the information of the data to be operated on according to the data operation signal. Here, the jump information includes a jump width and a jump data length to be operated on after each jump, and the jump data length may be a preset data length. Optionally, the jump information includes stride jump information and/or segment jump information. The stride jump information indicates the width of each jump of the data operation signal, and the segment jump information indicates the preset size of each segment of the data operation signal.

Specifically, the transmission circuit reads the shared memory starting from the source address in the information of the data to be operated on and, after the current jump, determines the data of the jump data length that it has read as the first jump data. The first jump data denotes the data obtained after the transmission circuit jumps over data of a preset length during reading. The preset length is set by the user according to actual conditions and is not limited in this embodiment.

S4302: The transmission circuit obtains the last address of the first jump data and jumps from that last address to a target jump address according to the jump width.

Based on the first jump data read in step S4301, the transmission circuit obtains the last address of the first jump data and, according to the jump width in the jump information (for example, the stride width), jumps by the length of the jump width from the last address of the first jump data to the target jump address. It should be understood that the length between the last address of the first jump data and the target jump address is the jump width in the jump information.

S4303: Starting from the target jump address, the transmission circuit obtains second jump data according to the jump data length after the jump, and continues until the length of the jump data obtained after each jump satisfies the data length.

In this step, when reading data, the transmission circuit starts from the target jump address determined in step S4302, reads data of the preset jump data length, and determines that data as the second jump data. If the length between the address of the second jump data and the source address at which the jumping started satisfies the data length required by the machine learning device, the reading of the data required by the machine learning device is complete. If it does not, the transmission circuit continues from the last address of the second jump data, jumping and reading in the order of steps S4301 to S4303, until the length between the address of the second jump data and the source address at which the jumping started satisfies the data length required by the machine learning device, that is, until the reading of the required data is complete.

The implementation principle and technical effect of the data processing method according to this embodiment are similar to those of the embodiments of the data processing device described above and are not repeated here. In the data processing method according to this embodiment, the transmission circuit starts reading the shared memory from the source address and obtains the first jump data according to the jump data length after the current jump. It then jumps from the last address of the first jump data to the target jump address according to the jump width and, starting from the target jump address, obtains the second jump data according to the jump data length after the jump, continuing until the length of the jump data obtained after each jump satisfies the data length. Accordingly, when the operation information includes jump information, the transmission circuit reads data according to the jump rule of the jump information. This simplifies the data reading logic of the transmission circuit, improves data access efficiency, and increases the access speed of the machine learning chip during data access.
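The jump-based read of steps S4301 to S4303 can be illustrated with a minimal Python sketch. It is not the patented circuit: it assumes that the shared memory is modeled as a flat sequence, that the jump width counts the items skipped between the end of one segment and the start of the next, and that reading stops once the collected data reaches the requested data length. The name `strided_read` and its parameters are hypothetical.

```python
from typing import List, Sequence


def strided_read(memory: Sequence[int], source: int, data_length: int,
                 stride: int, segment: int) -> List[int]:
    """Collect `data_length` items starting at `source`, reading `segment`
    items at a time and skipping `stride` items between segments.

    Assumption: `stride` is measured from the end of one segment to the
    start of the next, as in step S4302.
    """
    if segment <= 0 or stride < 0:
        raise ValueError("segment must be positive and stride non-negative")
    out: List[int] = []
    addr = source
    while len(out) < data_length:
        # S4301 / S4303: read one segment, clipped to what is still needed.
        take = min(segment, data_length - len(out))
        if addr + take > len(memory):
            raise IndexError("read runs past the end of the modeled memory")
        out.extend(memory[addr:addr + take])
        # S4302: jump from the end of this segment by the jump width.
        addr = addr + take + stride
    return out


# Read 6 items from address 2, two at a time, skipping 3 items per jump.
print(strided_read(list(range(20)), source=2, data_length=6, stride=3, segment=2))
# prints [2, 3, 7, 8, 12, 13]
```

Because the whole access pattern is described by three numbers (jump width, segment length, and total length), the reader needs only one loop, which is the simplification of the reading logic that the embodiment describes.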

When the transmission circuit operates according to a received data operation signal, the signal as initially received is a coded instruction, so the data operation signal must first be decoded and parsed. Therefore, as shown in FIG. 41, in the data processing method according to an embodiment of the present application, the step in which the transmission circuit in the data processing device receives the data operation signal sent by the machine learning device in the data processing device includes the following steps.

S4401: The transmission circuit parses the data operation signal to obtain the type flag bit of the data operation signal and the information of the data to be operated on.

It should be noted that in a typical data processing flow there are many data operation signals, and while the transmission circuit processes one of them, the others must be stored. Specifically, the transmission circuit parses the data operation signal to obtain the data information carried in the signal and the type flag bit of the signal. Here, the data operation information includes information such as the length of the data to be operated on, the target address, and the source address, which is not limited in this embodiment.

S4402: The transmission circuit executes the parsed data operation signals according to an instruction queue. The instruction queue indicates the order in which the data operation signals are executed.

It should be understood that the data operation signals must be executed sequentially in order. Based on the data operation information and the type flag bit obtained when the transmission circuit parsed the data operation signal in step S4401, the transmission circuit executes the parsed data operation signals according to the instruction queue.

In the data processing method according to this embodiment, the transmission circuit parses the data operation signal to obtain its type flag bit and the information of the data to be operated on, and then executes the parsed data operation signals according to the instruction queue. Because each data operation signal is parsed before it is executed and the signals are then executed in order, the speed at which the transmission circuit performs operations according to the data operation signals is greatly improved.
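Steps S4401 and S4402 can be sketched as a decode stage feeding a first-in, first-out queue. This passage does not give the bit layout of a data operation signal, so the 56-bit layout below (an 8-bit type flag followed by 16-bit source, target, and length fields) is purely hypothetical, as are the names `DataOpSignal`, `decode`, and `run_in_order`.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class DataOpSignal:
    type_flag: int  # type flag bit(s) of the signal (hypothetical encoding)
    source: int     # source address in shared memory
    target: int     # target (return) address
    length: int     # length of the data to be operated on


def decode(word: int) -> DataOpSignal:
    """S4401: parse one coded word.

    Hypothetical layout: [flag:8][source:16][target:16][length:16].
    """
    return DataOpSignal(
        type_flag=(word >> 48) & 0xFF,
        source=(word >> 32) & 0xFFFF,
        target=(word >> 16) & 0xFFFF,
        length=word & 0xFFFF,
    )


def run_in_order(words: Iterable[int],
                 execute: Callable[[DataOpSignal], None]) -> None:
    """S4402: hold the parsed signals in a queue and execute them in order."""
    queue = deque(decode(w) for w in words)
    while queue:
        execute(queue.popleft())
```

A `deque` is used here only because it gives constant-time removal from the front; any ordered buffer that preserves arrival order would serve the same illustrative purpose.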

Considering that the transmission circuit, when executing data operation signals in queue order, may have to execute data operation signals that are related to one another, the present application provides another embodiment. As shown in FIG. 42, before the transmission circuit executes the parsed data operation signals according to the instruction queue, the method further includes the following steps.

S4501: The transmission circuit determines the dependency between adjacent parsed data operation signals to obtain a judgment result. The dependency indicates whether the s-th data operation signal is related to the (s-1)-th data operation signal that precedes it.

Here, the transmission circuit must determine the dependency between adjacent parsed data operation signals and decide, according to the judgment result, how to process two adjacent data operation signals that are related. The s-th data operation signal denotes any one of the data operation signals rather than a particular signal. The (s-1)-th data operation signal denotes the signal immediately preceding the s-th data operation signal.

Optionally, one way for the transmission circuit to determine the dependency between adjacent parsed data operation signals is as follows. The transmission circuit obtains a first storage address interval, from which the data required by the s-th data operation signal is extracted according to the s-th data operation signal, and a zeroth storage address interval, from which the data required by the (s-1)-th data operation signal is extracted according to the (s-1)-th data operation signal. If the first storage address interval and the zeroth storage address interval overlap, the transmission circuit determines that the s-th data operation signal and the (s-1)-th data operation signal have a dependency. If the two intervals do not overlap, the transmission circuit determines that the s-th data operation signal and the (s-1)-th data operation signal have no dependency. In other words, the transmission circuit judges the dependency between adjacent parsed data operation signals from the relationship between the first storage address interval of the s-th data operation signal and the zeroth storage address interval of the (s-1)-th data operation signal: no overlap means no dependency, and an overlap means a dependency.

S4502: If the judgment result is that the s-th data operation signal and the (s-1)-th data operation signal have a dependency, the s-th data operation signal is cached, and after the (s-1)-th data operation signal has been executed, the s-th data operation signal is extracted.

Based on the dependency between two adjacent data operation signals determined by the transmission circuit in the preceding step, the data operation signals are executed in order. If the judgment result is that the s-th data operation signal and the (s-1)-th data operation signal have a dependency, the transmission circuit first caches the s-th data operation signal and extracts it again only after execution of the (s-1)-th data operation signal is complete.
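The interval-overlap test of steps S4501 and S4502 can be sketched in a few lines of Python. This is an illustrative model only: it assumes each storage address interval is a half-open pair `(start, end)`, and the names `overlaps` and `must_wait` are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open [start, end) address range (assumption)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open address intervals share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


def must_wait(intervals: Sequence[Interval]) -> List[bool]:
    """For each signal s, report whether it depends on signal s-1 (S4501).

    A True entry means signal s would be cached until signal s-1 has
    finished executing (S4502). The first signal has no predecessor.
    """
    waits = [False] * len(intervals)
    for s in range(1, len(intervals)):
        waits[s] = overlaps(intervals[s], intervals[s - 1])
    return waits


print(must_wait([(0, 16), (8, 24), (32, 48), (48, 64)]))
# prints [False, True, False, False]
```

In the example, the second signal touches addresses 8 to 15, which the first signal also uses, so it is held back. The last two intervals only meet at their boundary and are treated as independent under the half-open convention assumed here.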

However, as machine learning algorithms continue to develop, machine learning chips with more and more architectures are emerging. Such chips frequently need to access or process data in shared storage in several ways, such as unicast reads and broadcasts, so a corresponding plurality of transmission interfaces must be provided. This increases the area of the machine learning chip.

Therefore, simplifying the transmission interface of the machine learning chip in order to reduce the chip area is a technical problem that those skilled in the art urgently need to solve.

The data processing device according to an embodiment of the present application may be implemented in software, hardware, or a combination of the two. The data processing device may be part or all of what is shown in FIG. 43. The data processing device may include a machine learning device 11, a transmission circuit 12, and a shared memory 13. The machine learning device 11 may include at least one machine learning unit 15, and the unicast read operation and the broadcast operation performed by the machine learning unit 15 share one data receiving interface 142. The machine learning unit is connected to the transmission circuit 12 through a sending interface 141 and the shared data receiving interface 142, and the transmission circuit 12 is connected to the shared memory 13. The transmission circuit 12 obtains the input data required by the machine learning device from the shared memory 13 according to the data operation signal that the machine learning device 11 sends through the sending interface 141, and returns the input data to the machine learning device 11 through the shared data receiving interface 142. It should be noted that the machine learning unit 15 may include a first transmission interface 14 (not shown), and the first transmission interface may include the sending interface 141 and the shared data receiving interface 142.

Optionally, the machine learning device 11 may perform a machine learning operation on the input data to obtain output data. Optionally, the machine learning device 11 may also transmit the output data through the transmission circuit 12 to the shared memory 13 for storage. Specifically, when the machine learning device 11 is used to perform a neural network operation, it performs the artificial neural network operation on the input neuron data and weights to obtain output neuron data, and transmits the output neuron data, as new input neuron data, through the transmission circuit 12 to the shared memory 13 for storage.

It should be noted that the machine learning unit, the transmission circuit, the shared memory, and the various interfaces may all be implemented as hardware circuits. For example, the transmission circuit may be a broadcast bus. The shared memory may be non-volatile and/or volatile memory, including but not limited to random access memory (RAM) and high-speed cache memory. Each interface may correspond to one or more data I/O (in/out) interfaces or I/O pins.

The data processing device according to the present application may be applied to machine learning operations, which include neural network operations, k-means operations, support vector machine operations, and the like. Optionally, when the machine learning device performs a neural network computation, the input data may include input neuron data and/or weights, which are the data that must be supplied when the machine learning device performs an artificial neural network operation. Correspondingly, the output data may include output neuron data, which is an intermediate or final result output when the machine learning device performs an artificial neural network operation. It should be understood that, because weights and neuron data can be reused, the input data in a given computation need not include both input neuron data and weights; it may include only input neuron data or only weights.

Taking a neural network operation as an example (unless otherwise stated, this embodiment is described using neural network operations), the data processing device according to the present application can perform a single-layer operation of a neural network or a multi-layer operation of a neural network. A multi-layer neural network is implemented as follows. In the forward operation, after the previous layer of the artificial neural network has finished executing, the operation instruction of the next layer takes the output neuron data computed in the computation unit as the input neuron data of the next layer (or performs certain operations on that output neuron data before using it as the input neuron data of the next layer), and the weights are likewise replaced with the weights of the next layer. In the backward operation, after the backward operation of the previous layer has finished, the operation instruction of the next layer takes the input neuron gradient computed in the computation unit (which may likewise be treated as input neuron data) as the output neuron gradient of the next layer (which may likewise be treated as output neuron data), or performs certain operations on that input neuron gradient before using it as the output neuron gradient of the next layer, and the weights are replaced with the weights of the next layer.

Referring to FIG. 1, in one optional technical solution, the machine learning device 11 may include a plurality of machine learning units 15. For the operation of a multi-layer neural network, the computation of one layer in the forward operation is taken as an example. In one implementation, the machine learning device may compute the output neuron data of all neurons of that layer in parallel through a plurality of machine learning units (MLUs). For example, if the machine learning device includes 4 machine learning units and the layer has 100 neurons, each machine learning unit may be assigned 25 of those neurons, which can be achieved by setting the corresponding operation instructions. In this process, each machine learning unit obtains, from the shared memory through the transmission circuit, the input neuron data and weights corresponding to its assigned 25 neurons of the layer, computes the output neuron data of those 25 neurons, and transmits that output neuron data through the transmission circuit to the shared memory for storage. It can be understood that each machine learning unit may also process the data of its assigned neurons in parallel. Computing the neural network in parallel layer by layer in this way achieves parallel processing of the neural network computation and improves processing efficiency.
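The assignment of a layer's neurons to machine learning units can be sketched as a simple partition. The text gives only the evenly divisible case (100 neurons over 4 units); how a remainder would be distributed is an assumption of this sketch, and the function name `partition_neurons` is hypothetical.

```python
from typing import List


def partition_neurons(num_neurons: int, num_units: int) -> List[range]:
    """Split neuron indices 0..num_neurons-1 into contiguous blocks,
    one block per machine learning unit.

    Assumption: if the count does not divide evenly, the first units
    each take one extra neuron.
    """
    if num_units <= 0:
        raise ValueError("num_units must be positive")
    base, extra = divmod(num_neurons, num_units)
    blocks: List[range] = []
    start = 0
    for unit in range(num_units):
        size = base + (1 if unit < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


for unit, block in enumerate(partition_neurons(100, 4)):
    print(f"MLU {unit}: neurons {block.start}..{block.stop - 1}")
```

Each unit would then fetch only the input neuron data and weights for its own block through the transmission circuit, compute its outputs, and write them back to the shared memory.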

In another optional solution, the machine learning device may use a plurality of machine learning units to compute, in a fixed sequential order, the output neuron data of all neurons of each layer of the neural network. In this process, the preceding machine learning unit transmits the output neuron data of all neurons of its layer through the transmission circuit to the shared memory for storage, so that the next machine learning unit can retrieve that output neuron data and use it as the input neuron data of the next layer. It can be understood that this approach suits cases where the computation per layer is small, for example a neural network with few neurons in each layer.

Referring to FIG. 44, the machine learning unit is described in detail using machine learning unit 0 of FIG. 43 as an example. In one solution, the machine learning unit 15 may include a sending interface 141, a shared data receiving interface 142, at least one computation unit 151, and a controller unit 152 connected to the computation unit 151. The computation unit 151 includes one master processing circuit 151a and a plurality of slave processing circuits 151b, and the computation unit 151 is connected to the transmission circuit 12 through the sending interface 141 and the shared data receiving interface 142;

the controller unit 152 is configured to send the data operation signal and the output neuron data to the transmission circuit 12 through the sending interface 141, to receive through the shared data receiving interface 142 the input neuron data and the weights that the transmission circuit 12 obtained from the shared memory 13, and to send the input neuron data and the weights to the master processing circuit 151a and/or the slave processing circuits 151b;

the master processing circuit 151a is configured to send the input neuron data and/or the weights to the plurality of slave processing circuits 151b, and the plurality of slave processing circuits 151b are configured to perform intermediate operations in parallel on the neuron data and the weights to obtain a plurality of intermediate results and to transmit those intermediate results to the master processing circuit 151a. The master processing circuit 151a is further configured to perform subsequent processing on the plurality of intermediate results to obtain a computation result. Here, the subsequent processing may include an activation operation. Specifically, the controller unit 152 may obtain a computation instruction, parse it into a plurality of operation instructions, and send the plurality of operation instructions to the master processing circuit. In this embodiment, it can be understood that when the machine learning unit includes a plurality of computation units, the computation units may share the sending interface and the shared data receiving interface.
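The division of work between the master processing circuit and the slave processing circuits can be illustrated with a small sequential simulation. This is only a sketch under assumptions: the intermediate operation is taken to be a dot product per output neuron, weight rows are dealt out to slaves in an interleaved fashion, and ReLU stands in for the activation, none of which is specified in this passage. The function names are hypothetical, and real hardware would run the slaves in parallel.

```python
from typing import Callable, List, Sequence


def slave_compute(inputs: Sequence[float],
                  weight_rows: Sequence[Sequence[float]]) -> List[float]:
    """One slave: a dot product of the inputs with each assigned weight row."""
    return [sum(x * w for x, w in zip(inputs, row)) for row in weight_rows]


def master_compute(inputs: Sequence[float],
                   weights: Sequence[Sequence[float]],
                   num_slaves: int,
                   activation: Callable[[float], float] = lambda v: max(0.0, v),
                   ) -> List[float]:
    """Master: distribute weight rows, gather intermediate results, activate."""
    if num_slaves <= 0:
        raise ValueError("num_slaves must be positive")
    # Slave i receives rows i, i + num_slaves, i + 2 * num_slaves, ...
    chunks = [weights[i::num_slaves] for i in range(num_slaves)]
    partials = [slave_compute(inputs, chunk) for chunk in chunks]
    # Put each intermediate result back at its original row index.
    outputs = [0.0] * len(weights)
    for i, part in enumerate(partials):
        for j, value in enumerate(part):
            outputs[i + j * num_slaves] = activation(value)
    return outputs


print(master_compute([1.0, 2.0],
                     [[1.0, 1.0], [-1.0, -1.0], [0.5, 0.0], [0.0, -2.0]],
                     num_slaves=2))
# prints [3.0, 0.0, 0.5, 0.0]
```

The master touches each result only once, for the activation, while all multiply-accumulate work sits in the slaves, which mirrors the split between intermediate operations and subsequent processing described above.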

For example, in one optional technical solution, the master processing circuit may further include a controller unit, and that controller unit may include a master instruction processing unit, which is specifically used to decode operation instructions into micro-instructions. In another optional technical solution, the slave processing circuit may further include another controller unit, which includes a slave instruction processing unit specifically used to receive and process micro-instructions. A micro-instruction may be the next-level instruction of an instruction. It can be obtained by splitting or decoding the instruction and can be further decoded into control signals for each component, unit, or processing circuit. For example, a multiplication micro-instruction is the next-level instruction of a convolution instruction.

By way of example, the neural network operation procedure of the machine learning unit is described in detail below, taking the structure of the machine learning unit described above as an example. See steps S5101 to S5106 below.

S5101: An IO instruction is stored in advance at the first address of the instruction storage unit of the controller unit.

S5102: The controller unit reads the IO instruction from the first address of the instruction storage unit. According to the control signal decoded from the IO instruction, it either obtains the neural network operation instructions corresponding to the machine learning unit from the off-chip memory through the off-chip interface, or obtains the neural network calculation instructions corresponding to the machine learning unit from the shared memory through the transmission circuit, and then stores the obtained calculation instructions in the instruction storage unit.

S5103: The controller unit reads the next IO instruction from the instruction storage unit. According to the data operation signal decoded from the IO instruction, it reads all the data blocks required by the operation unit from the shared memory through the transmission circuit. The data blocks include the input neuron data and weights of the neurons of the layer to be allocated, and may further include an interpolation table used for fast activation function computation, a constant table used to configure parameters of the operation device, offset data, and the like. The data operation signal includes the source address of the data blocks in the shared memory.

S5104: The controller unit reads the next CONFIG (configuration) instruction from the instruction storage unit and, according to the control signal decoded from the CONFIG instruction, configures the various constants required for the neural network computation of this layer. For example, the operation unit sets the values of its internal registers according to the constants required by the activation function.

S5105: The controller unit reads the next COMPUTE (calculation) instruction from the instruction storage unit. According to the control signal decoded from the COMPUTE instruction (that is, the operation instruction), the operation unit transmits the input neuron data, weights, and operation instruction of the allocated neurons of this layer to the master processing circuit. The master processing circuit determines the input neuron data of the allocated neurons as broadcast data and the weights as distribution data, splits one piece of distribution data into a plurality of data blocks, and sends at least one of the data blocks, the broadcast data, and at least one of the plurality of operation instructions to the slave processing circuits. The slave processing circuits obtain intermediate results through their multiplication processing circuits, accumulation processing circuits, and the like, and the master processing circuit obtains the neuron data output by the allocated neurons of this layer from the intermediate results through its activation processing circuit and the like.

S5106: The controller unit reads the next IO instruction from the instruction storage unit. According to the data operation signal decoded from the IO instruction, it transfers the output neuron data through the transmission circuit to the shared memory for storage, where it serves as input neuron data for some of the neurons of the next layer. The data operation signal includes the destination address of the output neuron data in the shared memory.
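
The order of steps S5101 to S5106 can be sketched as a small trace generator. This is only an illustrative model: the opcode strings, the list used as the instruction storage unit, and the function name `run_layer` are assumptions made for the example, since the text does not give the actual instruction encoding.

```python
# Hypothetical opcode names taken from the step descriptions; the real
# instruction encoding is not given in the text.
LAYER_STEPS = [
    ("S5103", "IO", "read data blocks from shared memory"),
    ("S5104", "CONFIG", "set layer constants"),
    ("S5105", "COMPUTE", "run master/slave computation"),
    ("S5106", "IO", "write output neuron data to shared memory"),
]


def run_layer(fetched_program):
    """Walk one layer through S5101-S5106 and return a trace of the steps."""
    # S5101: a single IO instruction is pre-stored at the first address.
    instruction_store = ["IO"]
    pc = 0
    trace = []

    # S5102: the first IO instruction fetches the layer's instructions and
    # appends them to the instruction storage unit.
    assert instruction_store[pc] == "IO"
    pc += 1
    instruction_store.extend(fetched_program)
    trace.append(("S5102", "IO", "fetch calculation instructions"))

    # S5103-S5106: the remaining instructions are read strictly in order.
    for step, expected_opcode, action in LAYER_STEPS:
        opcode = instruction_store[pc]
        if opcode != expected_opcode:
            raise ValueError(f"{step}: expected {expected_opcode}, got {opcode}")
        pc += 1
        trace.append((step, opcode, action))
    return trace


trace = run_layer(["IO", "CONFIG", "COMPUTE", "IO"])
```

Each trace entry pairs a step label with the opcode that was read, so an out-of-order program is rejected instead of being silently executed.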

S5105 is illustrated below by example. Taking the fully connected operation in a neural network as an example, the computation of a given layer may be y=f(wx+b), where x is the input neuron matrix, w is the weight matrix, b is the offset scalar, and f is the activation function, which may be any one of the sigmoid, tanh, relu, and softmax functions. Assume that the master processing circuit and the slave processing circuits are in a binary tree relationship (a tree relationship) and that the operation unit has one master processing circuit and eight slave processing circuits. S5105 may then be implemented as follows. The controller unit obtains the input neuron matrix x, the weight matrix w, and a fully connected operation instruction from the shared memory and transmits them to the master processing circuit. The master processing circuit determines the input neuron matrix x as broadcast data and the weight matrix w as distribution data, splits the weight matrix w into eight submatrices, distributes the eight submatrices to the eight slave processing circuits through the tree module, and broadcasts the input neuron matrix x to the eight slave processing circuits. The slave processing circuits perform the multiplication and accumulation operations of the eight submatrices with the input neuron matrix x in parallel to obtain eight intermediate results, and send the eight intermediate results to the master processing circuit. The master processing circuit arranges the eight intermediate results in order to obtain the result of wx, applies the offset b to that result, and then performs the activation operation to obtain the final result y.
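
The fully connected example above can be sketched in a few lines. The sketch assumes that x is a single column vector and that w is split row-wise into equal blocks; the text does not say along which dimension w is split, and the helper names (`split_rows`, `matvec`, `fully_connected`) are illustrative only. The slave circuits are modeled as a plain loop rather than real parallel hardware.

```python
def matvec(rows, x):
    """Multiply-accumulate: one dot product per weight row."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in rows]


def split_rows(w, parts):
    """Split w row-wise into `parts` equal submatrices (assumed split axis)."""
    size, remainder = divmod(len(w), parts)
    if remainder:
        raise ValueError("row count must be divisible by the number of slaves")
    return [w[i * size:(i + 1) * size] for i in range(parts)]


def fully_connected(w, x, b, f, num_slaves=8):
    # Master: w is distribution data (split), x is broadcast data (shared).
    submatrices = split_rows(w, num_slaves)
    # Slaves: each multiplies its submatrix by the same broadcast x.
    intermediate = [matvec(block, x) for block in submatrices]
    # Master: arrange the intermediate results in order to rebuild wx,
    # add the scalar offset b, then apply the activation function.
    wx = [value for part in intermediate for value in part]
    return [f(value + b) for value in wx]


def relu(value):
    return max(0.0, value)


w = [[i, -i] for i in range(8)]   # 8 x 2 weight matrix
x = [1, 2]                        # input neuron vector
y = fully_connected(w, x, b=3, f=relu)
```

Because the blocks are concatenated in their original order, the result matches the unsplit computation f(wx+b).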

It can be understood that, because each machine learning unit can compute in parallel the neurons allocated to it in a given layer, the shared memory can store the output neuron data of all neurons of each layer as well as the input neuron data required by all neurons of the next layer, while the weights can either be reused or be obtained anew from the shared memory for a new layer of the neural network.

Note that each machine learning unit may include one operation unit or a plurality of operation units, and the structures of the operation units may be the same or different. The structure of each operation unit is embodied in the relationship between its master processing circuit and its slave processing circuits, which includes but is not limited to tree, H-type, and systolic array relationships. The technical solution of the present application configures the operation unit in a one-master-multiple-slave structure and can split data according to the calculation instruction of a forward operation. The computationally intensive part can thus be computed in parallel by the plurality of slave processing circuits, which increases operation speed, saves operation time, and in turn reduces power consumption.

The description now returns to the data processing device of this embodiment shown in FIG. 43. A unicast read operation is a read operation in unicast mode, and the corresponding data operation signal may be a unicast read instruction or a unicast read request. The data operation signal corresponding to a broadcast operation may be a broadcast instruction, a multicast instruction, a broadcast request, or a multicast request. For example, a unicast read instruction is a read instruction in unicast mode: it may be a read instruction, sent by a given machine learning unit, for the input neuron data and weights at a source address in the shared memory, and the input neuron data and weights must be returned to that machine learning unit. The input neuron data and weights may be those required by the allocated neurons while the machine learning unit computes the neurons allocated to it in a given layer according to a calculation instruction. Similarly, a unicast read request is a read request in unicast mode. A broadcast instruction is a read instruction, sent by a given machine learning unit, for the input neuron data and weights at a source address in the shared memory, and the input neuron data and weights must be returned to all machine learning units in the machine learning device. The input neuron data may be the input neuron data required by all neurons of a given layer, that is, all the output neuron data of the previous layer, and the weights may be reusable weights such as a convolution kernel. A multicast instruction differs from a broadcast instruction in that its data is returned not to all machine learning units in the machine learning device but to the plurality of machine learning units corresponding to the marker field in the multicast instruction. In general, an instruction differs from a request in that executing an instruction incurs relatively large overhead but the instruction carries more information, whereas executing a request incurs relatively small overhead but the request carries less information.
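
The difference between the three return targets can be sketched as follows. The encoding of the marker field as a bitmask with one bit per machine learning unit is an assumption for illustration; the text only says that the marker field identifies the target units. The function name and the string labels are likewise hypothetical.

```python
def return_targets(kind, source_unit, num_units, marker_field=0):
    """Return the indices of the machine learning units that receive the data.

    marker_field is modeled as a bitmask (bit i set means unit i is a target);
    this encoding is assumed, not taken from the text.
    """
    if kind == "unicast":
        # Data goes back only to the unit that issued the read.
        return [source_unit]
    if kind == "broadcast":
        # Data goes to every unit in the machine learning device.
        return list(range(num_units))
    if kind == "multicast":
        # Data goes only to the units selected by the marker field.
        return [i for i in range(num_units) if (marker_field >> i) & 1]
    raise ValueError(f"unknown data operation kind: {kind}")


unicast = return_targets("unicast", source_unit=2, num_units=4)
broadcast = return_targets("broadcast", source_unit=2, num_units=4)
multicast = return_targets("multicast", source_unit=2, num_units=4, marker_field=0b1010)
```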

In general, a machine learning unit needs at least two corresponding data interfaces to receive the data returned by unicast read operations and broadcast operations: one receives the unicast read data that the transmission circuit returns for a unicast read data operation signal, and the other receives the broadcast and/or multicast data that the transmission circuit returns for a broadcast and/or multicast data operation signal. In this embodiment, as shown in FIG. 1, machine learning unit 0 has only one receiving interface, which is a shared data reception interface. For example, interface c0 can receive both the unicast read data that the transmission circuit returns for a unicast read data operation signal and the broadcast and/or multicast data that the transmission circuit returns for a broadcast and/or multicast data operation signal.

It can be understood that, after the transmission circuit fetches the required input neuron data and weights from the shared memory, it may temporarily store them in a cache if one exists. The transmission circuit can then determine the source that requested the data, that is, the data return target (machine learning unit) corresponding to the data operation signal associated with the data, and send the data to the shared data reception interface. For a unicast read operation, the shared data reception interface is that of the single machine learning unit that is the return target of the data; for a broadcast operation, the shared data reception interfaces are those of the plurality of machine learning units that are the return targets of the data.
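
A minimal sketch of this delivery step, assuming each machine learning unit exposes exactly one receive queue that stands in for its shared data reception interface. The class name `MLUSketch`, the `shared_rx` attribute, and the `deliver` function are hypothetical names chosen for the example.

```python
from collections import deque


class MLUSketch:
    """Toy machine learning unit with a single shared receive interface."""

    def __init__(self, index):
        self.index = index
        # One queue for all returned data, standing in for interface c<index>.
        self.shared_rx = deque()


def deliver(units, targets, payload, kind):
    """Send returned data to the shared receive interface of every target."""
    for unit_index in targets:
        units[unit_index].shared_rx.append((kind, payload))


units = [MLUSketch(i) for i in range(3)]
deliver(units, targets=[1], payload="weights for unit 1", kind="unicast")
deliver(units, targets=[0, 1, 2], payload="layer input", kind="broadcast")
```

Unit 1 ends up with both a unicast and a broadcast payload in the same queue, which is the point of sharing the interface.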

Therefore, in the data processing device of this embodiment, when at least one machine learning unit performs unicast read operations and broadcast operations, the two kinds of operation share a single data reception interface in that machine learning unit. This effectively reduces the number of interfaces for returned data in the machine learning unit, saves hardware resources, and reduces hardware area and power consumption.

The transmission interface in the machine learning unit is described in detail below. Referring to FIG. 45, on the basis of FIG. 43, the transmission interface (141) may include a unicast read signal transmission interface (1411) and a broadcast signal transmission interface (1412). The machine learning unit (15) is connected to the transmission circuit (12) through the unicast read signal transmission interface (1411) and the shared data reception interface (142) to implement unicast read operations, and through the broadcast signal transmission interface (1412) and the shared data reception interface (142) to implement broadcast operations. For MLU0, the unicast read signal transmission interface corresponds to interface a0, the broadcast signal transmission interface corresponds to interface b0, and the shared data reception interface corresponds to interface c0. Interface a0 sends unicast read data operation signals to the transmission circuit, interface b0 sends broadcast and/or multicast data operation signals to the transmission circuit, and interface c0 can receive both the unicast read data that the transmission circuit returns for a unicast read data operation signal and the broadcast and/or multicast data that the transmission circuit returns for a broadcast and/or multicast data operation signal. This embodiment thus sends the different types of data operation signals through the unicast read signal transmission interface and the broadcast signal transmission interface respectively, which simplifies the processing logic.

In one embodiment, corresponding to the unicast read operation and the broadcast operation described above and referring to FIG. 45, the transmission circuit (12) in the data processing device may include a second transmission interface (120), a read/write processing circuit (121) connected to the second transmission interface (120), and an arbitration circuit (122) connected to the read/write processing circuit (121). The read/write processing circuit (121) is configured to receive the data operation signals sent by the at least one machine learning unit (15) through the transmission interface (141) and the second transmission interface (120), transmit the data operation signals to the arbitration circuit (122), and return the data that the arbitration circuit (122) obtains from the shared memory (13) to the machine learning unit corresponding to the data operation signal through the second transmission interface (120) and the shared data reception interface (142). The arbitration circuit (122) is configured to arbitrate the data operation signals received from the read/write processing circuit (121) according to a preset arbitration rule, and to operate on the data in the shared memory (13) according to the data operation signal that wins arbitration.
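
The division of labor between the read/write processing circuit and the arbitration circuit can be sketched as one service cycle. The dictionary layout of a data operation signal, the `priority_of` callback, and the use of a Python dict as shared memory are assumptions for illustration; the text does not define these formats.

```python
def serve_one(pending_signals, shared_memory, priority_of):
    """Arbitrate once, read shared memory, and name the unit to return data to.

    pending_signals: list of dicts with hypothetical keys 'unit' and 'address'.
    priority_of: stand-in for the preset arbitration rule (higher value wins).
    """
    if not pending_signals:
        return None
    # Arbitration circuit: pick the winning data operation signal.
    winner = max(pending_signals, key=priority_of)
    pending_signals.remove(winner)
    # Arbitration circuit: operate on shared memory for the winning signal.
    data = shared_memory[winner["address"]]
    # Read/write processing circuit: return the data to the issuing unit.
    return winner["unit"], data


shared_memory = {"x": [1, 2], "w": [[3, 4]]}
pending = [{"unit": 0, "address": "x"}, {"unit": 1, "address": "w"}]
first = serve_one(pending, shared_memory, priority_of=lambda s: s["unit"])
```

Signals that lose arbitration stay in the pending list and compete again on the next cycle.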

Specifically, the read/write processing circuit (121) can process unicast read signals and can also process broadcast signals and/or multicast signals. In one embodiment, the read/write processing circuit (121) may include a unicast read processing circuit, which can process unicast read signals as well as broadcast signals and/or multicast signals. When the unicast read processing circuit processes a broadcast and/or multicast signal, it receives the broadcast and/or multicast signal sent by at least one machine learning unit through the broadcast signal transmission interface and the second transmission interface, transmits the signal to the arbitration circuit, and sends the data that the arbitration circuit obtains from the shared memory, through the second transmission interface and the shared data reception interface, to the plurality of machine learning units corresponding to the signal in a preset order. The preset order is the order in which data is returned to the plurality of machine learning units; it may follow the priority of each machine learning unit, the numbering of the machine learning units, or another order.
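
The two return orders named above, by unit priority and by unit number, can be sketched as a single sorting helper. The function name, the priority table, and the tie-breaking by unit number are assumptions for the example.

```python
def return_order(targets, priority=None):
    """Order the target units for sequential data return.

    Without a priority table, units are served in number order. With one,
    higher priority is served first and ties fall back to number order
    (the tie-break is an assumption).
    """
    if priority is None:
        return sorted(targets)
    return sorted(targets, key=lambda unit: (-priority[unit], unit))


by_number = return_order([2, 0, 1])
by_priority = return_order([0, 1, 2], priority={0: 1, 1: 5, 2: 5})
```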

Alternatively, the read/write processing circuit (121) may include a unicast read processing circuit and a broadcast processing circuit, where the unicast read processing circuit processes unicast read signals and the broadcast processing circuit processes broadcast signals and/or multicast signals.

Here, the unicast read processing circuit is configured to receive the unicast read signals sent by at least one machine learning unit through the unicast read signal transmission interface and the second transmission interface, transmit the unicast read signals to the arbitration circuit, and send the data that the arbitration circuit obtains from the shared memory to the machine learning unit corresponding to the unicast read signal through the second transmission interface and the shared data reception interface. The broadcast processing circuit is configured to receive the broadcast and/or multicast signals sent by at least one machine learning unit through the broadcast signal transmission interface and the second transmission interface, transmit the broadcast and/or multicast signals to the arbitration circuit, and send the data that the arbitration circuit obtains from the shared memory to the plurality of machine learning units corresponding to the broadcast and/or multicast signal through the second transmission interface and the shared data reception interface.

The preset arbitration rule allows the arbitration circuit to determine the priorities of a plurality of data operation signals according to a given rule, so that it can determine which object to operate on according to the priority of each data operation signal; that is, it selects a high-priority data operation signal as the one that wins arbitration. For example, a data operation signal with a high transmission rate may be given high priority, and one with a low transmission rate may be given low priority. The preset arbitration rule may be, for example, a round-robin scheduling rule, a maximum carrier-to-interference ratio scheduling rule, a proportional fair rule, or the like. In addition, the arbitration circuit may use as an auxiliary arbitration rule whether the data path (interface to interface) between the machine learning unit and the read/write processing circuit is idle; that is, the data path corresponding to the data operation signal that wins arbitration must be idle.
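
As one possible reading of the rules above, the following sketch combines round-robin arbitration with the auxiliary idle-path rule. Representing pending signals and path state as per-unit boolean lists is an assumption for the example, as is the function name.

```python
def round_robin_arbitrate(pending, path_idle, last_granted):
    """Pick the next unit after last_granted that can be served.

    pending[i]   : unit i has a data operation signal waiting.
    path_idle[i] : the data path for unit i is idle (auxiliary rule).
    Returns the index of the winning unit, or None if nobody can be served.
    """
    count = len(pending)
    for offset in range(1, count + 1):
        unit = (last_granted + offset) % count
        if pending[unit] and path_idle[unit]:
            return unit
    return None


pending = [True, True, False, True]
path_idle = [True, False, True, True]
winner = round_robin_arbitrate(pending, path_idle, last_granted=0)
```

Starting the scan just after the last granted unit keeps any single unit from winning every cycle, while the idle check skips units whose data path is still busy.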

Specifically, the unicast read processing circuit may be connected to a plurality of machine learning units through the second transmission interface and process their unicast read operations. It may cache a plurality of unicast read instructions in a unicast read instruction cache queue inside the unicast read processing circuit, parse each unicast read instruction, and store the parsed result in a unicast read request cache queue inside the unicast read processing circuit to await arbitration by the arbitration circuit. A unicast read request, by contrast, can be cached in the unicast read request cache queue without being parsed. Similarly, the broadcast processing circuit may also be connected to a plurality of machine learning units through the second transmission interface, and may include a broadcast and/or multicast instruction cache queue and a broadcast and/or multicast request cache queue; the details are not repeated here. In one optional solution, the read/write processing circuit may include one unicast read processing circuit and one broadcast processing circuit.
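
The two cache queues can be sketched as follows. The sketch assumes that parsing an instruction yields a request-like record; the field names (`type`, `source`, `address`, `extra`) and the class name are hypothetical, since the text does not specify the signal formats.

```python
from collections import deque


class UnicastReadProcessingSketch:
    """Toy unicast read processing circuit with its two cache queues."""

    def __init__(self):
        self.instruction_queue = deque()  # unicast read instruction cache queue
        self.request_queue = deque()      # unicast read request cache queue

    def receive(self, signal):
        # Instructions must be parsed first; requests are queued directly.
        if signal["type"] == "instruction":
            self.instruction_queue.append(signal)
        else:
            self.request_queue.append(signal)

    def parse_pending(self):
        # Assumed parsing: keep only the fields the arbiter needs.
        while self.instruction_queue:
            instruction = self.instruction_queue.popleft()
            self.request_queue.append({
                "type": "request",
                "source": instruction["source"],
                "address": instruction["address"],
            })


circuit = UnicastReadProcessingSketch()
circuit.receive({"type": "instruction", "source": 0, "address": 0x100, "extra": "info"})
circuit.receive({"type": "request", "source": 1, "address": 0x200})
circuit.parse_pending()
```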

Therefore, in this embodiment unicast read operations are processed by the unicast read processing circuit and broadcast operations by the broadcast processing circuit. Handling the different types of data operations in different processing circuits simplifies the processing logic.

In one optional solution, referring to FIG. 45, on the basis of the data processing device of FIG. 43, the second transmission interface may be subdivided into interfaces that handle different types of data operations. Specifically, the second transmission interface (120) may include at least one group of a unicast read signal reception interface and a unicast read data transmission interface connected to the unicast read processing circuit, and at least one group of a broadcast signal reception interface and a broadcast data transmission interface connected to the broadcast processing circuit. The unicast read signal reception interface is connected to the unicast read signal transmission interface of the machine learning unit, the broadcast signal reception interface is connected to the broadcast signal transmission interface of the machine learning unit, and the unicast read data transmission interface and the broadcast data transmission interface of the transmission circuit are both connected to the shared data reception interface of the machine learning unit. In this embodiment, the different types of data operations are handled through separate interfaces within the second transmission interface, which simplifies the processing logic.

In one embodiment, referring to FIG. 45, the read/write processing circuit may be divided into a plurality of processing circuit groups, where each machine learning unit corresponds to one processing circuit group and each processing circuit group includes at least one unicast read processing circuit and one broadcast processing circuit. For example, MLU0 corresponds to unicast read processing circuit 0 and broadcast processing circuit 0, and MLUn corresponds to unicast read processing circuit n and broadcast processing circuit n. Likewise, the second transmission interface has a group of interfaces connected to one processing circuit group and to one machine learning unit, so as to implement the one-to-one connection between machine learning units and unicast read processing circuits and the one-to-one connection between machine learning units and broadcast processing circuits.

For example, for MLU0 and unicast read processing circuit 0, interface d0 in the second transmission interface serves as a unicast read signal reception interface. It is connected to the unicast read signal transmission interface a0 of MLU0 and to unicast read processing circuit 0, and can be used to receive the unicast read signals sent by MLU0 and pass them to unicast read processing circuit 0 for processing. Interface e0 in the second transmission interface serves as a unicast read data transmission interface. It is connected to the shared data reception interface c0 of MLU0 and to unicast read processing circuit 0, and can be used to receive from unicast read processing circuit 0 the input neuron data and weights corresponding to the unicast read signal and pass them to interface c0 in MLU0. For MLU0 and broadcast processing circuit 0, interface f0 in the second transmission interface serves as a broadcast signal reception interface. It is connected to the broadcast signal transmission interface b0 of MLU0 and to broadcast processing circuit 0, and can be used to receive the broadcast and/or multicast signals sent by MLU0 and pass them to broadcast processing circuit 0 for processing. Interface g0 in the second transmission interface serves as a broadcast data transmission interface. It is connected to the shared data reception interfaces ci of a plurality of MLUs and to broadcast processing circuit 0, and can be used to receive from broadcast processing circuit 0 the input neuron data and weights corresponding to the broadcast and/or multicast signal and pass them to the shared data reception interfaces ci in the plurality of MLUs.

Therefore, through the one-to-one connection between each machine learning unit and its unicast read processing circuit, and the one-to-one connection between each machine learning unit and its broadcast processing circuit, this embodiment can implement targeted one-to-one data operation processing, which lowers the complexity of the access logic for data operations, reduces conflicts, and improves processing efficiency.
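The one-to-one wiring described for FIG. 45 can be pictured with a small routing table. The following is only an illustrative sketch: the interface labels d, e, f and g come from the text, while the class name, function names and circuit name strings are hypothetical and not part of the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupInterfaces:
    """Interfaces of the second transmission interface serving one MLU (labels from the text)."""
    unicast_signal_rx: str    # d_i: unicast read signal receiving interface
    unicast_data_tx: str      # e_i: unicast read data sending interface
    broadcast_signal_rx: str  # f_i: broadcast signal receiving interface
    broadcast_data_tx: str    # g_i: broadcast data sending interface


def build_groups(num_mlus):
    """Create one interface group per machine learning unit (hypothetical helper)."""
    return {i: GroupInterfaces(f"d{i}", f"e{i}", f"f{i}", f"g{i}") for i in range(num_mlus)}


def route_signal(groups, mlu, kind):
    """Return (receiving interface, processing circuit) for a signal sent by MLU `mlu`."""
    group = groups[mlu]
    if kind == "unicast_read":
        return group.unicast_signal_rx, f"unicast_read_circuit_{mlu}"
    if kind in ("broadcast", "multicast"):
        return group.broadcast_signal_rx, f"broadcast_circuit_{mlu}"
    raise ValueError(f"unsupported signal kind: {kind}")
```

Because every MLU owns its own group, routing needs no lookup beyond the index of the sending MLU, which is one way to read the claim that the access logic becomes simpler.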

Referring to FIG. 46, in an optional solution, the number of interfaces in the transmission circuit may be reduced on the basis of the data processing device shown in FIG. 3. Specifically, the read/write processing circuit 121 may include one broadcast processing circuit and a plurality of unicast read processing circuits. The plurality of unicast read processing circuits are connected one-to-one to the plurality of machine learning units, and the broadcast processing circuit is connected one-to-many to the plurality of machine learning units. For example, MLU0 corresponds to unicast read processing circuit 0 and the broadcast processing circuit, and MLUn corresponds to unicast read processing circuit n and the broadcast processing circuit. Similarly, the second transmission interface has a group of interfaces connected to one unicast read processing circuit and to one machine learning unit, which implements the one-to-one connection between the machine learning unit and the unicast read processing circuit. The second transmission interface further has a group of interfaces connected to the one broadcast processing circuit and to the plurality of machine learning units, which implements the many-to-one connection between the machine learning units and the broadcast processing circuit. Specifically, the second transmission interface may include a group of broadcast interfaces connected to the broadcast processing circuit. The broadcast interfaces may include a broadcast signal receiving interface and a broadcast data sending interface, and the plurality of machine learning units are connected to the broadcast processing circuit through this group of broadcast interfaces.

For example, for the plurality of MLUs and the broadcast processing circuit, interface dn+1 in the second transmission interface is a broadcast signal receiving interface, and may be used to receive the broadcast and/or multicast signals sent by the plurality of MLUs and forward them to the broadcast processing circuit for processing. Interface en+1 in the second transmission interface is a broadcast data sending interface, and may be used to receive, from the broadcast processing circuit, the input neuron data and weights corresponding to the broadcast and/or multicast signal and forward them to the shared data receiving interfaces in the plurality of MLUs.

As can be seen, the plurality of machine learning units can share one broadcast processing circuit and, at the same time, share one group consisting of a broadcast signal receiving interface and a broadcast data sending interface. Therefore, the data processing device of this embodiment can reduce the number of interfaces for returned data in the machine learning units and the number of interfaces in the transmission circuit, which further saves hardware resources and reduces hardware area and power consumption.

Referring to FIG. 47, in an optional solution, the number of interfaces in the transmission circuit is further reduced on the basis of FIG. 46. The second transmission interface 120 may include a plurality of groups of unicast read signal receiving interfaces and shared data sending interfaces connected one-to-one to the plurality of unicast read processing circuits, and a broadcast signal receiving interface connected to the broadcast processing circuit. The shared data sending interfaces are also connected to the broadcast processing circuit. Each unicast read signal receiving interface is connected to the unicast read signal sending interface of a machine learning unit, the broadcast signal receiving interface is connected to the broadcast signal sending interfaces of the machine learning units, and each shared data sending interface is connected to the shared data receiving interface of a machine learning unit.

For example, for unicast read processing circuit 0, the second transmission interface includes a group consisting of unicast read signal receiving interface d0 and shared data sending interface e0, connected one-to-one to unicast read processing circuit 0. Unicast read signal receiving interface d0 is connected to the unicast read signal sending interface a0 in MLU0, and shared data sending interface e0 is connected to the shared data receiving interface c0 in MLU0. For unicast read processing circuit n, the second transmission interface includes a group consisting of unicast read signal receiving interface dn and shared data sending interface en, connected one-to-one to unicast read processing circuit n. Unicast read signal receiving interface dn is connected to the unicast read signal sending interface an in MLUn, and shared data sending interface en is connected to the shared data receiving interface cn in MLUn. The second transmission interface further includes broadcast signal receiving interface dn+1, which is connected to the broadcast processing circuit and to the broadcast signal sending interface of each MLU (for MLUi, interface bi). The most important point is that every shared data sending interface ei in the transmission circuit is also connected to the broadcast processing circuit, so that it can receive, from the broadcast processing circuit, the input neuron data and weights corresponding to a broadcast and/or multicast signal and forward them to the shared data receiving interfaces ci in the plurality of MLUs. As can be seen, each unicast read processing circuit i in the transmission circuit shares its shared data sending interface ei with the broadcast processing circuit, and the data path formed by the shared data receiving interface ci in MLUi and the shared data sending interface ei in the transmission circuit can carry unicast read data as well as broadcast and/or multicast data between MLUi and the transmission circuit.

As can be seen, because each of the plurality of unicast read processing circuits shares a data sending interface with the broadcast processing circuit, the data processing device of this embodiment reduces the number of interfaces in the transmission circuit, which further saves hardware resources and reduces hardware area and power consumption.
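To make the saving concrete, the interfaces named in the text for FIG. 45, FIG. 46 and FIG. 47 can be tallied for a given number of machine learning units. This is only a rough sketch: it counts just the read-side and broadcast-side interfaces of the second transmission interface that the description enumerates (d, e, f, g), and the function name and dictionary keys are hypothetical.

```python
def interface_counts(num_mlus):
    """Tally the interfaces of the second transmission interface named in the text.

    Assumption: only the unicast-read and broadcast interfaces listed in the
    description are counted; any other interfaces of the real circuit are ignored.
    """
    # FIG. 45: each MLU has its own d_i, e_i, f_i and g_i.
    per_mlu_groups = 4 * num_mlus
    # FIG. 46: d_i and e_i per MLU, plus one shared broadcast pair (d_{n+1}, e_{n+1}).
    shared_broadcast_circuit = 2 * num_mlus + 2
    # FIG. 47: d_i and e_i per MLU, plus one broadcast signal receiving interface;
    # broadcast data reuses the shared data sending interfaces e_i.
    shared_data_sending = 2 * num_mlus + 1
    return {
        "fig45": per_mlu_groups,
        "fig46": shared_broadcast_circuit,
        "fig47": shared_data_sending,
    }


counts = interface_counts(4)
```

Under these counting assumptions the tally shrinks from FIG. 45 to FIG. 46 to FIG. 47 for any number of units, which matches the direction of the area and power argument in the text.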

With the continuous development of information technology, the demands on data access and data processing keep rising, and the requirements placed on the processors that process and access that data are becoming stricter. Taking general-purpose processors as an example, multi-core processors composed of a plurality of general-purpose processor cores (for example, CPU cores) have become mainstream because of their strong parallel computing capability.

However, as machine learning algorithms continue to develop, machine learning chips with more and more architectures are emerging. Such chips frequently need to access or process data in shared storage in several ways, such as unicast reading and broadcasting, and therefore need a corresponding plurality of transmission interfaces, which increases the area of the machine learning chip.

Therefore, simplifying the transmission interfaces of a machine learning chip in order to reduce its area is a technical problem that those skilled in the art urgently need to solve.

To solve the above problem, the present application provides the following technical solutions.

Referring to FIG. 49, the machine learning unit is described in detail by taking machine learning unit 0 in FIG. 48 as an example. In one solution, the machine learning unit 15 may include at least one sending interface 141, at least one receiving interface 142, at least one operation unit 151, and a controller unit 152 connected to the operation unit 151. The operation unit 151 includes one master processing circuit 151a and a plurality of slave processing circuits 151b, and is connected to the transmission circuit 12 through the at least one sending interface 141 and the at least one receiving interface 142.

The controller unit 152 is configured to send the data operation signal and the output neuron data to the transmission circuit 12 through the at least one sending interface 141, to receive through the at least one receiving interface 142 the input neuron data and the weights that the transmission circuit 12 obtained from the shared memory 13, and then to send the input neuron data and the weights to the master processing circuit 151a and/or the slave processing circuits 151b.

The master processing circuit 151a is configured to send the input neuron data and/or the weights to the plurality of slave processing circuits 151b. The plurality of slave processing circuits 151b are configured to perform intermediate operations in parallel on the neuron data and the weights to obtain a plurality of intermediate results, and to send the plurality of intermediate results to the master processing circuit 151a. The master processing circuit 151a is further configured to perform subsequent processing on the plurality of intermediate results to obtain a computation result. The subsequent processing may include an activation operation. Specifically, the controller unit 152 may obtain a computation instruction, parse the computation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions to the master processing circuit.

It can be understood that, in this embodiment, when the machine learning unit includes a plurality of operation units, the operation units may share the at least one sending interface and the at least one receiving interface.

For example, in an optional technical solution, the master processing circuit may include a controller unit, and this controller unit may include a master instruction processing unit that is specifically used to decode operation instructions into micro-instructions. In another optional technical solution, the slave processing circuit may include another controller unit, and this other controller unit includes a slave instruction processing unit that is specifically used to receive and process micro-instructions. A micro-instruction may be the next-level instruction below an instruction. It can be obtained by splitting or decoding the instruction, and can be further decoded into control signals for individual components, units or processing circuits. For example, a multiplication micro-instruction is the next-level instruction below a convolution instruction.

By way of example, the neural network operation procedure of the machine learning unit is described in detail below, taking the structure of the above machine learning unit as an example. See steps S6101 to S6106.

S6101: One IO instruction is stored in advance at the first address of the instruction storage unit of the controller unit.

S6102: The controller unit reads the IO instruction from the first address of the instruction storage unit. According to the control signal decoded from the IO instruction, it then either obtains the neural network operation instructions corresponding to the machine learning unit from off-chip memory through an off-chip interface, or obtains the neural network computation instructions corresponding to the machine learning unit from the shared memory through the transmission circuit, and stores the obtained computation instructions in the instruction storage unit.

S6103: The controller unit reads the next IO instruction from the instruction storage unit and, according to the data operation signal decoded from the IO instruction, reads all data blocks required by the operation unit from the shared memory through the transmission circuit. The data blocks include the input neuron data and weights of the neurons of the layer allocated to the unit, and may further include an interpolation table used for fast activation function computation, a constant table used to configure parameters of the operation device, offset data, and the like. The data operation signal includes the source address of the data blocks in the shared memory.

S6104: The controller unit reads the next CONFIG (configuration) instruction from the instruction storage unit and, according to the control signal decoded from the CONFIG instruction, sets the various constants required for the neural network computation of the layer. For example, the operation unit sets the values of its internal registers according to the constants required by the activation function.

S6105: The controller unit reads the next COMPUTE (computation) instruction from the instruction storage unit. According to the control signal (that is, the operation instruction) decoded from the COMPUTE instruction, the operation unit sends the input neuron data and weights of the allocated neurons of the layer, together with the operation instruction, to the master processing circuit. The master processing circuit determines the input neuron data of the allocated neurons as broadcast data and the weights as distribution data, divides the distribution data into a plurality of data blocks, and sends at least one of the plurality of data blocks, the broadcast data, and at least one of the plurality of operation instructions to the slave processing circuits. The slave processing circuits obtain intermediate results through multiplication processing circuits, accumulation processing circuits and the like, and the master processing circuit obtains the output neuron data of the allocated neurons of the layer from the intermediate results through an activation processing circuit and the like.

S6106: The controller unit reads the next IO instruction from the instruction storage unit and, according to the data operation signal decoded from the IO instruction, sends the output neuron data through the transmission circuit to the shared memory for storage, where it serves as input neuron data for some of the neurons of the next layer. The data operation signal includes the target address of the output neuron data in the shared memory.
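Steps S6101 to S6106 amount to a short instruction sequence: fetch the program, load data, configure, compute, and write back. The sketch below mimics that sequence in software under strong simplifying assumptions. The opcode names, the dictionary standing in for shared memory, the `activation_floor` constant and the ReLU-style computation are all hypothetical illustrations, not the instruction set of the application.

```python
def run_machine_learning_unit(shared_memory, program_address):
    """Simulate S6101-S6106 for one layer; `shared_memory` is a plain dict (assumption)."""
    # S6101: a single IO instruction is pre-stored at the first address.
    instruction_store = [("IO_FETCH_PROGRAM", program_address)]
    data, constants, output = None, {}, None
    pc = 0
    while pc < len(instruction_store):
        opcode, operand = instruction_store[pc]
        pc += 1
        if opcode == "IO_FETCH_PROGRAM":
            # S6102: load the computation instructions into the instruction store.
            instruction_store.extend(shared_memory[operand])
        elif opcode == "IO_READ":
            # S6103: read the data blocks from their source address.
            data = shared_memory[operand]
        elif opcode == "CONFIG":
            # S6104: set the constants needed by this layer.
            constants = dict(operand)
        elif opcode == "COMPUTE":
            # S6105: multiply-accumulate, add the offset, then apply the activation.
            x, w, b = data["x"], data["w"], data["b"]
            floor = constants.get("activation_floor", 0.0)
            output = [max(floor, sum(wi * xi for wi, xi in zip(row, x)) + b) for row in w]
        elif opcode == "IO_WRITE":
            # S6106: store the output neuron data at its target address.
            shared_memory[operand] = output
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return shared_memory


memory = {
    "prog": [("IO_READ", "in"), ("CONFIG", {"activation_floor": 0.0}),
             ("COMPUTE", None), ("IO_WRITE", "out")],
    "in": {"x": [1.0, 2.0], "w": [[1.0, 0.0], [0.0, -1.0]], "b": 0.5},
}
run_machine_learning_unit(memory, "prog")
```

After the run, the entry stored under `"out"` plays the role of the input neuron data that the next layer would read from the shared memory.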

Step S6105 is described below by way of example. Taking the fully connected operation in neural network computation as an example, the process of a given layer may be y = f(wx + b), where x is the input neuron matrix, w is the weight matrix, b is an offset scalar, and f is the activation function, which may be any one of the sigmoid, tanh, relu and softmax functions. Assuming that the master processing circuit and the slave processing circuits are in a binary tree relationship (a kind of tree relationship) and that the operation unit has one master processing circuit and eight slave processing circuits, S6105 may be implemented as follows. The controller unit obtains the input neuron matrix x, the weight matrix w and a fully connected operation instruction from the shared memory, and sends them to the master processing circuit. The master processing circuit determines the input neuron matrix x as broadcast data and the weight matrix w as distribution data, splits the weight matrix w into eight sub-matrices, distributes the eight sub-matrices to the eight slave processing circuits through the tree module, and broadcasts the input neuron matrix x to the eight slave processing circuits. The slave processing circuits perform, in parallel, the multiplication and accumulation operations of the eight sub-matrices with the input neuron matrix x to obtain eight intermediate results, and send the eight intermediate results to the master processing circuit. The master processing circuit arranges the eight intermediate results to obtain the result of wx, applies the offset b to that result, and then performs the activation operation to obtain the final result y.
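The fully connected example can be traced in a few lines of code. The sketch below is a software illustration only: it assumes the weight matrix is split row-wise into eight blocks (the text does not state the split direction), treats x as a single vector, runs the "slaves" sequentially rather than in parallel, and uses relu as the activation. All function names are hypothetical.

```python
def split_rows(matrix, parts):
    """Split a matrix (list of rows) into `parts` contiguous row blocks (assumed row-wise split)."""
    base, extra = divmod(len(matrix), parts)
    blocks, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        blocks.append(matrix[start:stop])
        start = stop
    return blocks


def slave_multiply_accumulate(block, x):
    """One slave: multiply each row of its sub-matrix with x and accumulate."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in block]


def relu(value):
    return max(0.0, value)


def fully_connected(w, x, b, activation=relu, num_slaves=8):
    """Master: distribute w, broadcast x, gather the results, add the offset, activate."""
    blocks = split_rows(w, num_slaves)                       # distribution data
    intermediate = [slave_multiply_accumulate(blk, x)        # x is the broadcast data
                    for blk in blocks]
    wx = [value for part in intermediate for value in part]  # arrange the 8 results
    return [activation(value + b) for value in wx]           # y = f(wx + b)


w = [[float(i), -1.0] for i in range(8)]
y = fully_connected(w, x=[1.0, 3.0], b=0.5)
```

Because each block only needs its own rows of w plus the shared x, the split matches the roles the text assigns to distribution data and broadcast data.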

It can be understood that, because each machine learning unit can compute in parallel the neurons of a given layer that are allocated to it, the shared memory can store the output neuron data of all neurons of each layer as well as the input neuron data required by all neurons of the next layer, while the weights can either be reused or be obtained from the shared memory as the weights of a new layer of the neural network.

The description now returns to the data processing device of this embodiment shown in FIG. 43. The data operation signal corresponding to a unicast read operation may be a unicast read instruction or a unicast read request; the data operation signal corresponding to a unicast write operation may be a unicast write instruction or a unicast write request; and the data operation signal corresponding to a broadcast operation may be a broadcast instruction, a multicast instruction, a broadcast request or a multicast request. For example, a unicast read instruction is an instruction sent by a machine learning unit to read the input neuron data and weights at a source address in the shared memory, and the input neuron data and weights are returned to that machine learning unit. These are the input neuron data and weights required by the allocated neurons while the machine learning unit computes, according to a computation instruction, the neurons of a given layer allocated to it. A unicast write instruction is an instruction sent by a machine learning unit to write the output neuron data obtained through neural network computation to a target address in the shared memory. Because the output neuron data of the previous layer can serve as the input neuron data required by the next layer, it is written to the shared memory so that each machine learning unit can obtain the input neuron data it needs from the shared memory. A broadcast instruction is an instruction sent by a machine learning unit to read the input neuron data and weights at a source address in the shared memory, and the input neuron data and weights must be returned to all machine learning units in the machine learning device. The input neuron data may be the input neuron data required by all neurons of a given layer, that is, all output neuron data of the previous layer, and the weights may be reusable weights such as a convolution kernel. A multicast instruction differs from a broadcast instruction in that its data is returned not to all machine learning units in the machine learning device but to the plurality of machine learning units corresponding to the marker field in the multicast instruction. In general, the difference between an instruction and a request is that executing an instruction has a relatively large overhead but the instruction carries more information, whereas executing a request has a relatively small overhead but the request carries less information.
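The four kinds of data operation signal differ mainly in where the data is returned. The following sketch models that distinction under assumed names: the field layout of `DataOperationSignal` is hypothetical (the text only states that signals carry an address and that a multicast instruction carries a marker field), and the marker field is represented here as a set of unit indices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataOperationSignal:
    """Hypothetical encoding of a data operation signal."""
    kind: str                                # "unicast_read", "unicast_write", "broadcast" or "multicast"
    sender: int                              # index of the machine learning unit that sent it
    address: int                             # source address (reads) or target address (writes)
    marker_field: frozenset = frozenset()    # units addressed by a multicast


def return_targets(signal, num_units):
    """Return the set of machine learning units that receive data for this signal."""
    if signal.kind == "unicast_read":
        return {signal.sender}               # data goes back to the requesting unit only
    if signal.kind == "broadcast":
        return set(range(num_units))         # data goes to every unit
    if signal.kind == "multicast":
        return set(signal.marker_field)      # data goes to the marked units only
    if signal.kind == "unicast_write":
        return set()                         # a write returns no neuron data
    raise ValueError(f"unknown signal kind: {signal.kind}")
```

For instance, a multicast sent by unit 0 with marker field {1, 3} in a four-unit device would return data to units 1 and 3 only, while the same address requested by a broadcast would reach all four units.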

In general, when a machine learning unit sends unicast read signals, unicast write signals, and broadcast and/or multicast signals, it needs at least three corresponding data operation signal sending interfaces, used respectively to send unicast read signals, unicast write signals, and broadcast and/or multicast signals to the transmission circuit. In this embodiment, when at least one machine learning unit performs at least two of the unicast read operation, the unicast write operation and the broadcast operation, these operations share one sending interface of that machine learning unit. Referring to FIG. 48, the at least one sending interface 141 of machine learning unit 0 may include two data operation signal sending interfaces, interface a0 and interface b0. In one embodiment, interface a0 may be a unicast read signal sending interface, and interface b0 may be a signal sending interface shared by unicast write signals and broadcast and/or multicast signals. In one embodiment, interface a0 may be a unicast write signal sending interface, and interface b0 may be a signal sending interface shared by unicast read signals and broadcast and/or multicast signals. In one embodiment, interface a0 may be a broadcast and/or multicast signal sending interface, and interface b0 may be a signal sending interface shared by unicast read signals and unicast write signals. In addition, in an optional solution, when at least one machine learning unit performs the unicast read operation, the unicast write operation and the broadcast operation, all three may share one sending interface of that machine learning unit; that is, this sending interface can send unicast read signals, unicast write signals, and broadcast and/or multicast signals.

Therefore, in the data processing device of this embodiment, at least one machine learning unit can share one of its sending interfaces when performing at least two of the unicast read operation, the unicast write operation, and the broadcast operation. This effectively reduces the number of data operation signal sending interfaces in the machine learning unit and saves hardware resources, which in turn reduces hardware area and power consumption.

In one optional scheme, corresponding to the unicast read operation, the unicast write operation, and the broadcast operation, and referring to FIG. 3, the transmission circuit 12 of the data processing device may, on the basis of FIG. 48, include a second transmission interface 120, a read/write processing circuit 121 connected to the second transmission interface 120, and an arbitration circuit 122 connected to the read/write processing circuit 121. The read/write processing circuit 121 is configured to receive the data operation signal sent by the at least one machine learning unit 15 through the at least one sending interface 141 and the second transmission interface 120, to transmit the data operation signal to the arbitration circuit 122, and to return the data that the arbitration circuit 122 obtains from the shared memory 13 to the machine learning unit corresponding to the data operation signal through the second transmission interface 120 and the at least one receiving interface 142. The arbitration circuit 122 is configured to arbitrate the data operation signals received from the read/write processing circuit 121 according to a preset arbitration rule, and to operate on the data in the shared memory 13 according to the data operation signal that wins arbitration.
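As a rough software analogy of the flow just described (signal received, arbitrated, applied to the shared memory, data returned), the sketch below models one cycle of the transmission circuit. It is a minimal sketch under stated assumptions: the dictionary fields (`op`, `priority`, `address`, `data`, `targets`) are hypothetical, and "highest priority wins" stands in for whatever preset arbitration rule is actually used.

```python
from typing import Dict, List, Optional


def arbitrate(pending: List[dict]) -> Optional[dict]:
    """Pick the pending signal with the highest priority (assumed rule)."""
    if not pending:
        return None
    return max(pending, key=lambda s: s["priority"])


def transmission_cycle(pending: List[dict],
                       shared_memory: Dict[int, int]) -> Optional[dict]:
    """Arbitrate once and apply the winning signal to the shared memory."""
    winner = arbitrate(pending)
    if winner is None:
        return None
    pending.remove(winner)
    if winner["op"] == "write":
        # A unicast write stores data and returns nothing to any unit.
        shared_memory[winner["address"]] = winner["data"]
        return {"to": [], "data": None}
    # Reads and broadcasts fetch data and return it to the target unit(s).
    return {"to": winner["targets"], "data": shared_memory[winner["address"]]}
```

Signals that lose arbitration stay in `pending` and compete again in the next cycle.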

Specifically, the read/write processing circuit 121 can process unicast read signals, and can also process broadcast signals and/or multicast signals. In one embodiment, the read/write processing circuit 121 may include a unicast read processing circuit and a unicast write processing circuit, where the unicast read processing circuit can process unicast read signals as well as broadcast signals and/or multicast signals. Take as an example the case where the unicast write operation and the broadcast operation performed by a machine learning unit share one sending interface of that unit, so that the at least one sending interface includes a unicast read signal sending interface and a shared signal sending interface. When the unicast read processing circuit processes a broadcast and/or multicast signal, it can receive the broadcast and/or multicast signal sent by at least one machine learning unit through the shared signal sending interface and the second transmission interface, transmit that signal to the arbitration circuit, and send the data that the arbitration circuit obtains from the shared memory, through the second transmission interface and the at least one receiving interface and in a preset order, to the plurality of machine learning units corresponding to the broadcast and/or multicast signal. The preset order is the order in which data is returned to the plurality of machine learning units; it may follow the priority of each machine learning unit, the numbering of the machine learning units, or some other order.
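The preset return order mentioned above can be illustrated with a short sketch. The function name `return_order` and the convention that a larger number means higher priority are assumptions for the example; the text only states that the order may follow unit priority, unit number, or another order.

```python
from typing import Dict, List, Optional


def return_order(targets: List[int],
                 priorities: Optional[Dict[int, int]] = None) -> List[int]:
    """Order in which broadcast/multicast data is returned to target units.

    Without priorities, units are served in ascending unit number.
    With priorities (assumed: larger value = higher priority), units are
    served from highest to lowest priority, ties broken by unit number.
    """
    if priorities is None:
        return sorted(targets)
    return sorted(targets, key=lambda unit: (-priorities[unit], unit))
```

For instance, `return_order([2, 0, 1])` serves the units by number, while passing a priority table reorders them by priority first.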

Alternatively, the read/write processing circuit 121 may include a unicast read processing circuit, a unicast write processing circuit, and a broadcast processing circuit, where the unicast read processing circuit processes unicast read signals, the unicast write processing circuit processes unicast write signals, and the broadcast processing circuit processes broadcast signals and/or multicast signals.

The same example is used here, in which the unicast write operation and the broadcast operation performed by a machine learning unit share one sending interface of that unit. The unicast read processing circuit may be used to receive the unicast read signal sent by at least one machine learning unit through the unicast read signal sending interface and the second transmission interface, to transmit the unicast read signal to the arbitration circuit, and to send the data that the arbitration circuit obtains from the shared memory to the machine learning unit corresponding to the unicast read signal through the second transmission interface and the at least one receiving interface. The unicast write processing circuit may be used to receive the unicast write signal sent by at least one machine learning unit through the shared signal sending interface and the second transmission interface, to transmit the unicast write signal to the arbitration circuit, and to write the unicast write data corresponding to the unicast write signal into the shared memory. The broadcast processing circuit may be used to receive the broadcast and/or multicast signal sent by at least one machine learning unit through the shared signal sending interface and the second transmission interface, to transmit the broadcast and/or multicast signal to the arbitration circuit, and to send the data that the arbitration circuit obtains from the shared memory to the plurality of machine learning units corresponding to the broadcast and/or multicast signal through the second transmission interface and the at least one receiving interface. Note that a unicast write signal may generally contain the unicast write data; alternatively, the unicast write data may be transmitted over the same data path after the unicast write signal has been sent.

Here, the preset arbitration rule lets the arbitration circuit determine the priorities of a plurality of data operation signals according to a certain rule, so that it can decide which signal to act on based on the priority of each data operation signal. That is, it selects the data operation signal with high priority as the signal that wins arbitration. For example, data operation signals with a high transmission rate may be given high priority, and data operation signals with a low transmission rate may be given low priority. By way of example, the preset arbitration rule may be a round-robin scheduling rule, a maximum carrier-to-interference rate scheduling rule, a proportional fair rule, or the like. The arbitration circuit may also use whether the data path (interface to interface) between the machine learning unit and the read/write processing circuit is idle as an auxiliary arbitration rule; that is, the data path corresponding to the data operation signal that wins arbitration is idle.
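To make the combination of a round-robin rule with the idle-path auxiliary rule concrete, the following is a minimal sketch. The class name `RoundRobinArbiter` and its `grant` method are hypothetical, and the exact way the two rules interact in hardware is an assumption: here a requester wins only if it is requesting and its data path is idle, and the search starts just after the previous winner.

```python
from typing import Optional, Sequence


class RoundRobinArbiter:
    """Round-robin arbitration with an idle-data-path auxiliary check."""

    def __init__(self, num_requesters: int):
        self.n = num_requesters
        self.next_start = 0  # requester examined first in the next round

    def grant(self, requesting: Sequence[bool],
              path_idle: Sequence[bool]) -> Optional[int]:
        """Return the index of the winning requester, or None if none wins."""
        for offset in range(self.n):
            i = (self.next_start + offset) % self.n
            if requesting[i] and path_idle[i]:
                # The winner goes to the back of the rotation.
                self.next_start = (i + 1) % self.n
                return i
        return None
```

A requester whose data path is busy is skipped for that round but keeps its place in the rotation, so it is not starved once its path becomes idle.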

The unicast read processing circuit may be connected to a plurality of machine learning units to process their unicast read operations. It may cache a plurality of unicast read instructions in a unicast read instruction cache queue within the unicast read processing circuit, parse each unicast read instruction to obtain the corresponding unicast read request, and store that request in a unicast read request cache queue within the unicast read processing circuit for arbitration by the arbitration circuit. A unicast read request, by contrast, can be cached in the unicast read request cache queue without any parsing. Similarly, the broadcast processing circuit may also be connected to a plurality of machine learning units through the second transmission interface, and may include a broadcast and/or multicast instruction cache queue and a broadcast and/or multicast request cache queue. Likewise, the unicast write processing circuit may be connected to a plurality of machine learning units through the second transmission interface, and may include a unicast write instruction cache queue and a unicast write request cache queue; the details are not repeated here. In one optional scheme, the read/write processing circuit may include one unicast read processing circuit, one unicast write processing circuit, and one broadcast processing circuit.
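The two-queue arrangement described above can be sketched in software as follows. This is only an illustration: the class name, the `kind`, `source`, `address`, and `length` fields, and the trivial "parsing" step are all hypothetical, since the text does not specify the instruction or request formats.

```python
from collections import deque


class UnicastReadProcessingCircuit:
    """Instruction cache queue feeding a request cache queue."""

    def __init__(self):
        self.instruction_queue = deque()  # instructions waiting to be parsed
        self.request_queue = deque()      # requests waiting for arbitration

    def receive(self, signal: dict) -> None:
        """Cache an incoming signal in the queue matching its kind."""
        if signal["kind"] == "instruction":
            self.instruction_queue.append(signal)
        else:
            # Requests need no parsing and go straight to the request queue.
            self.request_queue.append(signal)

    def parse_pending_instructions(self) -> None:
        """Parse each cached instruction into a request (assumed format)."""
        while self.instruction_queue:
            ins = self.instruction_queue.popleft()
            self.request_queue.append({
                "kind": "request",
                "source": ins["source"],
                "address": ins["address"],
                "length": ins["length"],
            })
```

The broadcast and unicast write processing circuits would follow the same pattern with their own pairs of queues.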

Therefore, in this embodiment the unicast read operation is handled by the unicast read processing circuit, the unicast write operation by the unicast write processing circuit, and the broadcast operation by the broadcast processing circuit. Handling each type of data operation in its own processing circuit simplifies the processing logic.

Alternatively, referring to FIG. 50, at least one machine learning unit shares one sending interface of that unit when performing the unicast write operation and the broadcast operation. That is, the at least one sending interface 141 may include a shared signal sending interface, shared by the unicast write operation and the broadcast operation, and a unicast read signal sending interface. For example, for MLU0, interface a0 is the unicast read signal sending interface; interface b0 is the shared signal sending interface, which can send unicast write signals and broadcast and/or multicast signals; interface c0 is the unicast read data receiving interface; and interface d0 is the broadcast and/or multicast data receiving interface. For convenience of description, the following embodiments all take as an example the case where at least one machine learning unit shares one sending interface of that unit when performing the unicast write operation and the broadcast operation. The following embodiments can, of course, also be applied to schemes that share other signal sending interfaces.

In one optional scheme, referring to FIG. 50, the read/write processing circuit may be divided into a plurality of processing circuit groups. One machine learning unit corresponds to one processing circuit group, and each processing circuit group includes one unicast read processing circuit, one unicast write processing circuit, and one broadcast processing circuit. For example, MLU0 corresponds to unicast read processing circuit 0, unicast write processing circuit 0, and broadcast processing circuit 0, and MLUn corresponds to unicast read processing circuit n, unicast write processing circuit n, and broadcast processing circuit n. Likewise, the second transmission interface contains a group of interfaces connected to one processing circuit group on one side and to one machine learning unit on the other. This group implements a one-to-one connection between the machine learning unit and the unicast read processing circuit, between the machine learning unit and the unicast write processing circuit, and between the machine learning unit and the broadcast processing circuit.

Specifically, referring to FIG. 50, the second transmission interface 120 includes a plurality of interface groups, and one processing circuit group corresponds to one interface group. One interface group includes a unicast read signal receiving interface and a unicast read data sending interface connected to the unicast read processing circuit, a unicast write signal receiving interface connected to the unicast write processing circuit, and a broadcast signal receiving interface and a broadcast data sending interface connected to the broadcast processing circuit.

For example, for MLU0, the interface group corresponding to its processing circuit group includes interface e0, interface f0, interface g0, interface h0, and interface i0. For MLU0 and unicast read processing circuit 0, interface e0 in the second transmission interface serves as the unicast read signal receiving interface. It is connected to the unicast read signal sending interface a0 of MLU0 and to unicast read processing circuit 0, and can receive the unicast read signal sent by MLU0 and pass it to unicast read processing circuit 0 for processing. Interface f0 in the second transmission interface serves as the unicast read data sending interface. It is connected to the unicast read data receiving interface c0 of MLU0 and to unicast read processing circuit 0, and can receive from unicast read processing circuit 0 the input neuron data and weights corresponding to the unicast read signal and pass them to interface c0 in MLU0. For MLU0 and unicast write processing circuit 0, interface g0 in the second transmission interface serves as the unicast write signal receiving interface. It is connected to the shared signal sending interface b0 of MLU0 and to unicast write processing circuit 0, and can receive the unicast write signal sent by MLU0 and pass it to unicast write processing circuit 0 for processing. For MLU0 and broadcast processing circuit 0, interface h0 in the second transmission interface serves as the broadcast signal receiving interface. It is connected to the shared signal sending interface b0 of MLU0 and to broadcast processing circuit 0, and can receive the broadcast and/or multicast signal sent by MLU0 and pass it to broadcast processing circuit 0 for processing. Interface i0 in the second transmission interface serves as the broadcast data sending interface. It is connected to the broadcast data receiving interfaces di of the plurality of MLUs and to broadcast processing circuit 0, and can receive from broadcast processing circuit 0 the input neuron data and weights corresponding to the broadcast and/or multicast signal and pass them to the broadcast data receiving interfaces di in the plurality of MLUs.
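The wiring of one interface group described above can be written down as a small table. The sketch below builds that table for processing circuit group i; the function name, the dictionary layout, and the generalization from MLU0 to an arbitrary index are assumptions for illustration.

```python
from typing import Dict


def interface_group(i: int, num_mlus: int) -> Dict[str, dict]:
    """Connections of interface group i in the second transmission interface.

    Each entry lists the interface's role, the machine learning unit
    interface(s) it connects to, and the processing circuit it serves.
    """
    broadcast_receivers = [f"d{k}" for k in range(num_mlus)]
    return {
        f"e{i}": {"role": "unicast read signal receiving",
                  "mlu_side": [f"a{i}"],
                  "circuit": f"unicast read processing circuit {i}"},
        f"f{i}": {"role": "unicast read data sending",
                  "mlu_side": [f"c{i}"],
                  "circuit": f"unicast read processing circuit {i}"},
        f"g{i}": {"role": "unicast write signal receiving",
                  "mlu_side": [f"b{i}"],
                  "circuit": f"unicast write processing circuit {i}"},
        f"h{i}": {"role": "broadcast signal receiving",
                  "mlu_side": [f"b{i}"],
                  "circuit": f"broadcast processing circuit {i}"},
        # Broadcast data fans out to the receiving interfaces of all MLUs.
        f"i{i}": {"role": "broadcast data sending",
                  "mlu_side": broadcast_receivers,
                  "circuit": f"broadcast processing circuit {i}"},
    }
```

Note that g and h both connect to the shared signal sending interface b of the same unit, which is what makes them candidates for merging in a later scheme.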

Therefore, through the one-to-one connection between the machine learning unit and the unicast read processing circuit, between the machine learning unit and the unicast write processing circuit, and between the machine learning unit and the broadcast processing circuit, this embodiment can process data operations in a targeted one-to-one manner. This lowers the complexity of the access logic for data operations and reduces conflicts, which improves processing efficiency.

In one optional scheme, referring to FIG. 51, the number of interfaces in the transmission circuit is reduced relative to FIG. 50. The unicast write processing circuit and the broadcast processing circuit in one processing circuit group share one shared signal receiving interface in the corresponding interface group. The shared signal receiving interface corresponding to the processing circuit group is connected to the shared signal sending interface of the machine learning unit corresponding to that group, and the unicast read signal receiving interface in the processing circuit group is connected to the unicast read signal sending interface of the machine learning unit corresponding to that group. Referring to FIG. 4, in the processing circuit group corresponding to MLU0, the unicast write processing circuit and the broadcast processing circuit share one shared signal receiving interface g0. Shared signal receiving interface g0 is connected to the shared signal sending interface b0 in MLU0; it receives the unicast write signals and broadcast and/or multicast signals sent from b0, and passes them to unicast write processing circuit 0 and broadcast processing circuit 0 for processing. It follows that, in the transmission circuit, unicast write processing circuit i and broadcast processing circuit i share the shared signal receiving interface gi, and the data path formed by the shared signal sending interface bi in MLUi and the shared signal receiving interface gi in the transmission circuit can carry unicast write signals and broadcast and/or multicast signals between MLUi and the transmission circuit.

As can be seen, because the unicast write processing circuit and the broadcast processing circuit in one processing circuit group share a signal receiving interface, the data processing device of this embodiment reduces both the number of data operation signal sending interfaces in the machine learning unit and the number of interfaces in the transmission circuit. This saves further hardware resources and reduces hardware area and power consumption.

In one embodiment, the shared signal receiving interface corresponding to the processing circuit group is connected to both the unicast write processing circuit and the broadcast processing circuit in that group. It receives the data operation signal sent from the shared signal sending interface of the machine learning unit, splits it into two identical data operation signals, and sends one to the unicast write processing circuit and the other to the broadcast processing circuit. Referring to FIG. 51 and taking shared signal receiving interface g0 as an example, g0 splits the received data operation signal (a unicast write signal or a broadcast and/or multicast signal) into two identical data operation signals and sends them to unicast write processing circuit 0 and broadcast processing circuit 0 for processing. For example, the shared signal receiving interface may be connected to unicast write processing circuit 0 and to broadcast processing circuit 0 through hardware circuits, so that one data operation signal is split into two identical data operation signals. The data operation signal may be a high-level or low-level signal.

Each processing circuit can parse the data operation signal to determine its type. If it is a unicast write signal, the unicast write processing circuit processes it and the broadcast processing circuit does not; if it is a broadcast and/or multicast signal, the broadcast processing circuit processes it and the unicast write processing circuit does not. Specifically, each processing circuit can determine the type of the signal from the operation code of the data operation signal: for example, "write" indicates a unicast write signal and "cast" indicates a broadcast and/or multicast signal. Alternatively, the type can be determined from the number of machine learning units (data return targets) marked in the marker field. For example, 0 return targets indicates a unicast write signal, 1 return target indicates a unicast read signal, multiple return targets (fewer than n+1) indicates a multicast signal, and n+1 return targets indicates a broadcast signal.
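The two ways of determining the signal type, and the way each processing circuit filters its copy of a shared signal, can be sketched as follows. The function names and the dictionary-based signal format are hypothetical; `total_units` stands for the n+1 machine learning units in the text.

```python
from typing import Dict


def classify_by_opcode(opcode: str) -> str:
    """Map the operation code to a signal type ("write" or "cast")."""
    kinds = {"write": "unicast write", "cast": "broadcast and/or multicast"}
    return kinds[opcode]


def classify_by_targets(num_targets: int, total_units: int) -> str:
    """Map the number of data return targets in the marker field to a type."""
    if num_targets < 0 or num_targets > total_units:
        raise ValueError("number of return targets out of range")
    if num_targets == 0:
        return "unicast write"
    if num_targets == 1:
        return "unicast read"
    if num_targets < total_units:
        return "multicast"
    return "broadcast"


def dispatch_shared_signal(signal: dict) -> Dict[str, bool]:
    """Give both circuits an identical copy; report which one processes it."""
    write_copy, cast_copy = dict(signal), dict(signal)
    return {
        "unicast write processing circuit":
            classify_by_opcode(write_copy["opcode"]) == "unicast write",
        "broadcast processing circuit":
            classify_by_opcode(cast_copy["opcode"])
            == "broadcast and/or multicast",
    }
```

With four units (n = 3), a marker field naming three targets is classified as multicast and one naming all four as broadcast.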

In one optional scheme, referring to FIG. 52, the number of interfaces in the machine learning unit is further reduced on the basis of FIG. 51. The machine learning unit shares one of its data receiving interfaces when performing the unicast read operation and the broadcast operation. That is, the data returned by the unicast read processing circuit and by the broadcast processing circuit in the processing circuit group share one shared data receiving interface in the machine learning unit. Compared with FIG. 51, and taking MLU0 as an example, the at least one receiving interface 142 includes a single interface c0 instead of the former interface c0 and interface d0. Interface c0 in FIG. 52 can be connected to interface f0 of the processing circuits to receive the unicast read data returned by unicast read processing circuit 0, and can also be connected to the plurality of interfaces ii of the processing circuits to receive the broadcast and/or multicast data returned by the plurality of broadcast processing circuits i.

Therefore, in the data processing device of this embodiment, at least one machine learning unit shares one of its data receiving interfaces when performing the unicast read operation and the broadcast operation. This further reduces the number of interfaces in the machine learning unit that receive returned data, saving hardware resources and reducing hardware area and power consumption.

Furthermore, referring to FIG. 53, the number of interfaces in the transmission circuit is further reduced on the basis of FIG. 52. The unicast read processing circuit and the broadcast processing circuit in one processing circuit group share one shared data sending interface in the corresponding interface group, and the shared data sending interface corresponding to the processing circuit group is connected to the shared data receiving interface of the machine learning unit corresponding to that group. Compared with FIG. 52, in the processing circuit group corresponding to MLU0, the unicast read processing circuit and the broadcast processing circuit share one shared data sending interface i0. Interface i0 in FIG. 53 can be connected to unicast read processing circuit 0 to receive the unicast read data it returns, and can also be connected to the plurality of broadcast processing circuits i to receive the broadcast and/or multicast data they return.
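The successive reductions from FIG. 50 to FIG. 53 can be tallied per machine learning unit and per interface group. The table below is a reading of the interface labels named in the text, not a count taken from the figures themselves, so it should be treated as an assumption; in particular, it assumes no interfaces exist beyond those the text names.

```python
# Interface labels per configuration, for one machine learning unit ("mlu")
# and its interface group in the transmission circuit. Derived from the
# labels named in the description (assumption: no unnamed interfaces).
INTERFACES = {
    "FIG. 50": {"mlu": ["a", "b", "c", "d"],
                "transmission_circuit": ["e", "f", "g", "h", "i"]},
    # g becomes a shared signal receiving interface, so h is dropped.
    "FIG. 51": {"mlu": ["a", "b", "c", "d"],
                "transmission_circuit": ["e", "f", "g", "i"]},
    # c becomes a shared data receiving interface, so d is dropped.
    "FIG. 52": {"mlu": ["a", "b", "c"],
                "transmission_circuit": ["e", "f", "g", "i"]},
    # i becomes a shared data sending interface, so f is dropped.
    "FIG. 53": {"mlu": ["a", "b", "c"],
                "transmission_circuit": ["e", "g", "i"]},
}


def total_interfaces(figure: str) -> int:
    """Total interfaces for one unit plus its interface group."""
    return sum(len(labels) for labels in INTERFACES[figure].values())
```

Under this reading, each step removes exactly one interface per machine learning unit and interface group pair.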

Therefore, in the data processing apparatus according to this embodiment, the unicast read processing circuit and the broadcast processing circuit in one processing circuit group share one shared data transmission interface in the corresponding interface group. This further reduces the number of return-data interfaces in the machine learning unit and saves hardware resources, thereby reducing hardware area and power consumption.

In one optional implementation, referring to FIG. 54, on the basis of FIG. 53, the machine learning unit contains at least one computation unit that does not share a data return interface, so the at least one transmission interface of the machine learning unit may further include at least one independent data reception interface. The independent data reception interface is connected to one computation unit in the machine learning unit, and the second transmission interface further includes an independent data transmission interface connected to the independent data reception interface. The computation unit is connected to the processing circuit group corresponding to the machine learning unit through the independent data reception interface and the independent data transmission interface. For example, referring to FIG. 54, MLU0 includes a plurality of computation units, at least one of which is connected to interface j0, while the other computation units are each connected to interface c0. That is, interface c0 is the shared data reception interface shared by the other computation units, and interface j0 is an independent data reception interface. Accordingly, the second transmission interface 120 further includes an independent data transmission interface h0 connected to interface j0. In FIG. 54, the independent data transmission interface h0 is connected to unicast read processing circuit 0 and to the plurality of broadcast processing circuits i, so that it can receive unicast read data and broadcast and/or multicast data, and it sends this data through the independent data reception interface j0 to the computation unit that does not share a data return interface.

In one optional implementation, referring to FIG. 55, on the basis of FIG. 53, the machine learning units may share one broadcast processing circuit among the processing circuits. The shared broadcast processing circuit may be connected to each shared signal reception interface gi and to each shared data transmission interface ii. Therefore, in the data processing apparatus according to this embodiment, because the machine learning units can share one broadcast processing circuit, the number of broadcast processing circuits is reduced, the transmission circuit is simplified, and hardware area and power consumption are reduced.

However, as machine learning algorithms continue to develop, machine learning chips with more and more architectures are emerging, and these chips share the same problem when accessing or processing data in shared storage: the data access logic is extremely complex, which lowers data processing efficiency during machine learning.

Therefore, simplifying the data access logic of machine learning chips is a technical problem that those skilled in the art urgently need to solve.

To solve the above problem, the present application provides the following technical solutions.

First, the data processing apparatus used in the present application is introduced. Referring to FIG. 56, a data processing apparatus is provided that can be implemented in hardware or in a combination of hardware and software. The data processing apparatus is used to process machine learning data. As shown in FIG. 56, the data processing apparatus includes a machine learning device 11, a transmission circuit 12, and a shared memory 13. The machine learning device 11 is connected to the transmission circuit 12 through a first transmission interface 14, and the transmission circuit 12 is connected to the shared memory 13.

The transmission circuit 12 obtains the input data required by the machine learning device 11 from the shared memory 13 according to a data operation signal sent by the machine learning device 11, and then returns the input data to the machine learning device 11. Here, the data operation signal indicates how the data in the shared memory 13 is to be operated on.

Optionally, the machine learning device 11 performs a machine learning operation on the input data to obtain output data. Optionally, the machine learning device 11 sends the output data through the transmission circuit 12 to the shared memory 13 for storage. Specifically, when the machine learning device 11 performs a neural network operation, it performs an artificial neural network operation on input neuron data and weights to obtain output neuron data, uses the output neuron data as new input neuron data, and sends it through the transmission circuit 12 to the shared memory 13 for storage. It should be noted that the machine learning device 11, the transmission circuit 12, the shared memory 13, and the first transmission interface 14 may all be implemented as hardware circuits. For example, the transmission circuit 12 may be a broadcast bus, and the shared memory 13 may be non-volatile and/or volatile memory, including but not limited to random access memory (RAM) and cache memory. The first transmission interface 14 may correspond to one or more data I/O (input/output) interfaces or I/O pins.

Optionally, the machine learning device 11 may include one first transmission interface 14 or a plurality of first transmission interfaces 14. A first transmission interface 14 may be a sending interface or a receiving interface. When the first transmission interface 14 is a sending interface, the machine learning device 11 can send a data operation signal or data to the transmission circuit 12 connected to that sending interface. When the first transmission interface 14 is a receiving interface, the machine learning device 11 can receive data returned by the transmission circuit 12.

Here, the data operation signal indicates how the data in the shared memory 13 is to be operated on. In one optional implementation, the data operation signal may indicate a read operation on data in the shared memory 13 or a write operation on data in the shared memory 13. Accordingly, when the data operation signal sent by the machine learning device 11 is a read operation, the transmission circuit 12 locates and reads the data at the corresponding address in the shared memory 13 and returns that data to at least one machine learning device 11. When the data operation signal sent by the machine learning device 11 is a write operation, the transmission circuit 12 writes the data output by the machine learning device 11 into the shared memory 13.
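The read and write handling described above can be illustrated with a minimal software sketch. It is only a behavioral model under stated assumptions, not the hardware described in this application: the names `DataOperationSignal` and `TransmissionCircuit`, the fields `op`, `address`, `length`, and `data`, and the use of a Python list as the shared memory are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataOperationSignal:
    """Hypothetical signal layout; the application does not define its fields."""
    op: str                             # "read" or "write"
    address: int                        # start address in the shared memory
    length: int = 0                     # number of elements to read
    data: Optional[List[float]] = None  # payload for a write


class TransmissionCircuit:
    """Behavioral model: forwards read/write signals to a shared memory."""

    def __init__(self, shared_memory: List[float]):
        self.mem = shared_memory

    def handle(self, sig: DataOperationSignal) -> Optional[List[float]]:
        if sig.op == "read":
            # Locate the data at the given address and return it.
            return self.mem[sig.address:sig.address + sig.length]
        if sig.op == "write":
            # Store the output data of the machine learning device.
            self.mem[sig.address:sig.address + len(sig.data)] = sig.data
            return None
        raise ValueError(f"unknown operation: {sig.op}")


shared_memory = [0.0] * 8
circuit = TransmissionCircuit(shared_memory)
circuit.handle(DataOperationSignal(op="write", address=2, data=[1.0, 2.0]))
read_back = circuit.handle(DataOperationSignal(op="read", address=2, length=2))
```

In this sketch the machine learning device never touches the memory directly; every access goes through `handle`, which mirrors the role the text assigns to the transmission circuit.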

The input data is the data that must be input to the machine learning device 11 when it performs a machine learning operation. The above data may be initial data stored in advance in the shared memory 13, may be an intermediate result or a final result output by the machine learning device 11 while performing the machine learning operation and then written into the shared memory 13.

Optionally, the input data may include input neuron data and/or weights, which are the data that must be input to the machine learning device when it performs an artificial neural network operation. Correspondingly, the output data may include output neuron data, which is an intermediate result or a final result produced by the machine learning device when it performs the artificial neural network operation.

It should be noted that the data processing apparatus used in the present application may take at least one of the following structural forms. The machine learning device 11 may be connected to one transmission circuit 12 through a plurality of first transmission interfaces 14 and then, through that one transmission circuit 12, to one shared memory 13 in order to obtain the above data. Optionally, the machine learning device 11 may be connected to a plurality of transmission circuits 12 through a plurality of first transmission interfaces 14 and then, through those transmission circuits 12, to one shared memory 13 in order to obtain the above data. Optionally, the machine learning device 11 may be connected to one transmission circuit 12 through a plurality of first transmission interfaces 14 and then, through that one transmission circuit 12, to a plurality of shared memories 13 in order to obtain the above data.

Optionally, when the machine learning device 11 performs an artificial neural network operation on a multi-layer neural network, whether the operation is a forward operation or a backward operation, the machine learning device 11 can compute the neuron data output by each layer of the network. Specifically, for the plurality of input neuron data and the weights corresponding to the input of each layer, it performs the set of operations required by the artificial neural network operation, such as multiplication, summation, convolution, and activation, to obtain the operation result. After obtaining the output neuron data of the current layer, the machine learning device 11 uses that output neuron data as the input neuron data of the next layer and performs the artificial neural network operation again. Before doing so, it may first write the output neuron data of the current layer into the shared memory 13 through the transmission circuit 12 for storage, so that the machine learning device 11 can read the data at any time when performing the operation of the next layer.
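The layer-by-layer flow above, in which each layer's output neuron data is written back to shared storage and read again as the next layer's input, can be sketched as follows. This is an illustrative model only: a dictionary stands in for the shared memory, the key scheme `neurons_k` is hypothetical, and a fully connected layer with a ReLU activation is assumed as one example of the multiplication, summation, and activation operations the text mentions.

```python
def relu(x):
    # One example of an activation operation (assumed for illustration).
    return x if x > 0 else 0.0


def run_layer(inputs, weights):
    # Multiply-and-sum per output neuron, followed by the activation.
    return [relu(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def run_network(shared_memory, layer_weights):
    for k, weights in enumerate(layer_weights):
        inputs = shared_memory[f"neurons_{k}"]       # read input neuron data
        outputs = run_layer(inputs, weights)
        shared_memory[f"neurons_{k + 1}"] = outputs  # write back for next layer
    return shared_memory[f"neurons_{len(layer_weights)}"]


shared = {"neurons_0": [1.0, 2.0]}
weights = [
    [[1.0, 1.0], [-1.0, 0.0]],  # layer 1: two output neurons
    [[2.0, 3.0]],               # layer 2: one output neuron
]
final = run_network(shared, weights)
```

After the run, `shared` holds the neuron data of every layer, so any later step can read an intermediate result without recomputing it.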

The data processing apparatus for performing machine learning operations according to the above embodiment includes a machine learning device, a transmission circuit connected through the first transmission interface of the machine learning device, and a shared memory connected to the transmission circuit. The transmission circuit obtains the input data required by the machine learning device from the shared memory according to the data operation signal sent by the machine learning device, and then returns the input data to the machine learning device. Because a large amount of data is shared when machine learning operations are performed, the data processing apparatus used in the present application provides a corresponding transmission circuit through which the machine learning device reads data from and writes data to the shared memory. A conventional CPU that accesses memory directly has the following problem: when the CPU performs parallel operations, the parallel data access logic is complex, so blocking and deadlock occur easily. Compared with a conventional CPU, the data processing apparatus used in the present application simplifies the logic by which the machine learning device accesses data in the shared memory and improves data access efficiency, which in turn increases the speed of the machine learning operations performed by the machine learning device.

FIG. 56A is a structural diagram of a machine learning device according to an embodiment of the present application. On the basis of the above embodiment, referring to FIG. 56A, the machine learning device 11 includes at least one machine learning unit 15, and each machine learning unit 15 includes at least one operation unit 151 and a controller unit 152 connected to the operation unit 151. The operation unit 151 includes one master processing circuit 151a and a plurality of slave processing circuits 151b, and the operation unit 151 is connected to the transmission circuit 12 through the first transmission interface 14.

Here, the controller unit 152 sends the data operation signal and the output data to the transmission circuit 12 through the sending interface in the first transmission interface 14, receives through the receiving interface in the first transmission interface 14 the input data that the transmission circuit 12 has obtained from the shared memory 13, and sends the input data to the master processing circuit 151a and/or the slave processing circuits 151b. The master processing circuit 151a distributes the input data to the plurality of slave processing circuits 151b. The plurality of slave processing circuits 151b perform intermediate operations in parallel on the data transmitted from the master processing circuit 151a to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit 151a. The master processing circuit 151a then performs subsequent processing on the plurality of intermediate results to obtain the calculation result.
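The division of work between the master processing circuit and the slave processing circuits can be illustrated with a small sketch. It assumes, purely for illustration, that the intermediate operation is a partial dot product and that the subsequent processing in the master is a sum of the partial results; the function names are hypothetical, and the slaves are simulated one after another rather than run in parallel.

```python
def slave_partial(inputs, weights):
    # Intermediate operation performed by one slave processing circuit.
    return sum(x * w for x, w in zip(inputs, weights))


def master_compute(inputs, weights, num_slaves):
    # The master distributes contiguous chunks of the input data to the slaves.
    if not inputs:
        return 0.0
    chunk = -(-len(inputs) // num_slaves)  # ceiling division
    partials = [
        slave_partial(inputs[i:i + chunk], weights[i:i + chunk])
        for i in range(0, len(inputs), chunk)
    ]
    # Subsequent processing in the master: combine the intermediate results.
    return sum(partials)


result_two_slaves = master_compute([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0], 2)
result_three_slaves = master_compute([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0], 3)
```

The result does not depend on the number of slaves, which is what allows the work to be spread across however many slave processing circuits are available.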

Optionally, the machine learning device 11 may include one machine learning unit 15. Such a machine learning device 11 is suitable for artificial neural network operations in which the neural network structure involved contains a small number of neurons, so that one machine learning unit 15 can complete the operation of the entire neural network at once. The specific operation process is as follows. The machine learning unit 15 performs the artificial neural network operation on the input neuron data and weights corresponding to the neurons of each layer to obtain the output neuron data corresponding to the neurons of that layer, then takes the output neuron data as new input neuron data and performs the operation of the next layer, until the operation of the entire neural network is complete and the final operation result is obtained. During this process, the machine learning device 11 may send the output neuron data obtained by the machine learning unit 15 for each layer, or the final operation result, through the transmission circuit 12 to the shared memory 13 for storage.

Optionally, the machine learning device 11 may include a plurality of machine learning units 15. Such a machine learning device 11 is suitable for artificial neural network operations in which the neural network structure involved contains a large number of neurons. Take, as an example, the operation of one layer in the forward operation of a multi-layer neural network. When that layer has many neurons, in one optional calculation scheme the machine learning device 11 may use its plurality of machine learning units 15 to compute, in parallel, the output neuron data of different subsets of the neurons in the layer. For example, if one machine learning device 11 has four machine learning units 15 and one layer of the neural network has 100 neurons, the machine learning device 11 may assign 25 of the neurons to each machine learning unit 15 for processing, and each unit outputs the corresponding output neuron data. The neural network is computed in parallel in this way layer after layer. Because this calculation scheme processes the neural network computation in parallel, it can improve processing efficiency.
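The 100-neuron, four-unit example works out to 100 / 4 = 25 neurons per machine learning unit. A minimal sketch of such an assignment is given below. The function name `partition_neurons` is hypothetical, and the handling of neuron counts that do not divide evenly (giving the first units one extra neuron) is an assumption, since the text only describes the even case.

```python
def partition_neurons(num_neurons, num_units):
    """Assign contiguous neuron index ranges to machine learning units."""
    base, extra = divmod(num_neurons, num_units)
    ranges = []
    start = 0
    for unit in range(num_units):
        # Assumed policy: the first `extra` units take one additional neuron.
        size = base + (1 if unit < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


parts = partition_neurons(100, 4)
sizes = [len(p) for p in parts]
```

Each unit then computes the output neuron data only for the indices in its own range, and the four ranges together cover all 100 neurons exactly once.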

Optionally, in the machine learning unit 15, the controller unit 152 may include one instruction storage unit 152a and one instruction processing unit 152b. Optionally, the controller unit 152 may also include a plurality of instruction storage units 152a and a plurality of instruction processing units 152b.

Here, the instruction storage unit 152a is used to store all the operation instructions involved when the machine learning unit 15 performs a machine learning operation, together with the corresponding data read/write operation instructions when data read/write operations are required. The instruction processing unit 152b is used to process all the instructions in the instruction storage unit 152a, which may specifically include the following. It sends the operation instructions in the instruction storage unit 152a to the operation unit 151 so that the operation unit 151 performs the corresponding operations according to the operation instructions. It may also parse the data read/write operation instructions in the instruction storage unit 152a to obtain data operation signals and send the data operation signals to the first transmission interface 14, so that data is read from and written to the shared memory 13 through the first transmission interface 14 by means of the data operation signals.
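The two paths taken by the instruction processing unit, forwarding operation instructions to the operation unit and turning read/write instructions into data operation signals, can be sketched as a simple dispatcher. The instruction format used here (dictionaries with `kind`, `address`, and `length` keys) is entirely hypothetical; the application does not specify an encoding in this passage.

```python
def process_instructions(instructions):
    """Split stored instructions into the two paths described in the text."""
    to_operation_unit = []  # operation instructions, forwarded unchanged
    to_interface = []       # data operation signals parsed from read/write instructions
    for ins in instructions:
        if ins["kind"] == "compute":
            to_operation_unit.append(ins)
        elif ins["kind"] in ("read", "write"):
            # Parse the read/write instruction into a data operation signal.
            to_interface.append(
                {"op": ins["kind"], "address": ins["address"], "length": ins["length"]}
            )
        else:
            raise ValueError(f"unknown instruction kind: {ins['kind']}")
    return to_operation_unit, to_interface


stored = [
    {"kind": "read", "address": 0, "length": 4},
    {"kind": "compute", "name": "multiply_accumulate"},
    {"kind": "write", "address": 16, "length": 4},
]
compute_path, signal_path = process_instructions(stored)
```

Keeping the two paths separate means the operation unit never has to interpret memory instructions, and the first transmission interface only ever sees data operation signals.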

Optionally, in the machine learning unit 15, the operation unit 151 may include one master processing circuit 151a and one slave processing circuit 151b. Optionally, the operation unit 151 may also include one master processing circuit 151a and a plurality of slave processing circuits 151b. This structure is suitable for processing large amounts of data, in particular when a large number of parallel operations must be performed during machine learning computation. The operation structure of the present application can therefore increase operation speed, save operation time, and lower power consumption.

It should be noted that each slave processing circuit 151b in the above structure can directly perform parallel operations on the input data sent by the master processing circuit 151a. Optionally, each slave processing circuit 151b can also directly perform parallel operations on the input data sent by the controller unit 152.

When each operation unit 151 has one master processing circuit 151a and a plurality of slave processing circuits 151b, the structure of the master processing circuit 151a and the plurality of slave processing circuits 151b may be the same or different from one operation unit 151 to another. Specifically, the structure of the master processing circuit 151a and the plurality of slave processing circuits 151b may be one of an H-type structure, a systolic array structure, and a tree structure.

The machine learning device according to the above embodiment includes at least one machine learning unit, and each machine learning unit includes at least one operation unit and a controller unit connected to the operation unit. The operation unit includes one master processing circuit and a plurality of slave processing circuits and is connected to the transmission circuit through the first transmission interface. The controller unit in the machine learning device can send the data operation signal and the output data to the transmission circuit through the sending interface in the first transmission interface, receive through the receiving interface in the first transmission interface the input data that the transmission circuit has obtained from the shared memory, and send the input data to the master processing circuit and/or the slave processing circuits. Because the machine learning device includes a master processing circuit and a plurality of slave processing circuits, the master processing circuit can distribute the obtained data to the plurality of slave processing circuits at the same time, the slave processing circuits perform operations in parallel and return the intermediate results to the master processing circuit, and the master processing circuit operates on the intermediate results to complete the machine learning operation. Compared with the traditional approach, in which only one processing circuit in the processor used for machine learning operations operates on the data, the machine learning device of the present application performs data operations and data computation faster.

FIG. 57 is a structural diagram of a transmission circuit according to an embodiment of the present application. Referring to FIG. 57, the transmission circuit 12 includes a second transmission interface 120, at least one read/write processing circuit 121 connected to the second transmission interface 120, and an arbitration circuit 122 connected to the read/write processing circuit 121. At least one machine learning unit 15 is connected to the second transmission interface 120 through the first transmission interface 14, which implements the connection between the at least one machine learning unit 15 and the transmission circuit 12.

The read/write processing circuit 121 is used to receive the data operation signal sent by at least one machine learning unit 15 through the first transmission interface 14 and the second transmission interface 120, transmit the data operation signal to the arbitration circuit 122, and send the data read from the shared memory 13 to the at least one machine learning unit 15 through the second transmission interface 120. The arbitration circuit 122 is used to arbitrate, according to a preset arbitration rule, among the data operation signals received from the at least one read/write processing circuit 121, and to operate on the data in the shared memory 13 according to the data operation signal that wins arbitration.
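The arbitration step can be illustrated with a minimal sketch. The text only says that a preset arbitration rule is used and does not define it here, so the rule below (fixed priority, where the read/write processing circuit with the lowest index wins) is an assumption chosen for illustration, and the function name `arbitrate` is hypothetical.

```python
def arbitrate(pending):
    """Pick one data operation signal from the pending requests.

    `pending` is a list of (circuit_id, signal) pairs, one per read/write
    processing circuit that currently has a request. Assumed rule: the
    lowest circuit_id wins; the remaining requests wait for a later round.
    """
    if not pending:
        return None, []
    winner = min(pending, key=lambda request: request[0])
    waiting = [request for request in pending if request is not winner]
    return winner, waiting


requests = [(2, "read A"), (0, "write B"), (1, "read C")]
granted, waiting = arbitrate(requests)
```

Only the granted signal is applied to the shared memory in a given round, which is how a single arbiter keeps concurrent requests from several machine learning units from conflicting.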

Optionally, the transmission circuit 12 may include a plurality of second transmission interfaces 120, and a second transmission interface 120 may be a sending interface or a receiving interface. When the second transmission interface 120 is a sending interface, the transmission circuit 12 can send data to the machine learning unit 15 connected to that sending interface. When the second transmission interface 120 is a receiving interface, the transmission circuit 12 can receive the data operation signal and/or data that the machine learning unit 15 sends to that receiving interface. Optionally, the sending interface of the second transmission interface 120 is connected to the receiving interface in the first transmission interface 14, and the receiving interface of the second transmission interface 120 is connected to the sending interface in the first transmission interface 14.

Optionally, referring to FIG. 57A, the transmission circuit 12 may include a plurality of read/write processing circuits 121, whose input terminals may be connected one-to-one to the plurality of second transmission interfaces 120. Optionally, referring to FIG. 2B, the transmission circuit 12 may instead include only one read/write processing circuit 121, whose input terminal may be connected one-to-many to the plurality of second transmission interfaces 120; that is, a single read/write processing circuit 121 is connected to a plurality of second transmission interfaces 120.

Optionally, when the plurality of read/write processing circuits 121 are connected one-to-one to the plurality of second transmission interfaces 120, each read/write processing circuit 121 may send data to one machine learning unit 15 through the single second transmission interface 120 connected to it, or may send data to a plurality of machine learning units 15 through that second transmission interface 120. When the single read/write processing circuit 121 is connected one-to-many to the plurality of second transmission interfaces 120, that read/write processing circuit 121 may send data to a plurality of machine learning units 15 through the plurality of second transmission interfaces 120 connected to it, or may send data to one machine learning unit 15 through one of those second transmission interfaces 120.

Optionally, the structure of the transmission circuit 12 includes one arbitration circuit 122, whose input terminal may be connected to the plurality of read/write processing circuits 121. The output terminal of the arbitration circuit 122 may be connected to the shared memory 13, and may also be connected to other memory devices or control devices.

As can be seen from the above embodiments, the transmission circuit 12 used in the present application may include a plurality of read/write processing circuits 121, which may be of the same type or of different types. The following embodiments describe how data is transmitted, according to the type of the read/write processing circuit 121 and the type of data signal that the read/write processing circuit 121 receives.

Specifically, the read/write processing circuit 121 may include one or more of the following processing circuits: a unicast read processing circuit, a unicast write processing circuit, and a broadcast processing circuit. The data operation signal includes one or more of the following: a unicast read request, a unicast write request, a unicast read command, a unicast write command, a multicast command, and a broadcast command.

Here, a unicast-type processing circuit handles unicast-type signals. For example, the unicast read processing circuit of the above embodiment can process the corresponding unicast read request or unicast read command, and the unicast write processing circuit can process the corresponding unicast write request or unicast write command. Correspondingly, a multicast- or broadcast-type processing circuit handles multicast- or broadcast-type signals. For example, the broadcast processing circuit of the above embodiment can process the corresponding multicast command or broadcast command.

It should be noted that when the data operation signal is a command-type signal, such as the unicast read command, unicast write command, multicast command, or broadcast command of this embodiment, the read/write processing circuit 121 parses the command-type signal, generates a request-type signal from it, and transmits the request-type signal to the arbitration circuit 122. When the data operation signal is a request-type signal, such as the unicast read request or unicast write request of this embodiment, the read/write processing circuit 121 temporarily stores the request-type signal and then sends it to the arbitration circuit 122.
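The two handling paths described above can be sketched in software as follows. This is only an illustrative model, not the circuit itself: the class names, the string labels for signal kinds, and in particular the labels "multicast_request" and "broadcast_request" for the requests generated from multicast and broadcast commands are assumptions introduced for the example.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical labels. The source only says that a command-type signal is
# parsed into a request-type signal; the request names below are assumed.
COMMAND_TO_REQUEST = {
    "unicast_read_command": "unicast_read_request",
    "unicast_write_command": "unicast_write_request",
    "multicast_command": "multicast_request",
    "broadcast_command": "broadcast_request",
}
REQUEST_KINDS = {"unicast_read_request", "unicast_write_request"}


@dataclass(frozen=True)
class DataOperationSignal:
    kind: str
    address: int


class ReadWriteProcessingCircuit:
    """Toy model of read/write processing circuit 121."""

    def __init__(self):
        # Temporary storage used only for request-type signals.
        self.pending = deque()

    def to_arbiter(self, signal):
        """Return the request-type signal that is forwarded to the arbiter."""
        if signal.kind in COMMAND_TO_REQUEST:
            # Command-type signal: parse it and generate a request.
            return DataOperationSignal(COMMAND_TO_REQUEST[signal.kind],
                                       signal.address)
        if signal.kind in REQUEST_KINDS:
            # Request-type signal: buffer it, then forward it unchanged.
            self.pending.append(signal)
            return self.pending.popleft()
        raise ValueError(f"unknown data operation signal: {signal.kind}")
```

In this model a command changes kind on its way to the arbiter, while a request passes through the buffer unchanged.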

Optionally, when the data operation signal is a multicast command, the multicast command carries the identifiers of the plurality of target machine learning units that are to receive the data. When the read/write processing circuit 121 in the transmission circuit 12 receives a multicast command, it identifies the target machine learning units from the identifiers in the command and sends the data to be returned to the identified target machine learning units.

Optionally, when the data operation signal is a broadcast command, the broadcast command need not carry the identifier of any target machine learning unit. Instead, when the read/write processing circuit 121 receives a broadcast command, it sends the data that the arbitration circuit 122 obtained from the shared memory 13 to all machine learning units 15 included in the machine learning device 11.
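The difference between multicast and broadcast target selection described above can be illustrated with a short sketch. The function name, the string labels, and the use of integer unit identifiers are assumptions made for the example.

```python
def resolve_targets(kind, all_unit_ids, target_ids=()):
    """Return the machine learning units that should receive the data.

    kind         -- "multicast" or "broadcast" (hypothetical labels)
    all_unit_ids -- identifiers of every unit in the machine learning device
    target_ids   -- identifiers carried by a multicast command
    """
    if kind == "multicast":
        # Multicast: only the units named in the command receive the data.
        wanted = set(target_ids)
        return [unit for unit in all_unit_ids if unit in wanted]
    if kind == "broadcast":
        # Broadcast: no identifiers are needed, every unit receives the data.
        return list(all_unit_ids)
    raise ValueError(f"not a multicast or broadcast command: {kind}")
```

For a device with units 0 to 3, a multicast command naming units 1 and 2 resolves to those two units, while a broadcast command resolves to all four.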

Optionally, the preset arbitration rule lets the arbitration circuit 122 determine the priorities of a plurality of data operation signals according to a fixed rule, so that the arbitration circuit 122 can determine, from the priority of each data operation signal, which data operation signal wins arbitration. For example, if the transmission rate of the data operation signal sent by read/write processing circuit 1# is higher than that of the data operation signal sent by read/write processing circuit 2#, the arbitration circuit 122 may assign a high priority to the data operation signal with the higher transmission rate and a low priority to the data operation signal with the lower transmission rate. The arbitration circuit 122 then selects the high-priority data operation signal for the next operation, namely obtaining data from the shared memory 13 according to that data operation signal.
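The rate-based example above can be written as a minimal sketch. It assumes that priority is taken directly from the transmission rate, which is only the one example rule the text mentions; the class and field names are hypothetical, and tie-breaking is not specified in the source (here the first signal in the list wins a tie).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingSignal:
    source_circuit: int       # index of the read/write processing circuit
    transmission_rate: float  # arbitrary units; higher means higher priority


def arbitrate(pending):
    """Pick the winning signal: the one with the highest transmission rate."""
    if not pending:
        raise ValueError("no data operation signal to arbitrate")
    # max() keeps the first maximal element, so ties go to the earlier entry.
    return max(pending, key=lambda signal: signal.transmission_rate)
```

With a faster signal from circuit 1# and a slower one from circuit 2#, `arbitrate` returns the signal from circuit 1#.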

In the above embodiment, the transmission circuit includes a second transmission interface, at least one read/write processing circuit connected to the second transmission interface, and an arbitration circuit connected to the read/write processing circuit, and it transmits the data read from the shared memory to at least one machine learning unit through the second transmission interface. The read/write processing circuit receives the data operation signal sent by at least one machine learning unit through the first transmission interface and the second transmission interface, and forwards the data operation signal to the arbitration circuit. The arbitration circuit can therefore arbitrate the data operation signals received from the at least one read/write processing circuit according to the preset arbitration rule, and operate on the data in the shared memory according to the data operation signal that wins arbitration. In this transmission circuit, the plurality of read/write processing circuits are connected to the machine learning device through the plurality of second transmission interfaces and arbitration is performed by the arbitration circuit, so data can be transmitted effectively, and the data conflicts and blocking that would otherwise arise when the machine learning device sends a plurality of data operation signals at the same time are avoided. In addition, the transmission circuit of this embodiment can process several types of commands or requests, which greatly broadens the range of application of the data processing device.

In one embodiment, the data processing device of this embodiment may be divided into at least one cluster, each cluster including a plurality of machine learning units 15, one transmission circuit 12, and at least one shared memory 13. When there are a plurality of clusters, referring to FIG. 58, the transmission circuit 12 may further include a first-type direct memory access controller (DMA) 123 connected to the arbitration circuit 122 in its own cluster and to the shared memory 13 in that cluster, and/or a second-type DMA 124 connected to the arbitration circuit 122 in its own cluster and to the shared memory 13 in other clusters.

The first-type DMA 123 controls data interaction between the arbitration circuit 122 in the cluster and the shared memory 13 in the same cluster. The second-type DMA 124 controls data interaction between the arbitration circuit 122 in the cluster and the shared memory 13 in other clusters, and between the arbitration circuit 122 in the cluster and off-chip memory.
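The division of work between the two DMA types can be summarized as a small routing function. This is a sketch of the rule stated above only; the function name, the string labels for the target kinds, and the integer cluster identifiers are assumptions, and how the hardware actually selects a DMA is not described in this passage.

```python
def select_dma(own_cluster, target_kind, target_cluster=None):
    """Return which DMA type would carry a transfer from the arbiter.

    own_cluster    -- identifier of the cluster the arbiter belongs to
    target_kind    -- "shared_memory" or "off_chip" (hypothetical labels)
    target_cluster -- cluster of the target shared memory, if applicable
    """
    if target_kind == "off_chip":
        # Off-chip memory is reached through the second-type DMA.
        return "DMA124"
    if target_kind == "shared_memory":
        if target_cluster is None:
            raise ValueError("a shared-memory target needs a cluster id")
        # Same cluster: first-type DMA. Other cluster: second-type DMA.
        return "DMA123" if target_cluster == own_cluster else "DMA124"
    raise ValueError(f"unknown target kind: {target_kind}")
```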

Optionally, the main role of the first-type DMA 123 and the second-type DMA 124 is to control the connection of the arbitration circuit 122 to at least one shared memory 13, and to read data from or write data to the connected shared memory 13 quickly.

When the first-type DMA 123 or the second-type DMA 124 is present in the transmission circuit, referring to FIG. 59, the transmission circuit 12 used in the present application may further include a first selection transmission circuit 125 connected to the first-type DMA 123 and a second selection transmission circuit 126 connected to the second-type DMA 124. The first selection transmission circuit 125 is selectively connected to the shared memory 13 in its own cluster. The second selection transmission circuit 126 is selectively connected to the shared memory 13 in its own cluster and in other clusters, and to off-chip memory.

Optionally, the first selection transmission circuit 125 and the second selection transmission circuit 126 may be circuits such as a crossbar switch or a diverter switch, or may be circuits that control whether other circuits are connected by setting an on/off current or an on/off signal. This embodiment places no limitation on this.

Optionally, referring to FIG. 60, when the transmission circuit 12 writes data to the shared memory 13, or when the shared memory 13 returns read data to the transmission circuit 12, the transmission circuit 12 first temporarily stores the data to be written or returned while it waits to be processed. To meet this need, the transmission circuit 12 used in the present application further includes a cache circuit 127 connected to the arbitration circuit 122 and the shared memory 13. The cache circuit 127 is configured to temporarily store the data that the arbitration circuit 122 obtains from the shared memory 13 and the data that the arbitration circuit 122 writes to the shared memory 13.

Optionally, the cache circuit 127 provides a buffer for data exchange. The cache circuit may be a random access memory (RAM); this belongs to the prior art and is not described further here.

In the data processing device used in the present application, the data transmission bandwidth between circuits may differ. Optionally, the transmission bandwidth between the transmission circuit 12 and the shared memory 13 is greater than the transmission bandwidth between the transmission circuit 12 and the machine learning unit 15.

As an example, suppose one machine learning device 11 includes N machine learning units 15 (N being an integer greater than or equal to 1), one transmission circuit 12, and one shared memory 13, and the bandwidth from the transmission circuit 12 to each machine learning unit 15 is M. The bandwidth from the broadcast processing circuit in the transmission circuit 12 to the shared memory 13 can then be set to M*N. The advantage of this design is that conflicts are avoided even in extreme cases. For example, when a plurality of machine learning units 15 send broadcast commands to the transmission circuit 12 at the same time and the arbitration circuit 122 in the transmission circuit 12 forwards those commands to the shared memory 13 one after another, the bandwidth is sufficient and no conflict occurs. In addition, the arbitration circuit 122 in the transmission circuit 12 can select and process one high-priority broadcast command according to the preset arbitration rule and, while waiting for the shared memory 13 to return data, go on to process another broadcast command. This design shortens data processing time and makes efficient use of the data transmission bandwidth. It should be noted that in an actual circuit design, the bandwidth between the transmission circuit 12 and the shared memory 13 may be 2, 4, 6, or another multiple of the bandwidth between the transmission circuit 12 and each machine learning unit 15; it only needs to be greater than the bandwidth between the transmission circuit 12 and each machine learning unit 15. This embodiment places no limitation on this.
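The bandwidth sizing above amounts to simple arithmetic, sketched below. The function names and the sample numbers (M = 16, N = 4) are illustrative assumptions; the units of bandwidth are left abstract.

```python
def broadcast_path_bandwidth(per_unit_bandwidth, num_units):
    """Bandwidth M*N from the broadcast processing circuit to shared memory."""
    if num_units < 1:
        raise ValueError("N must be an integer greater than or equal to 1")
    return per_unit_bandwidth * num_units


def exceeds_unit_bandwidth(memory_bandwidth, per_unit_bandwidth):
    """Check the general condition: memory-side bandwidth > per-unit bandwidth."""
    return memory_bandwidth > per_unit_bandwidth


# Example values (assumed): M = 16, N = 4 gives a memory-side bandwidth of 64.
memory_side = broadcast_path_bandwidth(16, 4)
```

One observation that follows from the arithmetic: with N = 1 the product M*N equals M, so the "greater than" condition is only met strictly when N is at least 2 or when a multiple such as 2x, 4x, or 6x is chosen.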

With the continuous development of artificial neural networks, more and more multi-architecture machine learning chips are being released. When such machine learning chips access or process data in a shared memory, they require large amounts of data and place high demands on data processing speed. In accessing or operating on data, the bandwidth of data transmission is therefore often widened by increasing the amount of hardware, which raises the data processing speed to meet the machine learning chip's demand for high data processing speed.

However, when a machine learning chip accesses or operates on data in this way, the hardware overhead is large and hardware redundancy may result.

To solve the above problem, the present application provides the following technical solution.

The data processing device according to an embodiment of the present application may be implemented in software, hardware, or a combination of the two. The data processing device may be part or all of what is shown in FIG. 43. The data processing device processes machine learning data and includes a machine learning device 11, a transmission circuit 12, and a shared memory 13. The transmission circuit 12 includes a plurality of read/write processing circuits 121 and one arbitration circuit 122. The arbitration circuit 122 arbitrates the data operation signals sent by the plurality of machine learning units 15 and, according to the data operation signal that wins arbitration, obtains from the shared memory 13 the input data required by the machine learning device 11. The read/write processing circuit 121 determines a target machine learning unit or target operation unit among the plurality of machine learning units according to the address information carried in the winning data operation signal or according to the type of the data operation signal, and returns the input data to the target machine learning unit or target operation unit. The machine learning device 11 includes a plurality of machine learning units 15, each of which includes at least one operation unit 151; the plurality of machine learning units are connected to the transmission circuit 12 through the first transmission interface 14, and the transmission circuit 12 is connected to the shared memory 13.

Optionally, the machine learning device 11 may perform a machine learning operation on the input data to obtain output data. Optionally, the machine learning device 11 may also transmit the output data through the transmission circuit 12 to the shared memory 13 for storage. Specifically, when the machine learning device 11 performs a neural network operation, it performs the artificial neural network operation on the input neuron data and weights to obtain output neuron data, and transmits the output neuron data, as new input neuron data, through the transmission circuit 12 to the shared memory 13 for storage.

It should be noted that the machine learning unit, the transmission circuit, the shared memory, and each type of interface may all be implemented as hardware circuits. For example, the transmission circuit may be a broadcast bus. The shared memory may be non-volatile and/or volatile memory, including but not limited to random access memory (RAM) and high-speed cache memory. Each interface may correspond to one or more data I/O (input/output) interfaces or I/O pins.

Referring to FIG. 61, in one optional scheme the machine learning device 11 may include a plurality of machine learning units 15. For the operation of a multi-layer neural network, the computation of one layer in the forward operation is taken as an example. In one implementation, the machine learning device may compute the output neuron data of all neurons in that layer in parallel through a plurality of machine learning units (MLUs). For example, if the machine learning device includes 4 machine learning units and the layer has 100 neurons, each machine learning unit may be assigned 25 of those neurons to process, which can be achieved by setting the corresponding operation instructions. In this process, each machine learning unit obtains from the shared memory, through the transmission circuit, the input neuron data and weights corresponding to its assigned 25 neurons of the layer, computes the output neuron data of those 25 neurons, and transmits that output neuron data through the transmission circuit to the shared memory for storage. It can be understood that each machine learning unit may process the neuron data assigned to it in parallel. Computing the neural network layer by layer in parallel in this way realizes parallel processing of the neural network computation and improves processing efficiency.
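The neuron assignment in the example above (100 neurons over 4 machine learning units, 25 each) can be sketched as an even partition. The function name is hypothetical, and the handling of a remainder when the neuron count is not divisible by the unit count is an assumption; the source only gives the evenly divisible case.

```python
def partition_neurons(num_neurons, num_units):
    """Split neuron indices 0..num_neurons-1 into contiguous per-unit ranges."""
    if num_units < 1:
        raise ValueError("at least one machine learning unit is required")
    base, extra = divmod(num_neurons, num_units)
    ranges = []
    start = 0
    for unit in range(num_units):
        # Assumption: the first `extra` units take one additional neuron.
        size = base + (1 if unit < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# The example from the text: 4 units, 100 neurons in the layer.
assignment = partition_neurons(100, 4)
```

Each entry of `assignment` is the index range one unit would fetch inputs and weights for, compute, and write back to the shared memory.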

When a plurality of machine learning units 15 send data operation signals to the transmission circuit 12 through the first transmission interface 14 at the same time, the data operation signals can be sent through the first transmission interface 14 to the read/write processing circuit 121. The read/write processing circuit 121 may be a single read/write processing circuit or a plurality of read/write processing circuits. When there are a plurality of read/write processing circuits, one machine learning unit 15 may correspond to one read/write processing circuit, or one machine learning unit 15 may correspond to a plurality of read/write processing circuits. The read/write processing circuit 121 sends the data operation signals to the arbitration circuit 122, which arbitrates among them and, according to the data operation signal that wins arbitration, obtains from the shared memory 13 the input neuron data and weights required by the machine learning unit corresponding to that data operation signal. The read/write processing circuit 121 determines the target machine learning unit or target operation unit according to the address information carried in the data operation signal or according to the type of the data operation signal, and returns the input neuron data and weights to the target machine learning unit or target operation unit.

For example, the machine learning device includes four machine learning units, machine learning unit 0, machine learning unit 1, machine learning unit 2, and machine learning unit 3, which correspond respectively to four read/write processing circuits, read/write processing circuit 0, read/write processing circuit 1, read/write processing circuit 2, and read/write processing circuit 3. Machine learning units 0 to 3 send data operation signals through the first transmission interface 14 to read/write processing circuits 0 to 3. Specifically, data operation signal 0 may be sent to read/write processing circuit 0, data operation signal 1 to read/write processing circuit 1, data operation signal 2 to read/write processing circuit 2, and data operation signal 3 to read/write processing circuit 3. Read/write processing circuits 0 to 3 send data operation signals 0 to 3, respectively, to the arbitration circuit 122 for arbitration. The arbitration circuit 122 arbitrates among the plurality of data operation signals, determines that data operation signal 2 wins arbitration, and obtains input neuron data and weights from the shared memory 13 according to data operation signal 2. Read/write processing circuit 2 determines machine learning unit 1 and machine learning unit 2 as the target machine learning units according to the address information carried in data operation signal 2, for example address information containing the addresses of machine learning unit 1 and machine learning unit 2, and then returns the input neuron data and weights obtained according to data operation signal 2 to machine learning unit 1 and machine learning unit 2.
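The four-unit example above can be traced with a small simulation. It is a sketch under several assumptions: the priority values are invented so that data operation signal 2 wins (the text does not say why it wins), the target units of signals 0, 1, and 3 are invented, and the shared memory is modelled as a dictionary with placeholder values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataOperationSignal:
    index: int           # which read/write processing circuit forwarded it
    priority: int        # hypothetical priority assigned by the arbiter
    target_units: tuple  # machine learning units named in the address info


def run_round(signals, shared_memory):
    """Arbitrate once, fetch the data, and return it per target unit."""
    # Arbitration: the highest-priority signal wins.
    winner = max(signals, key=lambda signal: signal.priority)
    # The arbiter obtains input neuron data and weights from shared memory.
    payload = (shared_memory["input_neurons"], shared_memory["weights"])
    # The winning circuit returns the data to every target unit.
    return winner.index, {unit: payload for unit in winner.target_units}


signals = [
    DataOperationSignal(0, 1, (0,)),
    DataOperationSignal(1, 2, (1,)),
    DataOperationSignal(2, 9, (1, 2)),  # addresses units 1 and 2, as in the text
    DataOperationSignal(3, 3, (3,)),
]
shared_memory = {"input_neurons": [0.5, -1.0], "weights": [0.1, 0.2]}
winner_index, delivered = run_round(signals, shared_memory)
```

After the round, `winner_index` identifies data operation signal 2 and `delivered` maps machine learning units 1 and 2 to the fetched input neuron data and weights.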

In another alternative, the machine learning device may use the plurality of machine learning units to compute, in a fixed order, the output neuron data of all neurons in each layer of the neural network. In this process, the preceding machine learning unit may send the output neuron data of all neurons of its layer to the shared memory through the transmission circuit for storage, so that the next machine learning unit can fetch that output neuron data and use it as the input neuron data of the next layer. It can be understood that this approach suits cases where the amount of computation per layer is small, for example a neural network with few neurons in each layer.

The machine learning unit 15 is described in detail with reference to FIG. 62. In one approach, the machine learning unit 15 includes at least one operation unit 151 and a controller unit 152 connected to the operation unit 151. The operation unit 151 includes one master processing circuit 151a and a plurality of slave processing circuits 151b, and is connected to the transmission circuit 12 through the first transmission interface 14.

The controller unit 152 sends the data operation signal and the output neuron data to the transmission circuit 12 through the first transmission interface 14. It also receives, through the first transmission interface 14, the input neuron data and weights that the transmission circuit 12 obtained from the shared memory 13, and then sends the input neuron data and weights to the master processing circuit 151a and/or the slave processing circuits 151b.

The master processing circuit 151a distributes the input neuron data and weights to the plurality of slave processing circuits 151b. The slave processing circuits 151b perform intermediate operations in parallel on the neuron data and weights to obtain a plurality of intermediate results, and transmit the intermediate results to the master processing circuit 151a. The master processing circuit 151a then performs subsequent processing, including an activation operation, on the intermediate results to obtain a computation result. Specifically, the controller unit 152 also obtains a computation instruction, parses the computation instruction into a plurality of operation instructions, and sends the operation instructions to the master processing circuit.
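The master/slave division of work described above can be illustrated with a small software sketch. This is only an analogy for the hardware, not the patented circuit: the function names `slave_partial` and `master_compute`, the row-wise split of the weights, the use of a thread pool, and the ReLU activation are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def slave_partial(inputs, weight_rows):
    """Hypothetical slave circuit: one weighted sum per assigned neuron."""
    return [sum(x * w for x, w in zip(inputs, row)) for row in weight_rows]


def master_compute(inputs, weights, num_slaves=2,
                   activation=lambda v: max(0.0, v)):
    """Hypothetical master circuit: distribute, gather, then activate."""
    # Split the weight rows (one row per output neuron) into contiguous chunks.
    chunk = max(1, (len(weights) + num_slaves - 1) // num_slaves)
    chunks = [weights[i:i + chunk] for i in range(0, len(weights), chunk)]
    # The slaves compute their intermediate results in parallel.
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        partials = list(pool.map(lambda rows: slave_partial(inputs, rows), chunks))
    # The master gathers the intermediate results in order and applies
    # the activation (ReLU here, chosen only as an example).
    return [activation(v) for part in partials for v in part]


result = master_compute([1.0, 2.0], [[1.0, 1.0], [-1.0, -1.0], [0.5, 0.0]])
print(result)
```

With three output neurons and two slaves, the first slave handles two neurons and the second handles one; the master's activation step clamps the negative intermediate result to zero.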

It can be understood that, in this embodiment, the machine learning unit includes a plurality of operation units, and each operation unit may send or receive data through the first transmission interface.

For example, in one optional technical solution, the master processing circuit may further include a controller unit, and this controller unit may include a master instruction processing unit specifically configured to decode operation instructions into microinstructions. In another optional technical solution, the slave processing circuit may include another controller unit, and this other controller unit includes a slave instruction processing unit specifically configured to receive and process microinstructions. A microinstruction may be the next-level instruction below an instruction. It may be obtained by splitting or decoding the instruction, and may be further decoded into control signals for each component, each unit, or each processing circuit. For example, a multiplication microinstruction is the next-level instruction below a convolution instruction.

The data processing device for processing machine learning data according to this embodiment includes a machine learning device, a transmission circuit, and a shared memory. The transmission circuit includes a plurality of read/write processing circuits and one arbitration circuit. The machine learning device includes a plurality of machine learning units, each of which includes at least one operation unit. The machine learning units are connected to the transmission circuit through the first transmission interface, and the transmission circuit is connected to the shared memory. In this embodiment, the data processing device arbitrates, through the arbitration circuit, the data operation signals sent by the plurality of machine learning units, and obtains from the shared memory, according to the arbitration result, the input neuron data and weights required by the machine learning device. Thus, when the data processing device performs data operations, the plurality of machine learning units operate on the shared memory through a single transmission circuit, and the arbitration circuit arbitrates among the data operation signals, which reduces hardware overhead while preventing the data operation signals from blocking.

In one embodiment, with continued reference to FIG. 61, the read/write processing circuit includes any one of a unicast read processing circuit and a broadcast processing circuit, and the data operation signal includes at least one of a unicast read request, a unicast read instruction, a multicast instruction, and a broadcast instruction. A unicast-type processing circuit processes unicast-type signals, and a broadcast-type processing circuit processes multicast-type or broadcast-type signals.

For example, a unicast read instruction is an instruction sent by a machine learning unit to read the input neuron data and weights at a source address in the shared memory. In response to the unicast read instruction, the input neuron data and weights can be returned to that machine learning unit. They may be the input neuron data and weights required by the neurons assigned to that machine learning unit when it computes its assigned neurons of a given layer according to the computation instruction. A broadcast instruction is an instruction sent by a machine learning unit to read the input neuron data and weights at a source address in the shared memory. In response to the broadcast instruction, the input neuron data and weights can be returned to all machine learning units in the machine learning device. The input neuron data may be the input neuron data required by all neurons of a given layer, that is, all output neuron data of the previous layer, and the weights are reusable weights such as a convolution kernel. A multicast instruction differs from a broadcast instruction in that its data is returned not to all machine learning units in the machine learning device but to the plurality of machine learning units corresponding to the marker field in the multicast instruction. In general, an instruction differs from a request in that executing an instruction incurs relatively high overhead but the instruction carries more information, whereas executing a request incurs relatively low overhead but the request carries less information.
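The way the three signal types select their return targets can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not specify how the marker field is encoded, so the bitmask layout (bit i selects machine learning unit i) is hypothetical, as are the names `SignalType`, `DataOperationSignal`, and `target_units`.

```python
from dataclasses import dataclass
from enum import Enum


class SignalType(Enum):
    UNICAST_READ = "unicast_read"
    MULTICAST = "multicast"
    BROADCAST = "broadcast"


@dataclass(frozen=True)
class DataOperationSignal:
    sender: int            # index of the machine learning unit that sent it
    kind: SignalType
    source_address: int    # address in shared memory to read from
    marker_field: int = 0  # hypothetical encoding: bit i selects unit i


def target_units(signal, num_units):
    """Return the machine learning units that receive the read data."""
    if signal.kind is SignalType.UNICAST_READ:
        # Data goes back only to the unit that issued the read.
        return [signal.sender]
    if signal.kind is SignalType.BROADCAST:
        # Data goes to every unit in the machine learning device.
        return list(range(num_units))
    # Multicast: only the units selected by the marker field.
    return [u for u in range(num_units) if (signal.marker_field >> u) & 1]


multicast = DataOperationSignal(2, SignalType.MULTICAST, 0x100, marker_field=0b0110)
print(target_units(multicast, num_units=4))
```

Here the multicast signal sent by unit 2 selects units 1 and 2, which mirrors the earlier example in which data operation signal 2 targets machine learning units 1 and 2.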

In this embodiment, the data processing device arbitrates, through the arbitration circuit, the data operation signals sent by the plurality of machine learning units, and obtains from the shared memory, according to the arbitration result, the input neuron data and weights required by the machine learning device. Thus, when the data processing device performs data operations, the plurality of machine learning units operate on the shared memory through a single transmission circuit, which reduces hardware overhead and avoids hardware redundancy. The following embodiments describe in detail the specific process by which the arbitration module determines the priorities of the data operation signals sent by the plurality of read/write processing circuits.

In one embodiment, the arbitration circuit 122 is specifically configured to determine the priorities of the data operation signals sent by the plurality of read/write processing circuits 121 and to take the data operation signal with the highest priority as the successfully arbitrated data operation signal.

Here, the arbitration circuit 122 may determine the priorities of the plurality of data operation signals according to a preset rule, so that it can determine from the priority of each data operation signal the object to be operated on, that is, the successfully arbitrated data operation signal. The sending time of each data operation signal may serve as the arbitration criterion, or the transmission rate information carried in the data operation signal may be used as the arbitration criterion. For example, if read/write processing circuit 1 sends its data operation signal at time T and read/write processing circuit 2 sends its data operation signal at time T+1, then with sending time as the arbitration criterion, the data operation signal sent by read/write processing circuit 1 has the higher priority and is the successfully arbitrated data operation signal. According to the arbitration result, the arbitration circuit 122 obtains data from the shared memory 13 based on the data operation signal sent by read/write processing circuit 1.
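The time-based criterion in the example above amounts to picking the earliest pending signal. The sketch below shows that rule in software; the names `PendingSignal` and `arbitrate_by_send_time` and the integer timestamps are assumptions for illustration, and a real arbitration circuit could equally use transmission rate or another preset rule.

```python
from typing import NamedTuple


class PendingSignal(NamedTuple):
    circuit_id: int  # read/write processing circuit that forwarded the signal
    send_time: int   # hypothetical timestamp, smaller means earlier


def arbitrate_by_send_time(pending):
    """Return the signal with the earliest send time (highest priority)."""
    # min() keeps the first of several equal entries, so equal send times
    # need an additional tie-breaking rule in practice.
    return min(pending, key=lambda s: s.send_time)


T = 100
winner = arbitrate_by_send_time([PendingSignal(1, T), PendingSignal(2, T + 1)])
print(winner.circuit_id)
```

With read/write processing circuit 1 sending at time T and circuit 2 at time T+1, the signal from circuit 1 wins arbitration, as in the text.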

The data processing device according to this embodiment determines, through the arbitration circuit, the priorities of the data operation signals sent by the plurality of read/write processing circuits, and takes the data operation signal with the highest priority as the successfully arbitrated data operation signal. When a plurality of data operation signals are received at the same time, the arbitration circuit selects one executable data operation signal, which prevents the data blocking that would result from executing several data operation signals simultaneously. Furthermore, the plurality of machine learning units can perform data operations on the shared memory through a single transmission circuit, which reduces hardware overhead and avoids hardware redundancy.

In one embodiment, the arbitration circuit 122 is specifically configured to determine, when the data operation signals sent by the plurality of read/write processing circuits 121 have the same priority, the successfully arbitrated data operation signal according to the types of the data operation signals and a preset execution condition.

Here, building on the above embodiment, when the data operation signals sent by the plurality of read/write processing circuits 121 have the same priority, the arbitration circuit 122 may determine the successfully arbitrated data operation signal according to the types of the data operation signals and the preset execution condition.

Here, the preset execution condition may be to determine the arbitration result by checking whether the data transmission channel corresponding to a data operation signal is idle. If the data transmission channel is idle, the data operation signal corresponding to that channel is selected as the successfully arbitrated data operation signal. The arbitration result may also be determined from the sending time information carried in the data operation signal. For example, suppose the arbitration circuit 122 receives four data operation signals, namely data operation signal 0, data operation signal 1, data operation signal 2, and data operation signal 3, where data operation signal 1 and data operation signal 2 have the same priority, data operation signal 1 is a unicast read instruction, and data operation signal 2 is a broadcast instruction. Machine learning unit 1 is determined to be the target machine learning unit from the address information contained in data operation signal 1, and machine learning unit 0, machine learning unit 1, machine learning unit 2, and machine learning unit 3 are determined to be the target machine learning units from the type of data operation signal 2. If the data channels of machine learning unit 0, machine learning unit 1, and machine learning unit 2 are idle and the data channel of machine learning unit 3 is busy, the arbitration circuit 122 determines data operation signal 1 to be the successfully arbitrated data operation signal, because data operation signal 1 is a unicast read instruction, data operation signal 2 is a broadcast instruction, and the data channel of machine learning unit 3 is busy.

Optionally, when the data operation signal is a unicast-type signal, the execution condition includes: the channel of the machine learning unit that sent the unicast-type signal is idle, or the channel of the operation unit in the machine learning unit that sent the unicast-type signal is idle.

Optionally, when the data operation signal is a multicast-type signal, the execution condition includes: the channel of the machine learning unit that sent the multicast-type signal is idle and the channels of the target machine learning units specified by the multicast-type signal are idle; or the channel of the operation unit in the machine learning unit that sent the multicast-type signal is idle and the channels of the target operation units specified by the multicast-type signal are idle.

Optionally, when the data operation signal is a broadcast-type signal, the execution condition includes: the channel of the machine learning unit that sent the broadcast-type signal is idle and the channels of all remaining machine learning units are idle; or the channel of the operation unit in the machine learning unit that sent the broadcast-type signal is idle and the channels of the operation units in all remaining machine learning units are idle.
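The per-type execution conditions above, applied to equal-priority signals, can be sketched as a simple check over channel states. This is an illustrative model only: the dictionary-based signal representation, the function names, the string type labels, and the choice to scan candidates in list order are assumptions, and the sender of the broadcast signal in the usage example is assumed to be machine learning unit 2.

```python
def execution_condition_met(kind, sender, targets, idle):
    """Check the preset execution condition for one data operation signal.

    idle maps a machine learning unit index to True when its channel is idle.
    """
    if kind == "unicast":
        return idle[sender]
    if kind == "multicast":
        # Sender and every specified target must be idle.
        return idle[sender] and all(idle[t] for t in targets)
    if kind == "broadcast":
        # Sender and all remaining units must be idle.
        return all(idle.values())
    raise ValueError(f"unknown signal type: {kind}")


def break_tie(candidates, idle):
    """Pick the first equal-priority signal whose condition holds, else None."""
    for sig in candidates:
        if execution_condition_met(sig["kind"], sig["sender"], sig["targets"], idle):
            return sig
    return None


# Units 0 to 2 are idle, unit 3 is busy, as in the example in the text.
idle = {0: True, 1: True, 2: True, 3: False}
signal1 = {"name": "signal 1", "kind": "unicast", "sender": 1, "targets": [1]}
signal2 = {"name": "signal 2", "kind": "broadcast", "sender": 2, "targets": [0, 1, 2, 3]}
print(break_tie([signal2, signal1], idle)["name"])
```

The broadcast signal fails its condition because unit 3 is busy, so the unicast read from unit 1 is selected regardless of the order in which the two candidates are listed.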

In the data processing device according to this embodiment, when the data operation signals sent by the plurality of read/write processing circuits have the same priority, the arbitration circuit can determine the successfully arbitrated data operation signal according to the types of the data operation signals and the preset execution condition. This further prevents the data blocking that would result from executing several data operation signals simultaneously, and allows the plurality of machine learning units to perform data operations on the shared memory through a single transmission circuit, which reduces hardware overhead and avoids hardware redundancy.

In one embodiment, referring to FIG. 63, the transmission circuit 12 further includes a second transmission interface 120. Each interface in the second transmission interface 120 is connected in one-to-one correspondence with an interface in the first transmission interface 14, and each machine learning unit 15 is connected to a corresponding read/write processing circuit 121.

Here, the first transmission interface 14 sends the data operation signal through the second transmission interface 120 to the corresponding read/write processing circuit 121. The transmission circuit 12 may return the input neuron data and weights required by the machine learning device through the second transmission interface 120 to the first transmission interface 14, and then through the first transmission interface 14 to the target machine learning unit or target operation unit. The first transmission interface 14 may include one interface or a plurality of interfaces, and the second transmission interface 120 may likewise include one interface or a plurality of interfaces. For example, when the first transmission interface 14 includes one sending interface 141 and one data receiving interface 142, the second transmission interface 120 includes a second receiving interface 1201 and a second return interface 1202 corresponding to the sending interface 141 and the data receiving interface 142, respectively.

Optionally, referring to FIG. 64, the plurality of operation units 151 in one machine learning unit 15 share one sending interface 141 in the first transmission interface 14, and each operation unit corresponds to one data receiving interface 142.

Here, when one machine learning unit 15 includes a plurality of operation units 151, the operation units 151 may share one sending interface 141 of the first transmission interface 14. The operation units 151 in the machine learning unit 15 send data operation signals to the transmission circuit 12 through the shared sending interface 141, and the transmission circuit 12 returns the obtained input neuron data and weights to the target operation unit through the data receiving interface 142 corresponding to the target operation unit 151.

Accordingly, in the data processing device according to this embodiment, the plurality of operation units in one machine learning unit share one sending interface in the first transmission interface, and each operation unit corresponds to one data receiving interface. This effectively reduces the number of interfaces for sending data operation signals in the machine learning unit, saves hardware resources, and lowers hardware area and power consumption.

In one embodiment, referring to FIG. 65, each of the plurality of operation units 151 in one machine learning unit 15 corresponds to one sending interface 141 and one data receiving interface 142 in the first transmission interface.

Here, as shown in FIG. 65, each operation unit 151 corresponds to one sending interface 141 and one data receiving interface 142. The operation unit 151 sends the data operation signal to the transmission circuit 12 through its corresponding sending interface 141, and the transmission circuit 12 returns the obtained input neuron data and weights to the corresponding target operation unit 151 through the corresponding data receiving interface 142. For example, operation unit 1 corresponds to sending interface 1 and data receiving interface 1, and operation unit 2 corresponds to sending interface 2 and data receiving interface 2. Operation unit 1 sends a data operation signal to the transmission circuit 12 through sending interface 1, and the transmission circuit 12 determines operation unit 1 and operation unit 2 to be the target operation units according to the data operation signal. The transmission circuit then returns the obtained input neuron data and weights to operation unit 1 and operation unit 2 through data receiving interface 1 and data receiving interface 2.

Accordingly, in the data processing device according to this embodiment, each of the plurality of operation units in one machine learning unit corresponds to one sending interface and one data receiving interface in the first transmission interface. Because the operation units correspond one-to-one with the sending interfaces and data receiving interfaces in the first transmission interface, the control logic in the data transmission process can be effectively simplified.

In one embodiment, referring to FIG. 66, the plurality of machine learning units 15 share one signal receiving interface 81201 and one data return interface 81202 in the second transmission interface 120.

Here, the plurality of machine learning units 15 may share one signal receiving interface 81201 and one data return interface 81202 in the second transmission interface 120. For example, when the read/write processing circuit 121 is a broadcast read processing circuit, the data operation signals sent by the plurality of machine learning units are sent to the broadcast read processing circuit through the single signal receiving interface 81201. The broadcast read processing circuit obtains the input neuron data and weights according to the data operation signal and, based on the address information in the data operation signal, returns the input neuron data and weights to the target machine learning unit through the data return interface 81202.

In the data processing device according to this embodiment, the plurality of machine learning units share one signal receiving interface and one data return interface in the second transmission interface. This sharing further reduces hardware overhead and avoids hardware redundancy.

In one embodiment, with continued reference to FIG. 66, the read/write processing circuit 121 further includes a signal queue. The signal queue stores the data operation signals sent by each machine learning unit 15. When the read/write processing circuit 121 receives a data operation signal, it determines whether the signal queue has remaining space. If so, it caches the data operation signal in the signal queue; if not, it blocks the data operation signal.

Here, the signal queue can store the data operation signals sent by each machine learning unit 15, and may be placed either outside or inside the read/write processing circuit 121. When the read/write processing circuit 121 receives a data operation signal, it may send a memory query instruction to the signal queue to obtain the size of the signal queue's storage space. If the storage space of the signal queue is large enough to hold the data operation signal, the data operation signal is cached in the signal queue; if it is not, the data operation signal is blocked.
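The cache-or-block behavior of the signal queue can be modeled as a bounded FIFO. The sketch below is a software analogy under stated assumptions: the class name `SignalQueue`, the slot-count capacity, and the boolean return value used to represent blocking are hypothetical, and the text does not state how a blocked signal is later retried.

```python
from collections import deque


class SignalQueue:
    """Hypothetical bounded queue of pending data operation signals."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()

    def has_space(self):
        """Stand-in for the memory query: is there remaining space?"""
        return len(self._items) < self.capacity

    def offer(self, signal):
        """Cache the signal if space remains; return False if it is blocked."""
        if not self.has_space():
            return False
        self._items.append(signal)
        return True

    def pop(self):
        """Release the oldest cached signal, or None if the queue is empty."""
        return self._items.popleft() if self._items else None


queue = SignalQueue(capacity=2)
accepted = [queue.offer(s) for s in ("signal 0", "signal 1", "signal 2")]
print(accepted)
```

In this run the third signal is blocked because both slots are occupied; once `pop()` releases the oldest entry toward arbitration, a blocked signal can be offered again.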

In the data processing device according to this embodiment, the read/write processing circuit further includes a signal queue. The signal queue stores the data operation signals transmitted from each machine learning unit. When the read/write processing circuit receives a data operation signal, it determines whether the signal queue has remaining space; if so, it caches the data operation signal in the signal queue, and if not, it blocks the data operation signal. In this embodiment, when the read/write processing circuit receives a plurality of data operation signals, it either caches them in the signal queue or blocks them, so that the data operation signals are sent to the arbitration circuit one by one for processing, which prevents congestion of data operation signals. Furthermore, because the plurality of machine learning units perform data operations on the shared memory through a single transmission circuit, hardware overhead is reduced and hardware redundancy is avoided.
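The cache-or-block behavior of the signal queue can be illustrated with a minimal software sketch. This is only a model of the described behavior, not the circuit itself: the class name `SignalQueue`, the `capacity` parameter, and the method names are hypothetical, and the queue is assumed to count space in whole signals.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SignalQueue:
    """Bounded FIFO for data operation signals (capacity is an assumed parameter)."""
    capacity: int
    _slots: deque = field(default_factory=deque, init=False)

    def has_space(self) -> bool:
        # Models the check for remaining space in the signal queue.
        return len(self._slots) < self.capacity

    def offer(self, signal) -> bool:
        """Cache the signal if space remains; return False if it is blocked."""
        if self.has_space():
            self._slots.append(signal)
            return True
        return False

    def pop_for_arbitration(self):
        """Hand the oldest cached signal to the arbitration stage, or None if empty."""
        return self._slots.popleft() if self._slots else None


# Three machine learning units send signals to a queue with room for two.
queue = SignalQueue(capacity=2)
results = [queue.offer(f"signal-from-mlu-{i}") for i in range(3)]
first = queue.pop_for_arbitration()       # frees one slot
retry = queue.offer("signal-from-mlu-2")  # the blocked signal can now be cached
```

In this sketch a blocked signal is simply rejected and must be offered again by the sender; the source text does not say how a blocked signal is retried.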

Optionally, when the read/write processing circuit 121 is a broadcast processing circuit, the signal queue includes a command queue and a request queue. The command queue caches the command-type signals received by the broadcast processing circuit, and the request queue caches the request-type signals obtained by parsing the command-type signals.

Here, when the read/write processing circuit 121 is a broadcast processing circuit, the signal queue may include a command queue and a request queue. The command-type signals transmitted from each machine learning unit 15 are stored in the command queue; the broadcast processing circuit parses each command-type signal to obtain request-type signals, and the obtained request-type signals are then stored in the request queue. In other words, the command queue caches the command-type signals received by the broadcast processing circuit, and the request queue caches the request-type signals obtained by parsing the command-type signals.

In the data processing device according to this embodiment, when the read/write processing circuit is a broadcast processing circuit, the signal queue includes a command queue and a request queue. The command queue caches the command-type signals received by the broadcast processing circuit, and the request queue caches the request-type signals obtained by parsing the command-type signals. In this embodiment, command-type signals and request-type signals are stored in the command queue and the request queue respectively, so that they are sent to the arbitration circuit one by one for processing. This further prevents congestion of data operation signals and lets the plurality of machine learning units perform data operations on the shared memory through a single transmission circuit, reducing hardware overhead and avoiding hardware redundancy.
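The two-queue arrangement for the broadcast processing circuit can be sketched as follows. The text does not specify how a command-type signal is parsed into request-type signals, so the `toy_parse` function and the string format of the signals are placeholders invented for illustration; the class and method names are likewise hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Optional


class BroadcastSignalQueues:
    """Command queue plus request queue, as a software model only."""

    def __init__(self, parse: Callable[[str], Iterable[str]]):
        self.command_queue: Deque[str] = deque()
        self.request_queue: Deque[str] = deque()
        self._parse = parse

    def receive_command(self, command: str) -> None:
        # Command-type signals are cached in the command queue on arrival.
        self.command_queue.append(command)

    def parse_next_command(self) -> bool:
        """Parse the oldest command and cache the resulting request-type signals."""
        if not self.command_queue:
            return False
        command = self.command_queue.popleft()
        self.request_queue.extend(self._parse(command))
        return True

    def next_request(self) -> Optional[str]:
        """Release one request-type signal at a time toward arbitration."""
        return self.request_queue.popleft() if self.request_queue else None


def toy_parse(command: str):
    # Placeholder format: "<op> <address> <count>" expands to <count> requests.
    op, address, count = command.split()
    return [f"{op}-REQ {address} #{i}" for i in range(int(count))]


queues = BroadcastSignalQueues(toy_parse)
queues.receive_command("BCAST 0x100 2")
queues.parse_next_command()
requests = [queues.next_request(), queues.next_request(), queues.next_request()]
```

Keeping the two signal types in separate queues means parsing can proceed independently of how quickly requests are drained, which matches the one-by-one hand-off described above.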

The same or similar parts of the above embodiments may be cross-referenced, and content not described in detail in one embodiment may be found in the same or similar content of other embodiments.

It should be noted that, in the description of this application, terms such as "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance. In addition, unless otherwise specified, "a plurality of" means two or more.

A process or method described in a flowchart or otherwise herein should be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. Moreover, the scope of the preferred embodiments of this application includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved. This will be understood by those of ordinary skill in the art to which the embodiments of this application pertain.

It will be understood that each part of this application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, a plurality of steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

Those of ordinary skill in the art to which this application pertains will understand that all or some of the steps of the methods of the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, the functional units in the embodiments of this application may be integrated into one processing module, each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software function module. If the integrated module is implemented in the form of a software function module and is sold or used as an independent product, it may be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

In the description of this specification, reference terms such as "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. Such expressions in this specification do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of this application have been shown and described, the above embodiments are illustrative and should not be construed as limiting this application. Those of ordinary skill in the art to which this application pertains will understand that the above embodiments may be changed, modified, substituted, and varied within the scope of this application.

Claims

A data processing method performed in a data processing apparatus including a processor and a memory in which a computer program is stored, the method comprising:
receiving a data operation signal that is transmitted from an internal or external device and includes an operation domain and an operation code;
determining whether the data operation signal is an I/O command according to a first type flag bit included in the operation code;
if the data operation signal is an I/O command, determining whether the data operation signal is a broadcast or multicast command among the I/O commands according to a second type flag bit included in the operation domain;
obtaining necessary input data from the memory by performing a corresponding operation on operation schedule data in the memory according to the determined type of the data operation signal; and
determining a device or processing circuit that receives the input data according to the broadcast or multicast command, based on data reception flag bits included in the operation domain,
wherein, in the data reception flag bits, a bit value is assigned to each machine learning unit capable of interacting with the memory, and the assigned bit value indicates whether or not that machine learning unit can receive data.
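The flag-bit decisions in this claim can be illustrated with a small sketch. The field names and the `DataOperationSignal` container are hypothetical, the first type flag is modeled as the string "I/O", and a reception bit value of 1 is assumed to mean "this unit may receive data"; the claim itself does not fix that encoding.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataOperationSignal:
    opcode_type_flag: str        # first type flag bit, carried in the operation code
    broadcast_flag: int          # second type flag bit, carried in the operation domain
    reception_flags: List[int]   # one bit per machine learning unit (assumed: 1 = receives)


def is_io_command(sig: DataOperationSignal) -> bool:
    return sig.opcode_type_flag == "I/O"


def is_broadcast_or_multicast(sig: DataOperationSignal) -> bool:
    # The second flag is only consulted once the signal is known to be an I/O command.
    return is_io_command(sig) and sig.broadcast_flag == 1


def receiving_units(sig: DataOperationSignal) -> List[int]:
    """Indices of the machine learning units marked to receive the input data."""
    if not is_broadcast_or_multicast(sig):
        return []
    return [unit for unit, bit in enumerate(sig.reception_flags) if bit == 1]


sig = DataOperationSignal("I/O", 1, [1, 0, 1, 1])
targets = receiving_units(sig)
not_io = receiving_units(DataOperationSignal("CAL", 1, [1, 1, 1, 1]))
```

Under these assumptions, a signal whose reception bits are all set behaves as a broadcast, and any proper subset behaves as a multicast.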

delete

The method of claim 1,
wherein the number of data reception flag bits indicates the number of devices or processing circuits capable of interacting with the memory.

The method of claim 1,
wherein the operation domain further includes information on the operation schedule data, the information including a source address of the operation schedule data in the memory, a length of the operation schedule data, and a data return address after the data operation, and wherein obtaining the necessary input data by performing the corresponding operation on the operation schedule data in the memory according to the data operation signal comprises:
reading the memory starting from the source address to obtain input data that satisfies the data length; and
returning the input data, according to the data return address, to a storage space corresponding to the data return address in the determined device or processing circuit.

The method of claim 4,
wherein the determined device includes at least one machine learning unit, and each machine learning unit includes a master processing circuit and a plurality of slave processing circuits.

The method of claim 5,
wherein the operation domain further includes a jump sub operation domain, the jump sub operation domain including a jump width and a jump data length operated on after each jump, and wherein reading the memory starting from the source address to obtain the input data that satisfies the data length comprises:
reading the memory from the source address and obtaining first jump data according to the jump data length after the current jump;
obtaining the last address of the first jump data and jumping from the last address to a target jump address according to the jump width; and
obtaining second jump data from the target jump address according to the jump data length after the jump, until the length of the jump data obtained after each jump satisfies the data length.
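The jump-based read in this claim can be sketched as a loop over a flat memory. This is an illustration under stated assumptions: memory is modeled as a Python list, the "last address" is taken as the address just past the segment that was read (so the jump width counts skipped elements), and the final segment is truncated if the remaining data length is shorter than the jump data length. The function and parameter names are hypothetical.

```python
from typing import List, Sequence


def jump_read(memory: Sequence[int], source: int, data_length: int,
              jump_length: int, jump_width: int) -> List[int]:
    """Collect data_length elements by reading jump_length elements, then skipping jump_width."""
    if jump_length <= 0:
        raise ValueError("jump_length must be positive")
    collected: List[int] = []
    address = source
    while len(collected) < data_length:
        take = min(jump_length, data_length - len(collected))
        segment = memory[address:address + take]
        if len(segment) < take:
            raise IndexError("read runs past the end of memory")
        collected.extend(segment)
        last_address = address + take         # assumed: exclusive end of the segment
        address = last_address + jump_width   # target jump address
    return collected


memory = list(range(20))
data = jump_read(memory, source=2, data_length=6, jump_length=2, jump_width=3)
```

With the values above, the read covers addresses 2 to 3, 7 to 8, and 12 to 13, so `data` is `[2, 3, 7, 8, 12, 13]`.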

The method of claim 6,
wherein the jump sub operation domain includes a stride operation domain and/or a segment operation domain, the stride operation domain indicating the jump width of each jump of the data operation signal, and the segment operation domain indicating a preset size of each segment of the data operation signal.

The method of claim 7,
wherein the operation domain further includes a function flag bit indicating a processing operation to be performed on the read data.

The method of claim 8,
wherein, in the step of determining whether the data operation signal is an I/O command, if the value of the first type flag bit is I/O, the data operation signal is determined to be an I/O command; and
in the step of determining whether the I/O command is a broadcast or multicast command, if the value of the second type flag bit is 1, the data operation signal is determined to be a broadcast or multicast command among the I/O commands.

The method of claim 9,
wherein receiving the data operation signal transmitted from the internal or external device further comprises:
parsing the data operation signal to obtain the type flag bits of the data operation signal and the information on the operation schedule data; and
executing the parsed data operation signal according to a command queue that indicates the execution order of data operation signals.

The method of claim 10,
further comprising, before executing the parsed data operation signal according to the command queue:
determining a dependency relationship between adjacent parsed data operation signals to obtain a determination result, the dependency relationship indicating whether an association exists between the s-th data operation signal and the (s-1)-th data operation signal; and
if the determination result is that a dependency relationship exists between the s-th data operation signal and the (s-1)-th data operation signal, caching the s-th data operation signal, and extracting the s-th data operation signal after the (s-1)-th data operation signal has been executed.

The method of claim 11,
wherein determining the dependency relationship between adjacent parsed data operation signals comprises:
obtaining, according to the s-th data operation signal, a first storage address interval from which the data required by the s-th data operation signal is extracted, and obtaining, according to the (s-1)-th data operation signal, a zeroth storage address interval from which the data required by the (s-1)-th data operation signal is extracted;
determining that the s-th data operation signal and the (s-1)-th data operation signal have a dependency relationship if the first storage address interval and the zeroth storage address interval have an overlapping area; and
determining that the s-th data operation signal and the (s-1)-th data operation signal do not have a dependency relationship if the first storage address interval and the zeroth storage address interval have no overlapping area.
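The overlap test in this claim reduces to an interval intersection check. The sketch below assumes each storage address interval is a half-open pair `(start, end)`, so that intervals which merely touch are not treated as overlapping; the claim does not state whether the bounds are inclusive, and the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed half-open: [start, end)


def has_dependency(first: Interval, zeroth: Interval) -> bool:
    """True if the s-th interval (first) overlaps the (s-1)-th interval (zeroth)."""
    return first[0] < zeroth[1] and zeroth[0] < first[1]


def must_wait(intervals: List[Interval]) -> List[bool]:
    """For each signal in queue order, whether it is cached until its predecessor finishes."""
    return [False] + [
        has_dependency(intervals[s], intervals[s - 1])
        for s in range(1, len(intervals))
    ]


overlapping = has_dependency((100, 200), (150, 250))
touching = has_dependency((100, 200), (200, 300))
plan = must_wait([(0, 64), (32, 96), (128, 192)])
```

Here the second signal overlaps the first and is held back, while the third touches neither neighbor's addresses and can proceed: `plan` is `[False, True, False]`.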

A data processing device comprising a processor and a memory in which a computer program is stored,
wherein the processor implements the steps of the method according to any one of claims 1 and 3 to 12 when executing the computer program.

A board card comprising the data processing device according to claim 13.

An electronic device comprising the board card according to claim 14.