CN111079908B

CN111079908B - Network-on-chip data processing method, storage medium, computer device and apparatus

Info

Publication number: CN111079908B
Application number: CN201811216857.1A
Authority: CN
Inventors: name withheld at request
Original assignee: Shanghai Cambricon Information Technology Co Ltd
Current assignee: Shanghai Cambricon Information Technology Co Ltd
Priority date: 2018-10-18
Filing date: 2018-10-18
Publication date: 2024-02-13
Anticipated expiration: 2038-10-18
Also published as: CN111079908A; KR102539571B1; KR20200138411A

Abstract

The present application relates to a network-on-chip data processing method. The method is applied to a network-on-chip processing system that is used for executing machine learning calculations and comprises a storage device and a computing device. The method comprises the following steps: accessing a storage device in the network-on-chip processing system through a first computing device in the network-on-chip processing system to obtain first operation data; performing an operation on the first operation data through the first computing device to obtain a first operation result; and sending the first operation result to a second computing device in the network-on-chip processing system. The method can reduce operation overhead and improve data read/write efficiency.

Description

Network-on-chip data processing method, storage medium, computer device and apparatus

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to a network-on-chip data processing method, a storage medium, a computer device, and an apparatus.

Background

With the development of semiconductor process technology, the integration of billions of transistors in a single chip has become a reality. A network on chip (NoC) can integrate a large amount of computing resources on a single chip and enable on-chip communication.

Neural networks require a large number of calculations, some of which, such as forward operation, backward operation, and weight update, need to be processed in parallel. In a chip architecture with a large number of transistors, chip design faces problems such as high memory access overhead, severe bandwidth blocking, and low data read/write efficiency.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a network-on-chip data processing method, a storage medium, a computer device, and an apparatus that can reduce operation overhead and improve data read/write efficiency.

In a first aspect, a method for processing network-on-chip data is provided, where the method is applied to a network-on-chip processing system, and the network-on-chip processing system is used for performing machine learning computation, and the network-on-chip processing system includes: a storage device and a computing device; the method comprises the following steps:

accessing a storage device in the network-on-chip processing system through a first computing device in the network-on-chip processing system to obtain first operation data;

performing an operation on the first operation data through the first computing device to obtain a first operation result; and

sending the first operation result to a second computing device in the network-on-chip processing system.
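The three steps of the first aspect can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the classes `StorageDevice` and `ComputingDevice`, the method `process_and_forward`, and the `inbox` list are hypothetical names introduced here, and the operation is an arbitrary Python callable.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class StorageDevice:
    """Hypothetical on-chip storage modeled as a key-to-data map."""
    cells: dict = field(default_factory=dict)

    def read(self, key):
        return self.cells[key]


@dataclass
class ComputingDevice:
    """Hypothetical computing device; `storage` is None if it has no storage link."""
    name: str
    storage: Optional[StorageDevice] = None
    inbox: list = field(default_factory=list)

    def process_and_forward(self, key, op: Callable, receiver: "ComputingDevice"):
        if self.storage is None:
            raise RuntimeError(f"{self.name} is not connected to a storage device")
        data = self.storage.read(key)   # step 1: obtain the first operation data
        result = op(data)               # step 2: compute the first operation result
        receiver.inbox.append(result)   # step 3: send the result to the second device
        return result


storage = StorageDevice({"input": [1, 2, 3]})
first = ComputingDevice("first", storage)
second = ComputingDevice("second")
first.process_and_forward("input", sum, second)
print(second.inbox)  # [6]
```

In this sketch the second computing device never touches the storage device; it only receives the result produced by the first computing device.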

In a second aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps mentioned in the above method.

In a third aspect, an embodiment of the present application provides a network-on-chip data processing system, including a memory, a multi-core processor, and a computer program stored on the memory and capable of running on the multi-core processor, where the steps mentioned in the above method are implemented when the multi-core processor executes the computer program.

In a fourth aspect, embodiments of the present application provide a network-on-chip data processing apparatus for performing machine learning calculations, including:

the acquisition module is used for accessing a storage device in the network-on-chip processing system through a first computing device in the network-on-chip processing system to acquire first operation data;

the operation module is used for carrying out operation on the first operation data through the first computing device to obtain a first operation result;

and the sending module is used for sending the first operation result to a second computing device in the network-on-chip processing system.

According to the network-on-chip data processing method, the storage medium, the computer device, and the apparatus, connections are established among the plurality of computing devices arranged on the same chip, so that data can be transmitted among the plurality of computing devices. In addition, during computation, the input data and the generated intermediate results are time-division multiplexed, which reduces the energy overhead of memory access, reduces storage bandwidth blocking, and improves data read/write efficiency.

Drawings

FIG. 1 is a schematic diagram of a network-on-chip processing system 1100 in one embodiment;

FIG. 2 is a schematic diagram of a network-on-chip processing system 1200 in one embodiment;

FIG. 3 is a schematic diagram of a network-on-chip processing system 1300 in one embodiment;

FIG. 4 is a schematic diagram of a network-on-chip processing system 1400 in one embodiment;

FIG. 5a is a schematic diagram of a network-on-chip processing system 1500 in one embodiment;

FIG. 5b is a schematic diagram of a network-on-chip processing system 15000 in one embodiment;

FIG. 6 is a schematic diagram of a network-on-chip processing system 1600 in one embodiment;

FIG. 7 is a schematic diagram of a network-on-chip processing system 1700 in one embodiment;

FIG. 8 is a schematic diagram of a network-on-chip processing system 1800 in one embodiment;

FIG. 9 is a schematic diagram of a network-on-chip processing system 1900 in one embodiment;

FIG. 10a is a schematic diagram of a network-on-chip processing system 1910 according to one embodiment;

FIG. 10b is a schematic diagram of a network-on-chip processing system 19100 in one embodiment;

FIG. 11 is a schematic diagram of a network-on-chip processing system 1920 in one embodiment;

FIG. 12 is a schematic diagram of a network-on-chip processing system 1930 in one embodiment;

FIG. 13 is a schematic diagram of a computing device in one embodiment;

FIG. 14 is a schematic view of a computing device in another embodiment;

FIG. 15 is a schematic diagram of the main processing circuitry in one embodiment;

FIG. 16 is a schematic view of a computing device in another embodiment;

FIG. 17 is a schematic diagram of a computing device in another embodiment;

FIG. 18 is a schematic diagram of a tree module in one embodiment;

FIG. 19 is a schematic view showing the structure of a computing device according to another embodiment;

FIG. 20 is a schematic diagram of a computing device in another embodiment;

FIG. 21 is a schematic diagram of a computing device in another embodiment;

FIG. 22 is a schematic diagram of a combined processing apparatus in one embodiment;

FIG. 23 is a schematic diagram showing a structure of a combined processing apparatus according to another embodiment;

FIG. 24 is a schematic diagram of a card structure in one embodiment;

FIG. 25 is a flow chart of a method of processing network-on-chip data in one embodiment;

FIG. 26 is a flowchart of a method for processing network-on-chip data according to another embodiment;

FIG. 27 is a flowchart of a method for processing network-on-chip data according to another embodiment;

FIG. 28 is a flowchart of a method for processing network-on-chip data according to another embodiment;

FIG. 29 is a flowchart of a method for processing network-on-chip data according to another embodiment;

FIG. 30 is a flowchart of a network-on-chip data processing method according to another embodiment.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

The terms "first," "second," "third," and "fourth" and the like in the description and in the claims of this application and in the drawings, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

In one embodiment, a network-on-chip processing system is provided. The system comprises a storage device and a plurality of computing devices arranged on the same chip, wherein at least one computing device is connected with the storage device, and at least two computing devices are connected with each other.

A network on chip (NoC) is an on-chip communication network that integrates a large number of computing resources on a single chip and connects them. Each module on the chip accesses the network through a standardized network interface (NIC) and communicates with destination modules using shared network resources. In particular, the storage device and the plurality of computing devices being disposed on the same chip means that the storage device and the plurality of computing devices are integrated on the same chip. The processor cores and off-chip storage are connected by the NoC, which also supports communication between the processor cores.

The network-on-chip processing systems in the embodiments of the present application all implement on-chip communication based on the NoC. In addition, the network-on-chip processing system in the embodiments of the present application can use both on-chip storage and off-chip storage, that is, the operation data used during processing by the neural network processor can be stored in an on-chip storage device or in an off-chip storage device. Because the on-chip storage capacity of the network-on-chip processing system is limited, the operation data and the intermediate results generated during operation can be temporarily stored in the off-chip storage device and read from the off-chip storage into the NoC when necessary. In the embodiments of the present application, the storage devices in the network-on-chip processing system all refer to on-chip storage devices, and a computing device in the network-on-chip processing system includes a neural network processor.
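The idea of temporarily holding data off-chip when on-chip capacity runs out can be illustrated with a small sketch. The text does not say which data is moved off-chip or when, so the oldest-entry-first policy below is an assumption made only for illustration, and `TieredStore` and its methods are hypothetical names.

```python
from collections import OrderedDict


class TieredStore:
    """Hypothetical two-level store: a small on-chip level backed by off-chip storage."""

    def __init__(self, on_chip_capacity):
        self.capacity = on_chip_capacity
        self.on_chip = OrderedDict()  # insertion order doubles as age
        self.off_chip = {}

    def put(self, key, value):
        self.off_chip.pop(key, None)      # keep a single copy of each entry
        self.on_chip[key] = value
        self.on_chip.move_to_end(key)
        while len(self.on_chip) > self.capacity:
            # Assumed policy: spill the oldest on-chip entry to off-chip storage.
            old_key, old_value = self.on_chip.popitem(last=False)
            self.off_chip[old_key] = old_value

    def get(self, key):
        if key in self.on_chip:
            return self.on_chip[key]
        value = self.off_chip.pop(key)    # raises KeyError if the key is unknown
        self.put(key, value)              # read back on chip when needed
        return value


store = TieredStore(on_chip_capacity=2)
for name in ("a", "b", "c"):
    store.put(name, name.upper())
print(list(store.on_chip), list(store.off_chip))  # ['b', 'c'] ['a']
print(store.get("a"))                             # A
print(list(store.on_chip), list(store.off_chip))  # ['c', 'a'] ['b']
```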

In one embodiment, a network-on-chip processing system is provided. The system comprises a storage device and a plurality of computing devices, the plurality of computing devices comprising a first computing device and a plurality of second computing devices. The storage device and the plurality of computing devices are arranged on the same chip, wherein the first computing device is connected with the storage device, and at least one of the plurality of second computing devices is connected with the first computing device.

In one embodiment, a neural network chip is provided. The chip comprises a storage device, a plurality of computing devices, a first interconnection device and a second interconnection device, wherein at least one computing device is connected with the storage device through the first interconnection device, and the plurality of computing devices are connected with one another through the second interconnection device. Furthermore, the computing devices can read from and write to the storage device through the first interconnection device, and data can be transmitted among the computing devices through the second interconnection device.

As shown in FIG. 1, a network-on-chip processing system 1100 is provided in one embodiment. The network-on-chip processing system 1100 includes a storage device 1101, a first computing device 1102, a second computing device 1103 and a second computing device 1104, which are disposed on the same chip of the network-on-chip processing system 1100. The first computing device 1102 is connected with the storage device 1101, the second computing device 1103 is connected with the first computing device 1102, and the second computing device 1103 is also connected with the second computing device 1104. Only the first computing device 1102 has access to the storage device 1101, that is, only the first computing device 1102 can read data from and write data to the storage device 1101, and the first computing device 1102, the second computing device 1103 and the second computing device 1104 can transmit data to one another.

Specifically, when the second computing device 1104 needs to read data, the first computing device 1102 accesses the storage device 1101 and reads the data required by the second computing device 1104 from the storage device 1101; the first computing device 1102 sends the data to the second computing device 1103, and the second computing device 1103 sends the data to the second computing device 1104. Optionally, any of the first computing device 1102, the second computing device 1103 and the second computing device 1104 may be connected to the storage device 1101, as long as at least one of them is connected to the storage device 1101, which is not specifically limited herein. Optionally, the second computing device 1103 may be connected to the second computing device 1104 or to the first computing device 1102, as long as at least two of the first computing device 1102, the second computing device 1103 and the second computing device 1104 are connected to each other, which is not specifically limited herein.

As shown in FIG. 2, a network-on-chip processing system 1200 is provided in one embodiment. The network-on-chip processing system 1200 includes a storage device 1201, a first computing device 1202, a second computing device 1203 and a second computing device 1204, which are disposed on the same chip of the network-on-chip processing system 1200. The first computing device 1202 is connected with the storage device 1201, and the second computing device 1203 and the second computing device 1204 are both directly connected with the first computing device 1202. That is, the second computing device 1204 is connected with both the second computing device 1203 and the first computing device 1202, and does not need to reach the first computing device 1202 through the second computing device 1203. Only the first computing device 1202 has access to the storage device 1201, that is, only the first computing device 1202 can read data from and write data to the storage device 1201, and the first computing device 1202, the second computing device 1203 and the second computing device 1204 can transmit data to one another.

Specifically, when the second computing device 1204 needs to read data, the first computing device 1202 accesses the storage device 1201, reads the data required by the second computing device 1204 from the storage device 1201, and transmits the data directly to the second computing device 1204 without forwarding through the second computing device 1203. Optionally, any of the first computing device 1202, the second computing device 1203 and the second computing device 1204 may be connected to the storage device 1201, as long as at least one of them is connected to the storage device 1201, which is not specifically limited herein. Optionally, the second computing device 1203 may be connected to the second computing device 1204 or to the first computing device 1202, as long as at least two of the first computing device 1202, the second computing device 1203 and the second computing device 1204 are connected to each other, which is not specifically limited herein.
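The difference between the read paths of FIG. 1 and FIG. 2 can be made concrete by modeling each system as an undirected graph and searching for the path with the fewest hops. This is only a sketch: the function `read_path` and the edge lists are illustrative names, the edge lists are transcribed from the connections described above, and breadth-first search is a convenience chosen here, not a mechanism stated in the text.

```python
from collections import deque


def read_path(links, source, target):
    """Return the fewest-hop path from source to target over undirected links."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # target is unreachable


# FIG. 1: storage 1101, then 1102, 1103 and 1104 connected in a chain.
fig1 = [(1101, 1102), (1102, 1103), (1103, 1104)]
# FIG. 2: 1203 and 1204 both connect directly to 1202; 1204 also connects to 1203.
fig2 = [(1201, 1202), (1202, 1203), (1202, 1204), (1203, 1204)]

print(read_path(fig1, 1101, 1104))  # [1101, 1102, 1103, 1104]
print(read_path(fig2, 1201, 1204))  # [1201, 1202, 1204]
```

The FIG. 2 path is one hop shorter because the data no longer passes through the intermediate second computing device.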

In the network-on-chip processing system, connections are established among the plurality of computing devices arranged on the same chip, so that data can be transmitted among the plurality of computing devices. This avoids the excessive connection bandwidth overhead that would arise if all of the computing devices read data from the storage device, and at the same time improves data read/write efficiency.

In one embodiment, a network-on-chip processing system is provided. The system comprises a storage device and a plurality of computing devices arranged on the same chip, wherein each of the plurality of computing devices is connected with the storage device, and at least two computing devices are connected with each other.

As shown in FIG. 3, a network-on-chip processing system 1300 is provided in one embodiment. The network-on-chip processing system 1300 includes a storage device 1301, a computing device 1302, a computing device 1303 and a computing device 1304, which are disposed on the same chip of the network-on-chip processing system 1300. The computing device 1302, the computing device 1303 and the computing device 1304 are each connected with the storage device 1301, the computing device 1302 is connected with the computing device 1303, and the computing device 1303 is connected with the computing device 1304. The storage device 1301 is accessible to each of the computing device 1302, the computing device 1303 and the computing device 1304; data can be transmitted between the computing device 1302 and the computing device 1303, and between the computing device 1303 and the computing device 1304.

Specifically, when the computing device 1304 needs to read data, the computing device 1304 may access the storage device 1301 directly. Alternatively, the computing device 1303 may access the storage device 1301, read the data required by the computing device 1304 from the storage device 1301, and send the data to the computing device 1304. Alternatively, the computing device 1302 may access the storage device 1301 and read the data required by the computing device 1304 from the storage device 1301; the computing device 1302 then sends the data to the computing device 1303, and the computing device 1303 sends the data to the computing device 1304. Optionally, it is sufficient that at least one of the computing device 1302, the computing device 1303 and the computing device 1304 is connected to the storage device 1301, which is not specifically limited herein. Optionally, it is sufficient that at least two of the computing device 1302, the computing device 1303 and the computing device 1304 are connected to each other, which is not specifically limited herein.
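The three read options described for FIG. 3 correspond to the loop-free paths from the storage device 1301 to the computing device 1304. A short sketch can enumerate them; the function name `all_read_paths` and the edge list are illustrative and are transcribed from the connections stated for FIG. 3.

```python
def all_read_paths(links, source, target):
    """List every loop-free path from source to target, shortest first."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    paths = []

    def walk(path):
        node = path[-1]
        if node == target:
            paths.append(path)
            return
        for nxt in neighbors.get(node, []):
            if nxt not in path:  # never revisit a device on the same path
                walk(path + [nxt])

    walk([source])
    return sorted(paths, key=len)


# FIG. 3: every computing device reaches storage 1301; 1302-1303 and 1303-1304 are linked.
fig3 = [(1301, 1302), (1301, 1303), (1301, 1304), (1302, 1303), (1303, 1304)]

for path in all_read_paths(fig3, 1301, 1304):
    print(path)
# [1301, 1304]
# [1301, 1303, 1304]
# [1301, 1302, 1303, 1304]
```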

In the network-on-chip processing system, the connection is established among the plurality of computing devices arranged on the same chip, so that data required by any computing device can be transmitted among the plurality of computing devices.

As shown in FIG. 4, a network-on-chip processing system 1400 is provided in one embodiment. The network-on-chip processing system 1400 includes a storage device 1401, a computing device 1402, a computing device 1403 and a computing device 1404, which are disposed on the same chip of the network-on-chip processing system 1400. The computing device 1402, the computing device 1403 and the computing device 1404 are each connected with the storage device 1401, and the three computing devices 1402, 1403 and 1404 are connected with one another. The storage device 1401 is accessible to each of the computing device 1402, the computing device 1403 and the computing device 1404, and data can be transmitted among the three computing devices 1402, 1403 and 1404.

Specifically, when the computing device 1404 needs to read data, it may access the storage device 1401 directly. Alternatively, the computing device 1403 may access the storage device 1401, read the data required by the computing device 1404 from the storage device 1401, and send the data to the computing device 1404. Alternatively, the computing device 1402 may access the storage device 1401, read the data required by the computing device 1404 from the storage device 1401, and send the data directly to the computing device 1404 without forwarding via the computing device 1403. Optionally, it is sufficient that at least one of the computing device 1402, the computing device 1403 and the computing device 1404 is connected to the storage device 1401, which is not specifically limited herein. Optionally, it is sufficient that at least two of the computing device 1402, the computing device 1403 and the computing device 1404 are connected to each other, which is not specifically limited herein.

In the network-on-chip processing system, the direct connection is established between the plurality of computing devices arranged on the same chip, so that the data reading and writing efficiency can be improved.

In one embodiment, a network-on-chip processing system is provided. The system comprises a storage device and a plurality of computing device groups arranged on the same chip, wherein each computing device group comprises a plurality of computing devices, at least one of the plurality of computing device groups is connected with the storage device, and at least two computing device groups are connected with each other.

In one embodiment, a neural network chip is provided. The chip comprises a storage device, a plurality of computing device groups, a first interconnection device and a second interconnection device, wherein at least one of the plurality of computing device groups is connected with the storage device through the first interconnection device, and the plurality of computing device groups are connected with one another through the second interconnection device. Furthermore, the computing device groups can read from and write to the storage device through the first interconnection device, and data can be transmitted among the computing device groups through the second interconnection device.

As shown in FIG. 5a, a network-on-chip processing system 1500 is provided in one embodiment. The network-on-chip processing system 1500 includes a storage device 1501 and six computing devices (computing devices 1502 to 1507), which are disposed on the same chip of the network-on-chip processing system 1500. The six computing devices are divided into three groups: the computing devices 1502 and 1503 form a first computing device group (cluster1), the computing devices 1504 and 1505 form a second computing device group (cluster2), and the computing devices 1506 and 1507 form a third computing device group (cluster3). Cluster1 is the main computing device group, and cluster2 and cluster3 are sub-computing device groups. Of these, only cluster1 is connected to the storage device 1501, and cluster1, cluster2 and cluster3 are connected to one another. The computing device 1502 in cluster1 is connected to the storage device 1501, the computing device 1503 in cluster1 is interconnected with the computing device 1504 in cluster2, and the computing device 1505 in cluster2 is interconnected with the computing device 1507 in cluster3.

Specifically, when cluster3 needs to read data, cluster1 may access the storage device 1501 and read the data required by cluster3 from the storage device 1501; cluster1 sends the data to cluster2, and cluster2 sends the data to cluster3. The plurality of computing devices may be divided into a plurality of groups, and the number of computing devices in each group is not specifically limited; preferably, a group includes four computing devices.

Optionally, not all of the plurality of computing devices are required to be connected to the storage device 1501, as long as at least one of the computing device groups is connected to the storage device 1501, which is not specifically limited herein. Optionally, cluster1 may be interconnected with cluster2 or with cluster3, as long as at least two of the three computing device groups are interconnected, which is not specifically limited herein. Optionally, each computing device group includes at least one computing device that is connected to at least one computing device in another computing device group; that is, each computing device of cluster1 may be connected to cluster2, as long as at least one computing device in cluster1 is connected to at least one computing device in cluster2, which is not specifically limited herein. Optionally, the plurality of computing device groups are connected to one another through any computing device in the plurality of computing device groups; that is, any computing device in cluster1 may be connected to any computing device in cluster2, which is not specifically limited herein.

As shown in FIG. 5b, a network-on-chip processing system 15000 is provided in one embodiment. The network-on-chip processing system 15000 includes a storage device 15010 and six computing devices (computing devices 15020 to 15070), which are disposed on the same chip of the network-on-chip processing system 15000. The six computing devices are divided into three groups: the computing devices 15020 and 15030 form a first computing device group (cluster1), the computing devices 15040 and 15050 form a second computing device group (cluster2), and the computing devices 15060 and 15070 form a third computing device group (cluster3). Cluster1 is the main computing device group, and cluster2 and cluster3 are sub-computing device groups. Of these, only cluster1 is connected to the storage device 15010, and cluster1, cluster2 and cluster3 are connected to one another. The computing device 15020 in cluster1 is connected to the storage device 15010, the computing device 15030 in cluster1 is interconnected with the computing device 15040 in cluster2, the computing device 15050 in cluster2 is interconnected with the computing device 15070 in cluster3, and the computing device 15060 in cluster3 is interconnected with the computing device 15020 in cluster1.

Specifically, when cluster3 needs to read data, cluster1 may access the storage device 15010, read the data required by cluster3 from the storage device 15010, and transmit the data directly to cluster3. The plurality of computing devices may be divided into a plurality of groups, and the number of computing devices in each group is not specifically limited; preferably, a group includes four computing devices.
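Whether two groups can exchange data directly follows from the device-level links: two groups are directly connected when any device in one is linked to any device in the other. The sketch below derives the group-level links of FIG. 5b from its device-level links. The function `cluster_links` and the `membership` table are hypothetical names, and the two-link variant only mimics the connection pattern of FIG. 5a using the reference numerals of FIG. 5b.

```python
def cluster_links(membership, device_links):
    """Collapse device-to-device links into the set of group-to-group links."""
    groups = set()
    for a, b in device_links:
        group_a, group_b = membership[a], membership[b]
        if group_a != group_b:  # links inside one group are not inter-group links
            groups.add(frozenset((group_a, group_b)))
    return groups


membership = {
    15020: "cluster1", 15030: "cluster1",
    15040: "cluster2", 15050: "cluster2",
    15060: "cluster3", 15070: "cluster3",
}
# Device-level links between groups as described for FIG. 5b.
fig5b_links = [(15030, 15040), (15050, 15070), (15060, 15020)]

direct = frozenset(("cluster1", "cluster3"))
print(direct in cluster_links(membership, fig5b_links))      # True
# Without the 15060-15020 link (the FIG. 5a pattern), cluster1 reaches cluster3 only via cluster2.
print(direct in cluster_links(membership, fig5b_links[:2]))  # False
```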

Optionally, not all of the plurality of computing devices are required to be connected to the storage device 15010, as long as at least one of the computing device groups is connected to the storage device 15010, which is not specifically limited herein. Optionally, cluster1 may be interconnected with cluster2 or with cluster3, as long as at least two of the three computing device groups are interconnected, which is not specifically limited herein. Optionally, each computing device group includes at least one computing device that is connected to at least one computing device in another computing device group; that is, each computing device of cluster1 may be connected to cluster2, as long as at least one computing device in cluster1 is connected to at least one computing device in cluster2, which is not specifically limited herein. Optionally, the plurality of computing device groups are connected to one another through any computing device in the plurality of computing device groups; that is, any computing device in cluster1 may be connected to any computing device in cluster2, which is not specifically limited herein.

In the network-on-chip processing system, connections are established among the plurality of computing device groups arranged on the same chip, so that inter-group communication can be realized. Through inter-group data transmission, the system can reduce the number of computing devices that access the storage device interface at the same time and reduce the energy overhead of memory access. Meanwhile, the plurality of computing device groups arranged on the same chip are connected in multiple connection modes to establish inter-group communication, and multiple communication channels are established among the plurality of computing devices, so that an optimal channel can be selected for data transmission according to the current network congestion condition, thereby saving energy and improving data processing efficiency.
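The text does not state how the optimal channel is chosen. One plausible reading, sketched below purely as an assumption, is to assign each channel a numeric congestion cost and pick the path with the lowest total cost using Dijkstra's shortest-path algorithm. The function `least_congested_path` and the cost values are hypothetical.

```python
import heapq


def least_congested_path(channels, source, target):
    """channels maps an undirected pair (a, b) to a congestion cost.

    Returns (total_cost, path) for the cheapest path, or None if unreachable.
    """
    neighbors = {}
    for (a, b), cost in channels.items():
        neighbors.setdefault(a, []).append((b, cost))
        neighbors.setdefault(b, []).append((a, cost))
    heap = [(0, [source])]
    settled = set()
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, step in neighbors.get(node, []):
            if nxt not in settled:
                heapq.heappush(heap, (cost + step, path + [nxt]))
    return None


# Assumed congestion costs: the direct cluster1-cluster3 channel is currently busy.
busy = {("cluster1", "cluster2"): 1, ("cluster2", "cluster3"): 1, ("cluster1", "cluster3"): 5}
print(least_congested_path(busy, "cluster1", "cluster3"))
# (2, ['cluster1', 'cluster2', 'cluster3'])

# Once the direct channel is idle again, it becomes the cheapest route.
idle = dict(busy)
idle[("cluster1", "cluster3")] = 1
print(least_congested_path(idle, "cluster1", "cluster3"))
# (1, ['cluster1', 'cluster3'])
```

Any other cost model or selection rule would fit the description equally well; the point of the sketch is only that multiple channels make such a choice possible.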

In one embodiment, a network-on-chip processing system is provided, the system comprising: the storage device and the plurality of computing device groups are arranged on the same piece, each computing device group comprises a plurality of computing devices, at least one computing device group in the plurality of computing device groups is connected with the storage device, and the plurality of computing device groups are connected with each other.

As shown in fig. 6, a network-on-chip processing system 1600 provided for one embodiment, the network-on-chip processing system 1600 includes: the storage device 1601 and six computing devices (computing devices 1602 to 1607) are disposed on the same chip of the on-chip network processing system 1600, the six computing devices are divided into three groups, the computing devices 1602 and 1603 are a first computing device group cluster1, the computing devices 1604 and 1605 are a second computing device group cluster2, and the computing devices 1606 and 1607 are a third computing device group cluster3, wherein the cluster1, cluster2, and cluster3 are all connected to the storage device 1601, and the clusters 1 and 2 are interconnected, and the clusters 2 and 3 are interconnected. Computing devices 1602 through 1607 are each connected to storage device 1601, computing device 1603 in cluster1 is interconnected to computing device 1604 in cluster2, and computing device 1604 in cluster2 is interconnected to computing device 1607 in cluster 3.

Specifically, when cluster3 needs to read data, cluster2 may access the storage device 1601, read the data required by cluster3 from the storage device 1601, and send the data to cluster3; alternatively, cluster1 may access the storage device 1601, read the data required by cluster3, and send the data to cluster2, which then forwards the data to cluster3. The plurality of computing devices may be divided into a plurality of groups; the number of computing devices in each group is not particularly limited, and preferably a group includes four computing devices.
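The read paths available to cluster3 in fig. 6 can be illustrated with a small graph model. The following Python sketch is an illustration only: the adjacency table mirrors the group-level links described above, while the function name `read_paths` and the string labels are assumptions introduced here and are not part of the embodiment.

```python
# Group-level links of fig. 6 (labels are illustrative assumptions).
# Every group is connected to the storage device; the groups form a chain.
LINKS = {
    "storage1601": {"cluster1", "cluster2", "cluster3"},
    "cluster1": {"storage1601", "cluster2"},
    "cluster2": {"storage1601", "cluster1", "cluster3"},
    "cluster3": {"storage1601", "cluster2"},
}

def read_paths(links, storage, target):
    """Enumerate all loop-free paths from the storage device to the target group."""
    paths = []
    stack = [[storage]]
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == target:
            paths.append(path)
            continue
        for nxt in sorted(links[node]):
            if nxt not in path:
                stack.append(path + [nxt])
    return sorted(paths, key=len)  # shortest path first

paths_to_cluster3 = read_paths(LINKS, "storage1601", "cluster3")
for p in paths_to_cluster3:
    print(" -> ".join(p))
```

Besides the direct access, the enumeration contains the relay through cluster2 and the two-step relay through cluster1 and cluster2 that the paragraph above describes.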

Optionally, not all of the plurality of computing devices are required to be connected to the storage device 1601, so long as at least one of the computing device groups is connected to the storage device 1601; no particular limitation is made herein. Optionally, each computing device of cluster1 may establish a connection with cluster2 and/or cluster3, so long as at least one computing device of cluster1 is connected with at least one computing device of cluster2 and/or cluster3; this is not specifically limited herein. Optionally, any one of the computing devices of cluster1 may be interconnected with any one of the computing devices of cluster2 and/or cluster3, which is not specifically limited herein.

In the above network-on-chip processing system, connections are established among the plurality of computing device groups arranged on the same chip, so that data required by any computing device group can be transmitted among the plurality of computing device groups.

In one embodiment, a network-on-chip processing system is provided, the system comprising: a storage device and a plurality of computing device groups, wherein the storage device and the plurality of computing device groups are arranged on the same chip, each computing device group comprises a plurality of computing devices, at least one computing device group of the plurality of computing device groups is connected with the storage device, and any two computing device groups of the plurality of computing device groups are directly connected.

As shown in fig. 7, one embodiment provides a network-on-chip processing system 1700, which includes a storage device 1701 and six computing devices (computing devices 1702 to 1707), all disposed on the same chip of the network-on-chip processing system 1700. The six computing devices are divided into three groups: computing devices 1702 and 1703 form a first computing device group cluster1, computing devices 1704 and 1705 form a second computing device group cluster2, and computing devices 1706 and 1707 form a third computing device group cluster3. Cluster1, cluster2, and cluster3 are all connected to the storage device 1701, and the three computing device groups are connected to one another. Specifically, computing devices 1702 to 1707 are each connected to the storage device 1701, computing device 1703 in cluster1 is connected to computing device 1704 in cluster2, computing device 1704 in cluster2 is connected to computing device 1707 in cluster3, and computing device 1702 in cluster1 is connected to computing device 1706 in cluster3.

Specifically, when cluster3 needs to read data, cluster2 may access the storage device 1701, read the data required by cluster3 from the storage device 1701, and send the data to cluster3; alternatively, cluster1 may access the storage device 1701, read the data required by cluster3, and transmit the data directly to cluster3. The plurality of computing devices may be divided into a plurality of groups; the number of computing devices in each group is not particularly limited, and preferably a group includes four computing devices.
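The benefit of directly connecting any two groups can be seen by counting inter-group hops. The sketch below is an illustration under assumed labels: it compares a chain arrangement of three groups with the fully connected arrangement of fig. 7 using a breadth-first search. The function name `hops` and the dictionaries `chain` and `mesh` are hypothetical names introduced for this example.

```python
from collections import deque

def hops(links, src, dst):
    """Breadth-first search: number of inter-group links on the shortest route."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # dst is unreachable from src

# Chain arrangement: cluster1 reaches cluster3 only through cluster2.
chain = {
    "cluster1": {"cluster2"},
    "cluster2": {"cluster1", "cluster3"},
    "cluster3": {"cluster2"},
}
# Arrangement of fig. 7: any two groups are directly connected.
mesh = {
    "cluster1": {"cluster2", "cluster3"},
    "cluster2": {"cluster1", "cluster3"},
    "cluster3": {"cluster1", "cluster2"},
}

chain_hops = hops(chain, "cluster1", "cluster3")
mesh_hops = hops(mesh, "cluster1", "cluster3")
print(chain_hops, mesh_hops)
```

With the direct connection, data read by cluster1 reaches cluster3 in a single inter-group transfer instead of being relayed through cluster2.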

Optionally, not all of the plurality of computing devices are required to be connected to the storage device 1701, so long as at least one of the computing device groups is connected to the storage device 1701; no specific limitation is made herein. Optionally, each computing device of cluster1 may establish a connection with cluster2 and cluster3, so long as at least one computing device of cluster1 is connected with at least one computing device of cluster2 and at least one computing device of cluster3; this is not specifically limited herein. Optionally, any one of the computing devices in cluster1 may be interconnected with any one of the computing devices in cluster2 and cluster3, which is not specifically limited herein.

In the network-on-chip processing system, the direct connection is established between the plurality of computing device groups arranged on the same chip, so that the data reading and writing efficiency can be improved.

In one embodiment, a network-on-chip processing system is provided, the system comprising: a storage device and a plurality of computing device groups, wherein the storage device and the plurality of computing device groups are arranged on the same chip, each computing device group comprises a plurality of computing devices, at least one computing device group of the plurality of computing device groups is connected with the storage device, at least two computing device groups are connected with each other, and the plurality of computing devices in each computing device group are connected with one another.

As shown in fig. 8, one embodiment provides a network-on-chip processing system 1800, which includes a storage device 1801 and six computing devices (computing devices 1802 to 1807), all disposed on the same chip of the network-on-chip processing system 1800. The six computing devices are divided into two groups: computing devices 1802, 1803 and 1804 form a first computing device group cluster1, and computing devices 1805, 1806 and 1807 form a second computing device group cluster2. Cluster1 and cluster2 are both connected to the storage device 1801, cluster1 and cluster2 are connected to each other, the three computing devices in cluster1 are connected to one another, and the three computing devices in cluster2 are connected to one another. Specifically, computing devices 1802 to 1807 are each connected to the storage device 1801, computing device 1802 in cluster1 is connected to computing device 1805 in cluster2, computing device 1803 is connected to computing devices 1802 and 1804, and computing device 1806 is connected to computing devices 1805 and 1807. For the connection manner between the plurality of computing devices within each computing device group, reference may be made to the connection manners of network-on-chip processing systems 1100 through 1400, which are not described again here.

Specifically, when cluster2 needs to read data, it may access the storage device 1801 directly; alternatively, cluster1 may access the storage device 1801, read the data required by cluster2, and send the data to cluster2; the second computing device group may also transmit the data within the group. When cluster2 needs to read data, computing devices 1805, 1806 and 1807 in cluster2 may access the storage device 1801 simultaneously, each reading a portion of the data required by cluster2, and these portions may then be transmitted within cluster2. The plurality of computing devices may be divided into a plurality of groups; the number of computing devices in each group is not particularly limited, and preferably a group includes four computing devices.
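The cooperative read described above, in which each computing device of a group reads a portion of the data and the portions are then exchanged within the group, can be sketched as follows. This is a minimal illustration: the contiguous, near-equal split and the names `partition` and `cluster_read` are assumptions introduced here, since the embodiment does not prescribe how the data is divided.

```python
def partition(data, n_parts):
    """Split data into n_parts contiguous slices of near-equal size."""
    base, extra = divmod(len(data), n_parts)
    slices, start = [], 0
    for i in range(n_parts):
        end = start + base + (1 if i < extra else 0)
        slices.append(data[start:end])
        start = end
    return slices

def cluster_read(storage, devices):
    """Each device reads one slice; the slices are then shared within the group."""
    # Step 1: every device reads only its own portion from the storage device.
    parts = dict(zip(devices, partition(storage, len(devices))))
    # Step 2: intra-group transmission gives every device the complete data.
    full = [x for d in devices for x in parts[d]]
    return parts, {d: list(full) for d in devices}

storage_1801 = list(range(10))  # placeholder contents
parts, assembled = cluster_read(storage_1801, ["dev1805", "dev1806", "dev1807"])
print(parts)
```

In this sketch each device touches the storage interface for only about one third of the data, and the remaining two thirds arrive over the intra-group connections.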

Optionally, not all of the plurality of computing devices are required to be connected to the storage device 1801, so long as at least one of the two computing device groups is connected to the storage device 1801; this is not specifically limited herein. Optionally, each computing device of cluster1 may establish a connection with cluster2, so long as at least one computing device of cluster1 is connected with at least one computing device of cluster2; this is not specifically limited herein. Optionally, any one of the computing devices in cluster1 may be interconnected with any one of the computing devices in cluster2, which is not specifically limited herein.

In the above network-on-chip processing system, connections are established among the plurality of computing device groups arranged on the same chip, and connections are also established among the plurality of computing devices within each computing device group, so that the computing devices can communicate both within a group and between groups. The system can thereby reduce the energy cost of memory access and improve data reading efficiency.

In one embodiment, a network-on-chip processing system is provided, the system comprising: a plurality of interconnected network-on-chip processing modules disposed on the same chip, wherein each network-on-chip processing module includes at least one storage device and a plurality of computing devices; in each network-on-chip processing module, at least one computing device is connected with at least one storage device inside that module, and at least two of the plurality of computing devices are connected with each other.

In one embodiment, a neural network chip is provided, the chip comprising a plurality of interconnected network-on-chip processing modules, each network-on-chip processing module comprising: at least one storage device, a plurality of computing devices, a first interconnection device and a second interconnection device, wherein in each network-on-chip processing module, at least one computing device is connected with at least one storage device in that module through the first interconnection device, and the plurality of computing devices are connected through the second interconnection device. Furthermore, a computing device can read from and write to the storage device in the network-on-chip processing module where it is located through the first interconnection device, and data can be transmitted among the computing devices through the second interconnection device.

As shown in fig. 9, for one embodiment of a network-on-chip processing system 1900, the network-on-chip processing system 1900 includes four network-on-chip processing modules connected to each other, where the four network-on-chip processing modules are disposed on the same chip of the network-on-chip processing system 1900, and each network-on-chip processing module includes: one storage device 1901 and four computing devices (computing devices 1902-1905), wherein in each network-on-chip processing module, computing device 1902 is connected to storage device 1901 inside its network-on-chip processing module and the four computing devices inside each network-on-chip processing module are connected to each other.

Specifically, the data to be processed by each network-on-chip processing module is stored in a storage device inside the network-on-chip processing module, that is, a plurality of computing devices in each network-on-chip processing module can only access the storage device inside the network-on-chip processing module, and can only read and write data from the storage device inside the network-on-chip processing module.

Optionally, the number of storage devices in each network-on-chip processing module is not limited to one, and may be two, three or more, which is not particularly limited herein; preferably there are four. Optionally, in each network-on-chip processing module, the plurality of computing devices are connected to each other to form a computing device network; for the connection manner between the plurality of computing devices in each network-on-chip processing module, reference may be made to the connection manners of network-on-chip processing systems 1100 through 1400, which are not described again here. Optionally, not all of the plurality of computing devices in each network-on-chip processing module are required to be connected to the storage device 1901, so long as at least one computing device in each network-on-chip processing module is connected to the storage device 1901, which is not particularly limited herein.

Optionally, each computing device in each network-on-chip processing module may establish a connection with another network-on-chip processing module, which is not limited herein. Optionally, the plurality of network-on-chip processing modules are connected with each other through any computing device in each network-on-chip processing module, that is, any computing device in one network-on-chip processing module may be connected with any computing device in another network-on-chip processing module, which is not specifically limited herein.

In the above network-on-chip processing system, connections are established among the plurality of network-on-chip processing modules arranged on the same chip, and connections are also established among the plurality of computing devices within each network-on-chip processing module, so that the computing devices can communicate both within a module and between modules. The system can thereby reduce the energy cost of memory access and improve data reading efficiency. In addition, the plurality of network-on-chip processing modules on the same chip are connected in multiple connection modes to establish inter-module communication, and multiple communication channels are established among the plurality of computing devices, so that an optimal channel can be selected for data transmission according to the current network congestion condition, thereby saving energy and improving data processing efficiency.
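The selection of a channel according to the current network congestion condition can be sketched as a minimum-cost choice among candidate routes. The embodiment does not define how congestion is measured or how an optimal channel is determined; the per-link load values, the summed-load cost, and the name `pick_channel` below are all assumptions made for illustration.

```python
def pick_channel(channels, congestion):
    """Return the candidate route whose summed link congestion is lowest.

    channels   -- list of routes, each a list of node labels
    congestion -- dict mapping a (node, node) link to an assumed load estimate
    """
    def cost(route):
        return sum(congestion[(a, b)] for a, b in zip(route, route[1:]))
    return min(channels, key=cost)

candidates = [
    ["moduleA", "moduleB"],             # direct link
    ["moduleA", "moduleC", "moduleB"],  # detour through a third module
]
load = {
    ("moduleA", "moduleB"): 9,  # heavily loaded direct link (hypothetical value)
    ("moduleA", "moduleC"): 2,
    ("moduleC", "moduleB"): 3,
}
best = pick_channel(candidates, load)
print(best)
```

Here the detour is chosen because its total assumed load (5) is lower than that of the congested direct link (9); other cost definitions, such as the maximum load along a route, would fit the same structure.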

In one embodiment, a network-on-chip processing system is provided, the system comprising: a plurality of interconnected network-on-chip processing modules disposed on the same chip, wherein each network-on-chip processing module comprises a plurality of storage devices and a plurality of computing devices, at least one computing device in each network-on-chip processing module is connected with the plurality of storage devices inside that module, and at least two of the plurality of computing devices are connected with each other.

As shown in fig. 10a, for one of the embodiments of the network-on-chip processing system 1910, the network-on-chip processing system 1910 includes four network-on-chip processing modules connected to each other, where the four network-on-chip processing modules are disposed on the same chip of the network-on-chip processing system 1910, and each network-on-chip processing module includes: storage 1911, storage 1916, and four computing devices (computing devices 1912-1915), wherein in each network-on-chip processing module, computing device 1912 is connected to storage 1911 and storage 1916 inside its network-on-chip processing module, and four computing devices inside each network-on-chip processing module are connected to each other.

Specifically, the data to be processed by each network-on-chip processing module is stored in a storage device inside that module; that is, the plurality of computing devices in each network-on-chip processing module can only access the storage devices inside that module, and can only read data from and write data to those storage devices. At least one computing device in each network-on-chip processing module establishes a connection with all storage devices in that module, that is, the computing device in each network-on-chip processing module has access to all storage devices in that module. The number of storage devices in each network-on-chip processing module is not limited to two, and may be three, four or more, which is not particularly limited herein; preferably there are four.

Specifically, computing devices in each network-on-chip processing module have priority to access adjacent storage devices. The adjacent storage device refers to a storage device with the shortest communication distance among a plurality of storage devices connected with the computing device, that is, the storage device with the shortest communication distance has higher access priority than other storage devices.
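The priority rule can be sketched as an ordering of the connected storage devices by communication distance. In the sketch below the distances, the storage contents, and the names `access_order` and `read` are hypothetical values chosen for illustration; the embodiment does not state which of storage devices 1911 and 1916 is nearer to a given computing device.

```python
def access_order(distances):
    """Order storage devices by communication distance, nearest first."""
    return sorted(distances, key=distances.get)

def read(address, storages, distances):
    """Try storage devices in priority order; return (device, value)."""
    for name in access_order(distances):
        if address in storages[name]:
            return name, storages[name][address]
    raise KeyError(address)

# Hypothetical contents and distances as seen from one computing device.
storages = {"storage1911": {0: "a"}, "storage1916": {0: "a2", 1: "b"}}
distances = {"storage1911": 1, "storage1916": 3}

hit_near = read(0, storages, distances)  # served by the adjacent device
hit_far = read(1, storages, distances)   # falls back to the farther device
print(hit_near, hit_far)
```

The farther storage device is consulted only when the adjacent one cannot serve the request, which reflects the higher access priority of the storage device with the shortest communication distance.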

Optionally, in each network-on-chip processing module, the plurality of computing devices are connected to each other to form a computing device network; for the connection manner between the plurality of computing devices in each network-on-chip processing module, reference may be made to the connection manners of network-on-chip processing systems 1100 through 1400, which are not described again here. Optionally, not all of the plurality of computing devices in each network-on-chip processing module are required to be connected to the storage device 1911, so long as at least one computing device in each network-on-chip processing module is connected to the storage device 1911, which is not specifically limited herein.

In the above network-on-chip processing system, each computing device can access all storage devices in its network-on-chip processing module, and multiple communication channels are available for data transmission, which improves data reading and writing efficiency; each computing device in the system accesses the adjacent storage device preferentially, which saves memory access overhead while retaining a degree of flexibility.

In one embodiment, such as the network-on-chip processing system 19100 shown in fig. 10b, the data required to be processed by each network-on-chip processing module is stored in a storage device inside the network-on-chip processing module, that is, a plurality of computing devices in each network-on-chip processing module can only access the storage device inside the network-on-chip processing module, and can only read and write data from the storage device inside the network-on-chip processing module. At least one computing device in each network-on-chip processing module establishes a connection with all storage devices in the network-on-chip processing module, that is, the computing device in each network-on-chip processing module has access to all storage devices in the network-on-chip processing module. The number of the storage devices in each network processing module is not limited to two, but may be three, four or more, and is not particularly limited herein, and preferably four.

Specifically, each computing device in each network-on-chip processing module is connected to a storage device at a first communication distance, where the first communication distance is the shortest communication distance; that is, each computing device in each network-on-chip processing module can only access its adjacent storage device, namely the storage device with the shortest communication distance. For example, computing device 19120 can only access the adjacent storage device 19110 and cannot access storage device 19160, while computing device 19130 can only access the adjacent storage device 19160 and cannot access storage device 19110. When the data that computing device 19120 needs to read is stored in storage device 19160, the data must first be read from storage device 19160 by computing device 19130 and then transmitted from computing device 19130 to computing device 19120.
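The relayed read in this example can be sketched as follows. The adjacency between devices and storage devices follows the example above, while the storage contents, the key names, and the function `read` are assumptions made purely for illustration.

```python
# Each computing device may access only its adjacent storage device (fig. 10b).
ADJACENT = {"dev19120": "storage19110", "dev19130": "storage19160"}
# Placeholder contents; the keys "x" and "y" are hypothetical.
STORAGE = {"storage19110": {"x": 1}, "storage19160": {"y": 2}}

def read(device, key):
    """Return (value, route).

    If the key is not held by the adjacent storage device, a peer whose
    adjacent storage device holds it reads the data and forwards it.
    """
    own = ADJACENT[device]
    if key in STORAGE[own]:
        return STORAGE[own][key], [own, device]
    for peer, store in ADJACENT.items():
        if peer != device and key in STORAGE[store]:
            return STORAGE[store][key], [store, peer, device]
    raise KeyError(key)

value, route = read("dev19120", "y")
print(value, " -> ".join(route))
```

The returned route shows the data leaving storage device 19160, passing through computing device 19130, and arriving at computing device 19120, matching the sequence described in the example.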

Optionally, in each network-on-chip processing module, the plurality of computing devices are connected to each other to form a computing device network; for the connection manner between the plurality of computing devices in each network-on-chip processing module, reference may be made to the connection manners of network-on-chip processing systems 1100 through 1400, which are not described again here. Optionally, not all of the plurality of computing devices in each network-on-chip processing module are required to be connected to the storage device 19110, so long as at least one computing device in each network-on-chip processing module is connected to the storage device 19110, which is not specifically limited herein.

In the above network-on-chip processing system, each computing device can access all storage devices in its network-on-chip processing module, and multiple communication channels are available for data transmission, which improves data reading and writing efficiency; each computing device in the system can only access the adjacent storage device, which saves memory access overhead to the greatest extent.

In one embodiment, a network-on-chip processing system is provided, the system comprising: a plurality of network-on-chip processing modules disposed on the same chip, any two of which are directly connected, wherein each network-on-chip processing module includes at least one storage device and a plurality of computing devices; in each network-on-chip processing module, at least one computing device is connected with at least one storage device inside that module, and at least two of the plurality of computing devices are connected with each other.

As shown in fig. 11, one embodiment provides a network-on-chip processing system 1920, which includes four network-on-chip processing modules disposed on the same chip of the network-on-chip processing system 1920, any two of the four network-on-chip processing modules being directly connected to each other. Each network-on-chip processing module includes one storage device 1921 and four computing devices (computing devices 1922 to 1925), wherein in each network-on-chip processing module, computing device 1922 is connected to the storage device 1921 inside its network-on-chip processing module, and the four computing devices inside each network-on-chip processing module are connected to each other.

Optionally, the number of storage devices in each network-on-chip processing module is not limited to one, and may be two, three or more, which is not particularly limited herein; preferably there are four. Optionally, in each network-on-chip processing module, the plurality of computing devices are connected to each other to form a computing device network; for the connection manner between the plurality of computing devices in each network-on-chip processing module, reference may be made to the connection manners of network-on-chip processing systems 1100 through 1400, which are not described again here. Optionally, not all of the plurality of computing devices in each network-on-chip processing module are required to be connected to the storage device 1921, so long as at least one computing device in each network-on-chip processing module is connected to the storage device 1921, which is not specifically limited herein.

In the network-on-chip processing system, the connection is established among the plurality of network-on-chip processing modules arranged on the same chip, and meanwhile, the connection is established among the plurality of computing devices in each network-on-chip processing module, so that the intra-module communication among the plurality of computing devices can be realized, and meanwhile, the inter-module direct communication can be realized between any two network-on-chip processing modules.
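When any two modules are directly connected and one link per pair is assumed, the inter-module links are exactly the unordered pairs of modules, so n modules require n(n-1)/2 links. The short sketch below enumerates these pairs for the four modules of fig. 11; the module labels are hypothetical.

```python
from itertools import combinations

# Hypothetical labels for the four network-on-chip processing modules.
modules = ["module0", "module1", "module2", "module3"]

# One direct link per unordered pair of modules (an assumption of this sketch).
direct_links = list(combinations(modules, 2))

for a, b in direct_links:
    print(f"{a} <-> {b}")
print(len(direct_links), "direct links")
```

Because every pair has its own link, no inter-module transfer needs to be relayed through a third module.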

In one embodiment, a network-on-chip processing system is provided, the system comprising: a plurality of network-on-chip processing modules disposed on the same chip, any two of which are directly connected, wherein each network-on-chip processing module comprises a plurality of storage devices and a plurality of computing devices, at least one computing device in each network-on-chip processing module is connected with the plurality of storage devices inside that module, and at least two of the plurality of computing devices are connected with each other.

As shown in fig. 12, one embodiment provides a network-on-chip processing system 1930, which includes four network-on-chip processing modules disposed on the same chip of the network-on-chip processing system 1930, any two of the four network-on-chip processing modules being directly connected to each other. Each network-on-chip processing module includes storage device 1931, storage device 1936, and four computing devices (computing devices 1932 to 1935), wherein in each network-on-chip processing module, computing device 1932 is connected to storage device 1931 and storage device 1936 inside its network-on-chip processing module, and the four computing devices inside each network-on-chip processing module are connected to each other.

Specifically, the data to be processed by each network-on-chip processing module is stored in a storage device inside the network-on-chip processing module, that is, a plurality of computing devices in each network-on-chip processing module can only access the storage device inside the network-on-chip processing module, and can only read and write data from the storage device inside the network-on-chip processing module. Computing devices in each network-on-chip processing module have priority to access the adjacent storage devices.

Optionally, the number of storage devices in each network-on-chip processing module is not limited to two, and may be three, four or more, which is not particularly limited herein; preferably there are four. In particular, at least one computing device in each network-on-chip processing module establishes a connection with all storage devices in that module, that is, the computing device in each network-on-chip processing module has access to all storage devices in that module.

Optionally, in each network-on-chip processing module, the plurality of computing devices are connected to each other to form a computing device network; for the connection manner between the plurality of computing devices in each network-on-chip processing module, reference may be made to the connection manners of network-on-chip processing systems 1100 through 1400, which are not described again here. Optionally, not all of the plurality of computing devices in each network-on-chip processing module are required to be connected to the storage device 1931, so long as at least one computing device in each network-on-chip processing module is connected to the storage device 1931, which is not specifically limited herein.

In the above network-on-chip processing system, each computing device can access all storage devices in its network-on-chip processing module, and any two network-on-chip processing modules can communicate directly, so the system provides multiple communication channels for data transmission and improves data reading and writing efficiency; each computing device in the system accesses the adjacent storage device preferentially, which saves memory access overhead while retaining a degree of flexibility.

In one embodiment, as shown in FIG. 13, a computing device of a network-on-chip processing system may be used to perform machine learning calculations, the computing device comprising: a controller unit 11 and an arithmetic unit 12, wherein the controller unit 11 and the arithmetic unit 12 are connected, and the arithmetic unit 12 includes: a master processing circuit and a plurality of slave processing circuits;

a controller unit 11 for acquiring input data and calculation instructions; in an alternative, the input data and the calculation instructions may be obtained through a data input/output unit, where the data input/output unit may be one or more data I/O interfaces or I/O pins.

The above-described calculation instructions include, but are not limited to: forward operation instructions or backward training instructions, or other neural network calculation instructions such as convolution calculation instructions; the present embodiments do not limit the specific form of the above-described calculation instructions.

The controller unit 11 is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the master processing circuit;

a master processing circuit 101 for performing preprocessing on the input data and exchanging data and operation instructions with the plurality of slave processing circuits;

A plurality of slave processing circuits 102, configured to execute intermediate operations in parallel according to the data and the operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;

the master processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.

In the technical scheme provided by the present application, the operation unit is arranged in a one-master multi-slave structure. For a calculation instruction of a forward operation, the data can be split according to the calculation instruction, so that the part with the larger amount of calculation can be computed in parallel by the plurality of slave processing circuits, which increases the operation speed, saves operation time, and in turn reduces power consumption.
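The one-master multi-slave flow (preprocessing by the master, parallel intermediate operations by the slaves, and subsequent processing by the master) can be sketched in software as follows. This is an analogy only, not the circuit itself: the round-robin split, the multiply-accumulate used as the intermediate operation, the use of a thread pool, and all function names are assumptions introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def master_preprocess(data, n_slaves):
    """Master: split the input into one chunk per slave (round-robin)."""
    return [data[i::n_slaves] for i in range(n_slaves)]

def slave_compute(chunk, weight):
    """Slave: an intermediate operation, here a partial multiply-accumulate."""
    return sum(x * weight for x in chunk)

def master_postprocess(partials):
    """Master: combine the intermediate results into the final result."""
    return sum(partials)

def run(data, weight, n_slaves=4):
    chunks = master_preprocess(data, n_slaves)
    # The slaves work on their chunks concurrently.
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        partials = list(pool.map(lambda c: slave_compute(c, weight), chunks))
    return master_postprocess(partials)

result = run(list(range(8)), weight=2)
print(result)
```

The heavy part of the work (the per-element multiplications) is distributed across the slaves, while the master only splits the input and reduces the partial results.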

Optionally, the computing device may further include a storage unit 10 and a direct memory access unit 150. The storage unit 10 may include one or any combination of a register and a cache; specifically, the cache is used for storing the calculation instruction, the register is used for storing the input data and scalars, and the cache is a scratch pad cache. The direct memory access unit 150 is used to read data from or store data to the storage unit 10.

Optionally, the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a store queue unit 113;

an instruction storage unit 110, configured to store a calculation instruction associated with the artificial neural network operation;

the instruction processing unit 111 is configured to parse the calculation instruction to obtain a plurality of operation instructions;

a store queue unit 113 for storing an instruction queue, the instruction queue comprising: a plurality of operation instructions and/or calculation instructions to be executed in the order of the queue.
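The cooperation between the instruction processing unit and the store queue unit can be sketched as a parse step followed by a first-in first-out queue. The semicolon-separated text format of the calculation instruction and the names `parse` and `StoreQueue` are purely illustrative assumptions; the embodiment does not specify an instruction encoding at this point.

```python
from collections import deque

def parse(calc_instruction):
    """Instruction processing: split a calculation instruction into operation
    instructions. The 'op1; op2' text format is an illustrative assumption."""
    return [op.strip() for op in calc_instruction.split(";") if op.strip()]

class StoreQueue:
    """Holds instructions and releases them in queue (FIFO) order."""

    def __init__(self):
        self._queue = deque()

    def push_all(self, instructions):
        self._queue.extend(instructions)

    def pop(self):
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)

queue = StoreQueue()
queue.push_all(parse("load a; mul a b; store c"))
issued = [queue.pop() for _ in range(len(queue))]
print(issued)
```

Instructions leave the queue in the same order in which they were stored, which corresponds to execution in the order of the queue.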

For example, in an alternative embodiment, the master processing circuit may also include a controller unit, which may include a master instruction processing unit specifically configured to decode instructions into microinstructions. In a further alternative, the slave processing circuit may also include another controller unit, which includes a slave instruction processing unit specifically configured to receive and process microinstructions. A microinstruction may be a next-level instruction of an instruction; it may be obtained by splitting or decoding the instruction, and may be further decoded into control signals for each component, unit, or processing circuit.

In one alternative, the structure of the calculation instructions may be as shown in Table 1.

TABLE 1

Operation code | Register or immediate | Register/immediate | …

The ellipses in the table above represent that multiple registers or immediate numbers may be included.

In another alternative, the computing instructions may include: one or more operation domains and an operation code. The computing instructions may include neural network computing instructions. Taking a neural network operation instruction as an example, as shown in table 2, a register number 0, a register number 1, a register number 2, a register number 3, and a register number 4 may be operation domains. Wherein each of register number 0, register number 1, register number 2, register number 3, register number 4 may be a number of one or more registers.

TABLE 2

The register may be an off-chip memory or, in practical applications, an on-chip memory, and is used to store data. The data may specifically be n-dimensional data, where n is an integer greater than or equal to 1: for example, when n=1 the data is 1-dimensional, i.e., a vector; when n=2 the data is 2-dimensional, i.e., a matrix; and when n is 3 or more the data is a multidimensional tensor.

Optionally, the controller unit may further include:

The dependency relationship processing unit 112 is configured to determine, when a plurality of operation instructions are included, whether a first operation instruction has an association relationship with a zeroth operation instruction before the first operation instruction, if the first operation instruction has an association relationship with the zeroth operation instruction, cache the first operation instruction in the instruction storage unit, and extract the first operation instruction from the instruction storage unit and transmit the first operation instruction to the operation unit after the execution of the zeroth operation instruction is completed;

the determining whether the association relationship exists between the first operation instruction and the zeroth operation instruction before the first operation instruction includes:

extracting, according to the first operation instruction, a first storage address interval of the data (for example, a matrix) required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the matrix required by the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval have an overlapping area, determining that the first operation instruction and the zeroth operation instruction have an association relationship; if the first storage address interval and the zeroth storage address interval do not have an overlapping area, determining that the first operation instruction and the zeroth operation instruction do not have an association relationship.
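The overlap test described above can be sketched as follows. This is a minimal illustration, not the patented circuit: the `Interval` type, the half-open address convention, and the function name `has_association` are assumptions introduced here for the example.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical storage address interval, half-open: [start, end)."""
    start: int
    end: int


def has_association(first: Interval, zeroth: Interval) -> bool:
    """Return True when the two address intervals share at least one address.

    An overlap means the first operation instruction must wait until the
    zeroth operation instruction has finished executing.
    """
    return first.start < zeroth.end and zeroth.start < first.end


# Example intervals (addresses are made up for illustration).
zeroth_interval = Interval(0x1000, 0x1400)
overlapping_first = Interval(0x1200, 0x1600)   # shares 0x1200..0x13FF
disjoint_first = Interval(0x1400, 0x1800)      # starts where the zeroth ends

must_wait = has_association(overlapping_first, zeroth_interval)
can_issue = not has_association(disjoint_first, zeroth_interval)
```

In this sketch, an instruction for which `has_association` returns True would be held in the instruction storage unit and released only after the earlier instruction completes.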

In an alternative embodiment, the arithmetic unit 12 may comprise one master processing circuit 101 and a plurality of slave processing circuits 102, as shown in fig. 14. In one embodiment, as shown in fig. 14, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with other adjacent slave processing circuits, and the master processing circuit is connected with k slave processing circuits among the plurality of slave processing circuits. As shown in fig. 14, the k slave processing circuits include only the n slave processing circuits in the 1st row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the 1st column; that is, the k slave processing circuits are the slave processing circuits directly connected to the master processing circuit among the plurality of slave processing circuits.

k slave processing circuits for forwarding data and instructions between the master processing circuit and the plurality of slave processing circuits.

Optionally, as shown in fig. 15, the main processing circuit may further include: one or any combination of a conversion processing circuit 110, an activation processing circuit 111, and an addition processing circuit 112;

a conversion processing circuit 110 for performing a conversion between a first data structure and a second data structure (e.g., a conversion between continuous data and discrete data) on the data blocks or intermediate results received by the master processing circuit, or for performing a conversion between a first data type and a second data type (e.g., a conversion between a fixed point type and a floating point type) on the data blocks or intermediate results received by the master processing circuit;

an activation processing circuit 111 for performing an activation operation of data in the main processing circuit;

the addition processing circuit 112 is used for executing addition operation or accumulation operation.

The master processing circuit is used for determining that the input neurons are broadcast data and the weights are distribution data, dividing the distribution data into a plurality of data blocks, and sending at least one data block of the plurality of data blocks and at least one operation instruction of a plurality of operation instructions to the slave processing circuits;

the plurality of slave processing circuits are used for executing operations on the received data blocks according to the operation instruction to obtain intermediate results, and transmitting the intermediate results to the master processing circuit;

the main processing circuit is used for processing the intermediate results sent by the plurality of slave processing circuits to obtain the result of the calculation instruction, and sending the result of the calculation instruction to the controller unit.

The slave processing circuit includes: a multiplication processing circuit;

the multiplication processing circuit is used for executing product operation on the received data blocks to obtain a product result;

a forwarding processing circuit (optional) for forwarding the received data block or the product result.

And the accumulation processing circuit is used for executing accumulation operation on the product result to obtain the intermediate result.

In another embodiment, the operation instruction is a matrix-by-matrix instruction, an accumulation instruction, an activation instruction, or the like.

The specific calculation method of the calculation device shown in fig. 1 is described below using the neural network operation instruction. For a neural network operation instruction, the formula that it is required to execute may be: $s = s\left(\sum w x_i + b\right)$, that is, the weight $w$ is multiplied by the input data $x_i$ and the products are summed, the bias $b$ is added, and the activation operation $s(h)$ is performed to obtain the final output result $s$.
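As an illustration of how this formula maps onto the master/slave split described earlier (input neurons broadcast, weights distributed in blocks), the following is a minimal software sketch. The function names, the row-wise blocking of the weight matrix, and the choice of ReLU as the activation are assumptions made for the example and are not taken from the patent.

```python
def slave_process(weight_rows, x):
    """One slave: multiply each received weight row by the broadcast input
    and accumulate, yielding one intermediate result per row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


def master_process(weights, x, bias, num_slaves, activation):
    """Master: split the weights into blocks, collect the intermediate
    results from the slaves, then add the bias and apply the activation."""
    # Ceiling division so every row lands in some block (at least 1 row each).
    rows_per_block = max(1, -(-len(weights) // num_slaves))
    blocks = [weights[i:i + rows_per_block]
              for i in range(0, len(weights), rows_per_block)]
    intermediate = [h for block in blocks for h in slave_process(block, x)]
    return [activation(h + b) for h, b in zip(intermediate, bias)]


def relu(h):
    """Example activation s(h); any other activation could be passed in."""
    return max(0.0, h)


W = [[1, 2], [3, 4], [5, 6]]   # three output neurons, two inputs each
x = [1, 1]                     # broadcast input neurons
b = [-4, 0, 1]                 # one bias per output neuron
output = master_process(W, x, b, num_slaves=2, activation=relu)
```

Here the two hypothetical slaves receive two rows and one row respectively; the master only performs the cheap bias and activation steps on the returned intermediate results.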

In an alternative embodiment, as shown in fig. 16, the operation unit includes: a tree module 40, the tree module comprising: a root port 401 and a plurality of branch ports 404, wherein the root port of the tree module is connected with the main processing circuit, and the plurality of branch ports of the tree module are respectively connected with one of a plurality of slave processing circuits;

The above-mentioned tree module has a transmitting and receiving function: for example, fig. 16 shows the tree module performing the transmitting function, and fig. 17 shows the tree module performing the receiving function.

The tree module is used for forwarding the data blocks, the weights and the operation instructions between the master processing circuit and the plurality of slave processing circuits.

Optionally, the tree module is an optional structure of the computing device. It may include at least one layer of nodes; each node is a line structure with a forwarding function and may itself have no computing function. If the tree module has zero layers of nodes, the tree module is not needed.

Optionally, the tree module may have an n-ary tree structure, for example the binary tree structure shown in fig. 18, or a ternary tree structure, where n may be an integer greater than or equal to 2. The embodiments of the present application do not limit the specific value of n. The number of layers may also be 2, and the slave processing circuits may be connected to nodes of layers other than the penultimate layer, for example to the nodes of the last layer shown in fig. 18.

Alternatively, the above-mentioned operation unit may carry a separate cache, as shown in fig. 19, and may include: a neuron buffering unit 63 which buffers the input neuron vector data and the output neuron value data of the slave processing circuit.

As shown in fig. 20, the operation unit may further include: the weight buffer unit 64 is used for buffering the weight data required by the slave processing circuit in the calculation process.

In an alternative embodiment, the arithmetic unit 12 may comprise a branch processing circuit 103 as shown in fig. 21; a specific connection structure thereof is shown in fig. 21, in which,

the master processing circuit 101 is connected to the branch processing circuit(s) 103, and the branch processing circuit 103 is connected to the one or more slave processing circuits 102;

the branch processing circuit 103 is used for forwarding data or instructions between the master processing circuit 101 and the slave processing circuits 102.

The application also discloses a neural network computing device, which comprises one or more of the computing devices mentioned in the application, and is used for acquiring data to be computed and control information from other processing devices, executing the specified machine learning operation, and transmitting the execution result to peripheral equipment through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces, and servers. When more than one computing device is included, the computing devices may be linked and transfer data through a specific structure, for example interconnected via a PCIE bus, to support larger-scale machine learning operations. In this case, the computing devices may share the same control system or have independent control systems; they may share memory, or each accelerator may have its own memory. In addition, the interconnection may use any interconnection topology.

The neural network operation device has higher compatibility and can be connected with various servers through PCIE interfaces.

The application also discloses a combined processing device which comprises the neural network operation device, a universal interconnection interface and other processing devices. The neural network operation device interacts with other processing devices to jointly complete the operation designated by the user. Fig. 22 is a schematic view of a combination processing apparatus.

Other processing means may include one or more processor types of general purpose/special purpose processors such as Central Processing Units (CPU), graphics Processing Units (GPU), neural network processors, etc. The number of processors included in the other processing means is not limited. Other processing devices are used as interfaces between the neural network operation device and external data and control, including data carrying, and basic control such as starting and stopping of the neural network operation device is completed; other processing devices can also cooperate with the neural network computing device to complete the computing task.

The universal interconnection interface is used for transmitting data and control instructions between the neural network operation device and the other processing devices. The neural network operation device acquires the required input data from the other processing devices and writes it into the on-chip storage device of the neural network operation device; it can obtain control instructions from the other processing devices and write them into the on-chip control cache of the neural network operation device; it can also read the data in the storage module of the neural network operation device and transmit it to the other processing devices.

Optionally, as shown in fig. 23, the structure may further include a storage device, where the storage device is connected to the neural network computing device and the other processing device, respectively. The storage device is used for storing the data in the neural network operation device and the other processing devices, and is particularly suitable for the data which is required to be operated and cannot be stored in the internal storage of the neural network operation device or the other processing devices.

The combined processing device can be used as the SOC (system on chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle, or video monitoring equipment, which effectively reduces the core area of the control part, improves the processing speed, and reduces the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as cameras, displays, mice, keyboards, network cards, and WiFi interfaces.

In some embodiments, a chip is also disclosed, which includes the neural network computing device or the combination processing device.

In some embodiments, a chip package structure is disclosed, which includes the chip.

In some embodiments, a board card is provided that includes the chip package structure described above. Referring to fig. 24, fig. 24 provides a board that may include other mating components in addition to the chips 389, including but not limited to: a memory device 390, an interface device 391 and a control device 392;

The memory device 390 is connected to the chip in the chip package structure through a bus and is used for storing data. The memory device may include multiple sets of memory cells 393. Each set of memory cells is connected with the chip through a bus. It will be appreciated that each set of memory cells may be DDR SDRAM (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory).

DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read out on both the rising and falling edges of the clock pulse; DDR is therefore twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 sets of the memory cells. Each set of the memory cells may include a plurality of DDR4 chips (granules). In one embodiment, the chip may include four 72-bit DDR4 controllers, where 64 bits of each 72-bit DDR4 controller are used to transfer data and 8 bits are used for ECC checking. It will be appreciated that the theoretical data transfer bandwidth can reach 25600MB/s when DDR4-3200 chips are employed in each set of memory cells.
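The bandwidth figure follows from simple arithmetic, sketched below under stated assumptions: a DDR4-3200 part performs 3200 million transfers per second, and only the 64 data bits of each 72-bit controller carry payload (the remaining 8 bits are ECC). The function name is hypothetical.

```python
def ddr_data_bandwidth_mb_s(mega_transfers_per_s: int,
                            bus_bits: int,
                            ecc_bits: int) -> int:
    """Theoretical payload bandwidth in MB/s for one DDR controller.

    Each transfer moves (bus_bits - ecc_bits) bits of data; dividing by 8
    converts bits to bytes.
    """
    data_bits = bus_bits - ecc_bits
    return mega_transfers_per_s * data_bits // 8


# DDR4-3200 on a 72-bit controller with 8 ECC bits (assumed values).
per_set_bandwidth = ddr_data_bandwidth_mb_s(3200, bus_bits=72, ecc_bits=8)
```

With these inputs the result is 3200 × 64 / 8 = 25600 MB/s per set of memory cells.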

In one embodiment, each set of the memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. And a controller for controlling DDR is arranged in the chip and is used for controlling data transmission and data storage of each storage unit.

The interface device is electrically connected with the chip in the chip package structure. The interface device is used for enabling data transmission between the chip and an external device, such as a server or a computer. For example, in one embodiment, the interface device may be a standard PCIE interface: the data to be processed is transferred from the server to the chip through the standard PCIE interface. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000MB/s. In another embodiment, the interface device may be another interface; the application does not limit the specific form of the other interface, as long as the interface unit can implement the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., a server) by the interface device.

The control device is electrically connected with the chip. The control device is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may comprise a microcontroller unit (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may drive a plurality of loads. Therefore, the chip can be in different working states such as multi-load and light-load. The control device can regulate the working states of the plurality of processing chips, the plurality of processing cores, and/or the plurality of processing circuits in the chip.

In some embodiments, an electronic device is provided that includes the above board card.

The electronic device includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.

The vehicle comprises an aircraft, a ship, and/or a motor vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers, and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.

In one embodiment, as shown in fig. 25, there is provided a network-on-chip data processing method, the method comprising the steps of:

step 202, accessing a storage device through a first computing device to obtain first operation data.

Wherein the first computing device comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, a controller unit in a first computing device acquires first operation data and a computing instruction from a storage device.

Step 204, performing an operation on the first operation data by using the first computing device to obtain a first operation result.

The first operation data read from the storage device is operated in the first computing device according to the corresponding computing instruction, and a first operation result is obtained.

Step 206, sending the first operation result to the second computing device.

The controller unit in the first computing device sends the first operation result to the second computing device through the communication channel established between the first computing device and the second computing device. Optionally, the first operation result may be sent to the second computing device, or it may be sent to the storage device.

Further, the network-on-chip data processing method provided in this embodiment may be applied to any one of the network-on-chip processing systems shown in fig. 1 to 5.

According to the network-on-chip data processing method, the first operation result in the first computing device is sent to the second computing device, so that data transmission among a plurality of computing devices can be realized; meanwhile, through multiplexing the operation data, the excessive bandwidth expense caused by multiple accesses to the storage device by the computing device can be avoided.

In one embodiment, as shown in fig. 26, there is provided a network-on-chip data processing method, which includes the steps of:

Step 302, accessing a storage device through a first computing device to obtain first operation data.

Wherein the first computing device comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, a controller unit in the first computing device acquires first operation data and a computing instruction from the storage device.

Step 304, performing an operation on the first operation data by using the first computing device to obtain a first operation result.

Step 306, sending the first operation result to the second computing device.

The controller unit in the first computing device sends the first operation result to the second computing device through the communication channel established between the first computing device and the second computing device.

Step 308, accessing the storage device through the second computing device, and obtaining second operation data.

Wherein the second computing device comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in the second computing device acquires the second operation data and the computation instruction from the storage device.

Step 310, performing an operation on the second operation data and the first operation result by using the second computing device to obtain a second operation result.

The second operation data read from the storage device and the first operation result received from the first computing device are operated on in the second computing device according to the corresponding computing instruction, so that the second operation result is obtained.

According to the network-on-chip data processing method, the first operation result in the first computing device is sent to the second computing device, the second computing device performs operation again by using the first operation result, multiplexing of operation data can be achieved, the operation data and the intermediate operation result can be reasonably utilized, and data processing efficiency is improved.
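Steps 302 to 310 can be modelled in software as a small two-stage pipeline. The sketch below is only an illustration: the `ComputingDevice` class, the dictionary standing in for the storage device, and the doubling and addition operations are all assumptions chosen to keep the example short.

```python
class ComputingDevice:
    """Hypothetical software stand-in for one computing device."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage   # shared storage device, modelled as a dict
        self.inbox = []          # results received from other devices

    def fetch(self, key):
        """Access the storage device to obtain operation data."""
        return self.storage[key]

    def send(self, value, target):
        """Send a result over the channel to another computing device."""
        target.inbox.append(value)


storage = {"first_data": [1, 2, 3], "second_data": [10, 20, 30]}
dev1 = ComputingDevice("first", storage)
dev2 = ComputingDevice("second", storage)

first_data = dev1.fetch("first_data")                  # step 302
first_result = [v * 2 for v in first_data]             # step 304 (example op)
dev1.send(first_result, dev2)                          # step 306
second_data = dev2.fetch("second_data")                # step 308
received = dev2.inbox.pop()
second_result = [a + b for a, b in zip(second_data, received)]  # step 310
```

The point of the sketch is that the second device consumes the first result directly from its inbox, so the intermediate result never has to be written to and re-read from the storage device.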

In one embodiment, the network-on-chip data processing method shown in fig. 26 is applied to the network-on-chip processing system 1900 shown in fig. 9, where each of the computing devices 1902 to 1905 is connected to the storage device 1901 in the network-on-chip processing module where it is located, and any two of the computing devices 1902 to 1905 are directly connected to each other.

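For example, to calculate a matrix multiplication, given the matrix $A = \begin{pmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{pmatrix}$ and the matrix $B = \begin{pmatrix} b_{00} & b_{01} \\ b_{10} & b_{11} \end{pmatrix}$, compute the matrix $C = A \times B = \begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix}$,

where

$c_{00} = a_{00} b_{00} + a_{01} b_{10}$;

$c_{01} = a_{00} b_{01} + a_{01} b_{11}$;

$c_{10} = a_{10} b_{00} + a_{11} b_{10}$;

$c_{11} = a_{10} b_{01} + a_{11} b_{11}$.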

First, time is divided to obtain three time periods.

Then, during a first period of time, the computing devices 1902-1905 simultaneously access the storage device 1901 in the network-on-chip processing module in which they reside.

Specifically, the computing device 1902 reads the first operation data $a_{00}$ and $b_{00}$ from the storage device 1901; the computing device 1903 reads the first operation data $a_{01}$ and $b_{11}$ from the storage device 1901; the computing device 1904 reads the first operation data $a_{11}$ and $b_{10}$ from the storage device 1901; the computing device 1905 reads the first operation data $a_{10}$ and $b_{01}$ from the storage device 1901.

Further, the computing device 1902 operates on the read first operation data $a_{00}$ and $b_{00}$ to obtain the first operation result $a_{00} b_{00}$; the computing device 1903 operates on the read first operation data $a_{01}$ and $b_{11}$ to obtain the first operation result $a_{01} b_{11}$; the computing device 1904 operates on the read first operation data $a_{11}$ and $b_{10}$ to obtain the first operation result $a_{11} b_{10}$; the computing device 1905 operates on the read first operation data $a_{10}$ and $b_{01}$ to obtain the first operation result $a_{10} b_{01}$.

Next, during the second time period, the computing device 1902 reads the first operation data $a_{01}$ from the computing device 1903 and the first operation data $b_{10}$ from the computing device 1904, and obtains the second operation result $a_{01} b_{10}$ through operation; the computing device 1903 reads the first operation data $a_{00}$ from the computing device 1902 and the first operation data $b_{01}$ from the computing device 1905, and obtains the second operation result $a_{00} b_{01}$ through operation; the computing device 1904 reads the first operation data $a_{10}$ from the computing device 1905 and the first operation data $b_{00}$ from the computing device 1902, and obtains the second operation result $a_{10} b_{00}$ through operation; the computing device 1905 reads the first operation data $a_{11}$ from the computing device 1904 and the first operation data $b_{11}$ from the computing device 1903, and obtains the second operation result $a_{11} b_{11}$ through operation.

Then, during the third time period, the computing device 1902 operates on the first operation result $a_{00} b_{00}$ and the second operation result $a_{01} b_{10}$ to obtain the third operation result $c_{00} = a_{00} b_{00} + a_{01} b_{10}$, and sends the third operation result $c_{00}$ to the storage device 1901; the computing device 1903 operates on the first operation result $a_{01} b_{11}$ and the second operation result $a_{00} b_{01}$ to obtain the third operation result $c_{01} = a_{00} b_{01} + a_{01} b_{11}$, and sends the third operation result $c_{01}$ to the storage device 1901; the computing device 1904 operates on the first operation result $a_{11} b_{10}$ and the second operation result $a_{10} b_{00}$ to obtain the third operation result $c_{10} = a_{10} b_{00} + a_{11} b_{10}$, and sends the third operation result $c_{10}$ to the storage device 1901; the computing device 1905 operates on the first operation result $a_{10} b_{01}$ and the second operation result $a_{11} b_{11}$ to obtain the third operation result $c_{11} = a_{10} b_{01} + a_{11} b_{11}$, and sends the third operation result $c_{11}$ to the storage device 1901.
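The three time periods can be checked with a short simulation. The sketch below follows the read assignments and peer exchanges listed in the text; the dictionary-based storage model, the variable names, and the sample values of the matrix elements are assumptions made for illustration.

```python
# Storage device 1901, holding the elements of the two input matrices
# (sample values chosen for the example).
storage = {"a00": 1, "a01": 2, "a10": 3, "a11": 4,
           "b00": 5, "b01": 6, "b10": 7, "b11": 8}

# Period 1: which (a, b) pair each computing device reads from storage.
storage_reads = {1902: ("a00", "b00"), 1903: ("a01", "b11"),
                 1904: ("a11", "b10"), 1905: ("a10", "b01")}

# Period 2: (device supplying a, device supplying b) for each device.
peer_reads = {1902: (1903, 1904), 1903: (1902, 1905),
              1904: (1905, 1902), 1905: (1904, 1903)}

# Period 3: which element of the result matrix each device produces.
outputs = {1902: "c00", 1903: "c01", 1904: "c10", 1905: "c11"}

# Period 1: read from storage and compute the first operation results.
local = {dev: (storage[a], storage[b]) for dev, (a, b) in storage_reads.items()}
first = {dev: a * b for dev, (a, b) in local.items()}

# Period 2: read operands from peer devices, compute the second results.
second = {dev: local[src_a][0] * local[src_b][1]
          for dev, (src_a, src_b) in peer_reads.items()}

# Period 3: add the two partial products and write back to storage.
for dev, name in outputs.items():
    storage[name] = first[dev] + second[dev]

result = [[storage["c00"], storage["c01"]],
          [storage["c10"], storage["c11"]]]
```

In this model each input element is read from the storage device exactly once (eight reads in total), whereas four devices each fetching all four of their operands directly from storage would need sixteen reads.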

In one embodiment, as shown in fig. 27, there is provided a network-on-chip data processing method, which includes the steps of:

Step 402, accessing a storage device through a first computing device group to obtain first operation data, wherein the first computing device group includes a plurality of first computing devices.

Wherein each first computing device in the first computing device group cluster1 comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in cluster1 acquires the first operation data and the calculation instruction from the storage device.

Optionally, a plurality of first computing devices in cluster1 access the storage device simultaneously, each first computing device reads a part of the data required by cluster1 from the storage device, and the data are transmitted within cluster1. Optionally, one or more first computing devices in cluster1 are designated as able to access the storage device, and the remaining first computing devices can only perform intra-group communication.

Step 404, performing an operation on the plurality of first operation data by the first computing device group to obtain a first operation result.

The first operation data are operated and forwarded among the first computing devices according to the corresponding computing instructions, and a first operation result is obtained.

Step 406, sending the first operation result to a second computing device group.

The controller unit in cluster1 sends the first operation result to cluster2 through the communication channel established between cluster1 and the second computing device group cluster2.

Optionally, the first operation result may be sent to cluster2, or it may be sent to the storage device. Optionally, the first operation result is sent to cluster2 through any first computing device in cluster1 that has established a communication channel with cluster2. Optionally, cluster1 may send the first operation result to any second computing device in cluster2 that has established a communication channel with cluster1.

Further, the network-on-chip data processing method provided in this embodiment may be applied to any one of the network-on-chip processing systems shown in fig. 6 to 8.

According to the network-on-chip data processing method, the plurality of computing device groups can realize intra-group communication and inter-group data transmission, the operation data and the intermediate operation result can be reasonably utilized, and the data processing efficiency is improved.
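The group-level flow (partial reads by designated devices, intra-group sharing, then inter-group transfer) can be sketched as follows. The `ComputingGroup` class, the round-robin assignment of reads, and the summation used as the group operation are hypothetical choices for illustration only.

```python
class ComputingGroup:
    """Hypothetical model of a computing device group (e.g. cluster1)."""

    def __init__(self, name, num_devices, storage, readers):
        self.name = name
        self.storage = storage
        self.readers = readers    # indices of devices allowed to access storage
        self.buffers = [dict() for _ in range(num_devices)]
        self.received = []        # results received from other groups

    def load(self, keys):
        # Each designated device reads a part of the data the group needs.
        for i, key in enumerate(keys):
            reader = self.readers[i % len(self.readers)]
            self.buffers[reader][key] = self.storage[key]
        # Intra-group communication: every device ends up with all the data.
        merged = {}
        for buf in self.buffers:
            merged.update(buf)
        for buf in self.buffers:
            buf.update(merged)

    def compute(self, keys):
        # Each device handles a slice of the keys; partials are then combined.
        n = len(self.buffers)
        partials = [sum(self.buffers[d][k] for k in keys[d::n])
                    for d in range(n)]
        return sum(partials)

    def send(self, result, other):
        # Inter-group transfer over an established communication channel.
        other.received.append(result)


storage = {"d0": 1, "d1": 2, "d2": 3, "d3": 4}
keys = list(storage)
cluster1 = ComputingGroup("cluster1", 4, storage, readers=[0, 1])
cluster2 = ComputingGroup("cluster2", 4, storage, readers=[0])

cluster1.load(keys)                       # step 402
first_result = cluster1.compute(keys)     # step 404
cluster1.send(first_result, cluster2)     # step 406
```

Only devices 0 and 1 of the hypothetical cluster1 touch the storage device here; devices 2 and 3 obtain their operands purely through intra-group communication.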

In one embodiment, as shown in fig. 28, there is provided a network-on-chip data processing method, which includes the steps of:

step 502, accessing a storage device through a first computing device group, wherein the first computing device group comprises a plurality of first computing devices, and acquiring first operation data.

Step 504, performing an operation on the plurality of first operation data by using the first computing device group, so as to obtain a first operation result.

Step 506, sending the first operation result to a second computing device group.

Optionally, the first operation result is sent to cluster2 through any first computing device in cluster1 that has established a communication channel with cluster2. Optionally, cluster1 may send the first operation result to any second computing device in cluster2 that has established a communication channel with cluster1.

Step 508, accessing the storage device through the second computing device group, and obtaining second operation data, wherein the second computing device group includes a plurality of second computing devices.

Wherein each second computing device in cluster2 comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in cluster2 acquires the second operation data and the calculation instruction from the storage device.

Optionally, a plurality of second computing devices in cluster2 access the storage device simultaneously, each second computing device reads a part of the data required by cluster2 from the storage device, and the data are transmitted within cluster2. Optionally, one or more second computing devices in cluster2 are designated as able to access the storage device, and the remaining second computing devices can only perform intra-group communication.

Step 510, performing an operation on the second operation data and the first operation result by using the second computing device group, to obtain a second operation result.

Specifically, the second operation data read from the storage device and the first operation result received from the first computing device group are operated on and forwarded among the plurality of second computing devices according to the corresponding calculation instructions, so that a second operation result is obtained.

According to the network-on-chip data processing method, the first operation result of the first computing device group is sent to the second computing device group, and the second computing device group performs a further operation using the first operation result. Operation data can thus be reused, the operation data and the intermediate operation results are reasonably utilized, and data processing efficiency is improved.
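
The flow of steps 502 to 510 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the Cluster class, the dictionary-backed storage, the key names, and the use of summation as the "operation" are all hypothetical choices made for the example.

```python
from typing import Dict, List

# Hypothetical shared storage device: maps a key to a list of operands.
storage: Dict[str, List[float]] = {
    "first_operation_data": [1.0, 2.0, 3.0, 4.0],
    "second_operation_data": [10.0, 20.0],
}


class Cluster:
    """Hypothetical computing device group with `size` computing devices."""

    def __init__(self, name: str, size: int) -> None:
        self.name = name
        self.size = size

    def read(self, key: str) -> List[List[float]]:
        # Each device reads one slice of the data the group needs
        # (the "simultaneous access" option described above).
        data = storage[key]
        return [data[i::self.size] for i in range(self.size)]

    def compute(self, parts: List[List[float]]) -> float:
        # Each device reduces its slice; the partial sums are then combined
        # through intra-group communication. Summation is an assumption that
        # stands in for the real calculation instruction.
        return sum(sum(part) for part in parts)


cluster1 = Cluster("cluster1", size=2)
cluster2 = Cluster("cluster2", size=2)

# Steps 502 and 504: cluster1 reads its data and computes.
first_result = cluster1.compute(cluster1.read("first_operation_data"))

# Step 506: the first result is handed to cluster2.
# Steps 508 and 510: cluster2 reads its own data and reuses the first result.
second_parts = cluster2.read("second_operation_data")
second_result = cluster2.compute(second_parts + [[first_result]])

print(first_result, second_result)
```

The point of the sketch is that cluster2 never re-reads or recomputes the first operation data; it only consumes the forwarded result.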

In one embodiment, as shown in fig. 29, there is provided a network-on-chip data processing method, the method including the steps of:

Step 602, obtaining first operation data through a first network-on-chip processing module, where the first network-on-chip processing module includes a first storage device and a plurality of first computing devices, and the first operation data is stored in the first storage device.

Wherein each first computing device in the first network-on-chip processing module comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, a controller unit in the first network-on-chip processing module obtains first operation data and calculation instructions from the first storage device.

Optionally, a plurality of first computing devices in the first network-on-chip processing module access the first storage device simultaneously, and each first computing device reads a part of data required by the first network-on-chip processing module from the first storage device, and the data are transmitted in the first network-on-chip processing module.

Optionally, one or more first computing devices in the first network-on-chip processing module are designated to have access to the first storage device, and the remaining first computing devices are only capable of intra-group communication. Specifically, the operation data required to be processed by the first network-on-chip processing module are stored in the first storage device.

Step 604, performing an operation on the first operation data by using a plurality of first computing devices in the first network-on-chip processing module, to obtain a first operation result.

Step 606, sending the first operation result to a second network-on-chip processing module.

The first network-on-chip processing module sends the first operation result, by means of a controller unit in the first network-on-chip processing module, to the second network-on-chip processing module over a communication channel established between the two modules.

Optionally, the first operation result may be sent to the second network-on-chip processing module, or may be sent to the first storage device. Optionally, the first operation result is sent to the second network-on-chip processing module through any first computing device in the first network-on-chip processing module that has established a communication channel with the second network-on-chip processing module. Optionally, the first network-on-chip processing module may send the first operation result to any second computing device in the second network-on-chip processing module that has established a communication channel with the first network-on-chip processing module.

Further, the network-on-chip data processing method provided in this embodiment may be applied to any one of the network-on-chip processing systems shown in fig. 9 to 12.

According to the network-on-chip data processing method, the plurality of network-on-chip processing modules can realize intra-module communication and inter-module data transmission, the operation data and the intermediate operation result can be reasonably utilized, and the data processing efficiency is improved.

In one embodiment, as shown in fig. 30, there is provided a network-on-chip data processing method, which includes the steps of:

Step 702, obtaining first operation data through a first network-on-chip processing module, where the first network-on-chip processing module includes a first storage device and a plurality of first computing devices, and the first operation data is stored in the first storage device.

Step 704, performing an operation on the first operation data by using a plurality of first computing devices in the first network-on-chip processing module, to obtain a first operation result.

Step 706, sending the first operation result to a second network-on-chip processing module.

Optionally, the first operation result is sent to the second network-on-chip processing module through any first computing device in the first network-on-chip processing module that has established a communication channel with the second network-on-chip processing module. Optionally, the first network-on-chip processing module may send the first operation result to any second computing device in the second network-on-chip processing module that has established a communication channel with the first network-on-chip processing module.

Step 708, obtaining second operation data through the second network-on-chip processing module, where the second network-on-chip processing module includes a second storage device and a plurality of second computing devices, and the second operation data is stored in the second storage device.

Wherein each second computing device in the second network-on-chip processing module comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in the second network-on-chip processing module obtains the second operation data and the calculation instruction from the second storage device.

Optionally, a plurality of second computing devices in the second network-on-chip processing module access the second storage device simultaneously, and each second computing device reads a part of data required by the second network-on-chip processing module from the second storage device, and the data are transmitted in the second network-on-chip processing module.

Optionally, one or more second computing devices in the second network-on-chip processing module are designated to have access to the second storage device, and the remaining second computing devices are only capable of intra-group communication. Specifically, the operation data required to be processed by the second network-on-chip processing module are stored in the second storage device.

Step 710, performing an operation on the second operation data and the first operation result by using a plurality of second computing devices in the second network-on-chip processing module, so as to obtain a second operation result.

Step 710 specifically includes the following steps:

Step 7102, performing an operation on the second operation data and the first operation result among the plurality of second computing devices to obtain the second operation result.

Specifically, each second computing device may perform an operation on the second operation data and the first operation result according to the corresponding computing instruction to obtain a plurality of intermediate results, and then perform an operation on the plurality of intermediate results according to the corresponding computing instruction to obtain a second operation result.

Step 7104, storing the second operation result in the second storage device.

According to the network-on-chip data processing method, the first operation result of the first network-on-chip processing module is sent to the second network-on-chip processing module, and the second network-on-chip processing module performs a further operation using the first operation result. Operation data can thus be reused, the operation data and the intermediate operation results are reasonably utilized, and data processing efficiency is improved.
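
Steps 702 to 7104 differ from the device-group flow in that each module owns its storage device and writes its result back to it. A minimal Python sketch of that variant follows. The ProcessingModule class, its inbox list, and the use of summation as the operation are assumptions introduced for illustration only.

```python
from typing import List


class ProcessingModule:
    """Hypothetical network-on-chip processing module with private storage."""

    def __init__(self, operation_data: List[float], num_devices: int) -> None:
        self.storage = {"operation_data": list(operation_data)}
        self.num_devices = num_devices
        self.inbox: List[float] = []  # results received from other modules

    def receive(self, result: float) -> None:
        self.inbox.append(result)

    def run(self) -> float:
        operands = self.storage["operation_data"] + self.inbox
        # Each computing device produces one intermediate result ...
        intermediates = [
            sum(operands[i::self.num_devices]) for i in range(self.num_devices)
        ]
        # ... and the intermediate results are combined into the module result.
        result = sum(intermediates)
        # Analogue of step 7104: keep the result in the module's own storage.
        self.storage["result"] = result
        return result


module1 = ProcessingModule([1.0, 2.0, 3.0], num_devices=2)
module2 = ProcessingModule([4.0, 5.0], num_devices=2)

first_result = module1.run()    # steps 702 and 704
module2.receive(first_result)   # step 706
second_result = module2.run()   # steps 708 and 710
print(first_result, second_result, module2.storage["result"])
```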

The network-on-chip processing method in the embodiments of the present application can be used for machine learning calculation, and in particular for artificial neural network operations. The operation data in the network-on-chip processing system may specifically include input neuron data and weight data, and the operation result in the network-on-chip processing system may specifically be output neuron data, that is, the result of the artificial neural network operation.

The operation in the neural network may be the operation of one layer of the neural network. For a multi-layer neural network, the implementation process is as follows. In the forward operation, after execution of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output neurons calculated in the operation unit as the input neurons of the next layer (or performs certain operations on those output neurons before using them as the input neurons of the next layer), and at the same time replaces the weights with the weights of the next layer. In the backward operation, after the backward operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the input neuron gradients calculated in the operation unit as the output neuron gradients of the next layer (or performs certain operations on those input neuron gradients before using them as the output neuron gradients of the next layer), and at the same time replaces the weights with the weights of the next layer.
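
The forward half of this layer-by-layer chaining can be sketched as below. The sketch is illustrative only: the function names, the list-based matrices, and the choice of ReLU as the per-layer activation are assumptions, and the backward operation is not shown.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def layer_forward(weights: Matrix, inputs: Vector) -> Vector:
    """One layer: each output neuron is a weighted sum passed through ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]


def forward(layer_weights: List[Matrix], network_input: Vector) -> Vector:
    neurons = network_input
    for weights in layer_weights:  # the weights are replaced for each layer
        # Output neurons of this layer become input neurons of the next one.
        neurons = layer_forward(weights, neurons)
    return neurons


layer_weights = [
    [[1.0, -1.0], [0.5, 0.5]],  # layer 1: 2 inputs -> 2 outputs
    [[2.0, 1.0]],               # layer 2: 2 inputs -> 1 output
]
output = forward(layer_weights, [3.0, 1.0])
print(output)
```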

The machine learning computation may also include support vector machine operations, k-nearest neighbor (k-nn) operations, k-means (k-means) operations, principal component analysis operations, and the like. For convenience of description, a specific scheme of machine learning calculation is described below by taking an artificial neural network operation as an example.

For the artificial neural network operation, if the operation has multiple layers, the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and the output layer of the whole neural network. Instead, for any two adjacent layers in the network, the neurons in the lower layer of the forward operation are the input neurons, and the neurons in the upper layer of the forward operation are the output neurons. Taking a convolutional neural network as an example, let the network have L layers and let K = 1, 2, ..., L-1. For the K-th layer and the (K+1)-th layer, the K-th layer is referred to as the input layer, whose neurons are the input neurons, and the (K+1)-th layer is referred to as the output layer, whose neurons are the output neurons. That is, every layer except the topmost layer can serve as an input layer, and the next layer is the corresponding output layer.
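
As a small worked illustration of this naming convention, the following sketch lists the (input layer, output layer) pairs for a network with L layers. The function name is hypothetical.

```python
from typing import List, Tuple


def adjacent_layer_pairs(num_layers: int) -> List[Tuple[int, int]]:
    """Return (input_layer, output_layer) index pairs for K = 1 .. L-1."""
    return [(k, k + 1) for k in range(1, num_layers)]


pairs = adjacent_layer_pairs(4)
print(pairs)  # [(1, 2), (2, 3), (3, 4)]
```

With L = 4, layers 2 and 3 each appear once as an output layer and once as an input layer, while the topmost layer 4 only ever appears as an output layer.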

In an alternative embodiment, taking the fully connected operation in the neural network operation as an example, the process may be y = f(wx + b), where x is the input neuron matrix, w is the weight matrix, b is the bias scalar, and f is the activation function, which may specifically be any one of the sigmoid, tanh, relu, and softmax functions. Assuming a binary tree structure with 8 slave processing circuits, the method may be implemented as follows:

the controller unit acquires an input neuron matrix x, a weight matrix w and a full-connection operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the full-connection operation instruction to the main processing circuit;

the main processing circuit determines the input neuron matrix x as broadcast data and the weight matrix w as distribution data, splits the weight matrix w into 8 sub-matrices, distributes the 8 sub-matrices to the 8 slave processing circuits through a tree module, and broadcasts the input neuron matrix x to the 8 slave processing circuits;

the slave processing circuits perform, in parallel, the multiplication and accumulation operations of the 8 sub-matrices with the input neuron matrix x to obtain 8 intermediate results, and send the 8 intermediate results to the main processing circuit;

the main processing circuit orders the 8 intermediate results to obtain the operation result of wx, applies the bias b to the operation result, and performs the activation operation to obtain the final result y; the final result y is sent to the controller unit, and the controller unit outputs it or stores it in the storage unit.
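
The four steps above can be sketched in plain Python as follows. This is a simplified illustration, not the circuit-level method: it assumes x is a single input vector rather than a matrix, splits w by rows, uses sigmoid as the activation f, and runs the 8 "slave" computations sequentially instead of in parallel. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]

NUM_SLAVES = 8


def split_rows(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into `parts` row blocks of near-equal size (distribution data)."""
    base, extra = divmod(len(w), parts)
    blocks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        blocks.append(w[start:start + size])
        start += size
    return blocks


def slave_compute(block: Matrix, x: Vector) -> Vector:
    """Multiply-accumulate of one sub-matrix with the broadcast input x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in block]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def fully_connected(w: Matrix, x: Vector, b: float) -> Vector:
    blocks = split_rows(w, NUM_SLAVES)            # master: distribute w
    partials = [slave_compute(blk, x) for blk in blocks]  # slaves
    wx = [v for part in partials for v in part]   # master: restore row order
    return [sigmoid(v + b) for v in wx]           # master: bias, then activation


w = [[float(i), 1.0] for i in range(8)]  # 8 rows: one row per slave circuit
x = [1.0, 0.0]
y = fully_connected(w, x, b=0.0)
print(y[0])
```

Splitting by rows means every slave needs the whole of x, which matches the description of x as broadcast data and w as distribution data.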

The method for executing the neural network forward operation instruction by the computing device shown in fig. 1 may specifically be:

The controller unit extracts the neural network forward operation instruction, the operation domain corresponding to the neural network operation instruction and at least one operation code from the instruction storage unit, transmits the operation domain to the data access unit, and sends the operation code to the operation unit.

The controller unit extracts the weight w and the bias b corresponding to the operation domain from the storage unit (when b is 0, the bias b does not need to be extracted), the weight w and the bias b are transmitted to the main processing circuit of the operation unit, the controller unit extracts the input data Xi from the storage unit, and the input data Xi is transmitted to the main processing circuit.

The main processing circuit determines multiplication operation according to the at least one operation code, determines that input data Xi are broadcast data, determines weight data are distribution data, and splits the weight w into n data blocks;

An instruction processing unit of the controller unit determines a multiplication instruction, a bias instruction and an accumulation instruction according to the at least one operation code, and sends these instructions to the main processing circuit. The main processing circuit broadcasts the multiplication instruction and the input data Xi to the plurality of slave processing circuits, and distributes the n data blocks among them (for example, with n slave processing circuits, each slave processing circuit is sent one data block). The slave processing circuits multiply the input data Xi by the received data blocks according to the multiplication instruction to obtain intermediate results, and send the intermediate results to the main processing circuit. The main processing circuit then performs an accumulation operation, according to the accumulation instruction, on the intermediate results sent by the plurality of slave processing circuits to obtain an accumulation result, adds the bias b to the accumulation result according to the bias instruction to obtain a final result, and sends the final result to the controller unit.
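
The multiply, accumulate, then bias sequence can be sketched as below for a single output neuron. The sketch rests on assumptions not stated in the text: w and Xi are treated as flat vectors, the n data blocks are contiguous slices, and each slave uses the slice of the broadcast Xi that lines up with its block. The function names are hypothetical.

```python
from typing import List

Vector = List[float]


def split_blocks(values: Vector, n: int) -> List[Vector]:
    """Split a vector into n contiguous blocks (the last ones may be shorter)."""
    size = -(-len(values) // n)  # ceiling division
    return [values[i * size:(i + 1) * size] for i in range(n)]


def forward_instruction(w: Vector, xi: Vector, b: float, n: int) -> float:
    w_blocks = split_blocks(w, n)   # master: weight w -> n data blocks
    x_blocks = split_blocks(xi, n)  # slice of Xi aligned with each block
    # Slaves: multiplication instruction on each block -> intermediate results.
    intermediates = [
        sum(wv * xv for wv, xv in zip(wb, xb))
        for wb, xb in zip(w_blocks, x_blocks)
    ]
    accumulated = sum(intermediates)  # master: accumulation instruction
    return accumulated + b            # master: bias instruction


result = forward_instruction([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 2.0, 2.0], b=0.5, n=2)
print(result)
```

Only the final value leaves the arithmetic unit; the per-block intermediate results are consumed in place, which is the saving the single-instruction scheme aims for.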

In addition, the order of addition and multiplication may be reversed.

According to the technical scheme, the multiplication operation and the bias operation of the neural network are realized through a single instruction, namely the neural network operation instruction. The intermediate results of the neural network calculation do not need to be stored or fetched, which reduces the storage and fetch operations for intermediate data, reduces the corresponding operation steps, and improves the calculation effect of the neural network.

It should be understood that, although the steps in the flowcharts of fig. 25-30 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of the steps is not strictly limited to this order, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 25-30 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor are these sub-steps or stages necessarily performed sequentially, as they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required in the present application.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, such as the division of the units, merely a logical function division, and there may be additional divisions in actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, or may be in electrical or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units described above may be implemented either in hardware or in software program modules.

The integrated units, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.

Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be implemented by a program that instructs associated hardware, and that the program may be stored in a computer readable memory, which may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

The embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is intended only to facilitate an understanding of the method of the present application and its core concepts. Meanwhile, a person skilled in the art may, in accordance with the ideas of the present application, make changes to the specific implementations and the scope of application. In view of the above, the content of this description should not be construed as limiting the present application.

Claims

1. A network-on-chip data processing method, wherein the method is applied to a network-on-chip processing system for performing machine learning calculations, the network-on-chip processing system comprising: a storage device and a computing device; the method comprises the following steps:

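accessing a storage device in the network-on-chip processing system through a first computing device in the network-on-chip processing system to obtain first operation data;

calculating the first operation data through the first computing device to obtain a first operation result;

transmitting the first operation result to a second computing device in the network-on-chip processing system;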

wherein the computing device comprises an arithmetic unit and a controller unit;

the accessing, by the first computing device in the network-on-chip processing system, the storage device in the network-on-chip processing system to obtain first operation data includes: a controller unit in the first computing device acquires the first operation data and a computing instruction from the storage device;

establishing a connection between each computing device in each network-on-chip processing system and one or more computing devices in other network-on-chip processing systems;

the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits;

the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with other adjacent slave processing circuits, the master processing circuit is connected with K slave processing circuits in the plurality of slave processing circuits, and the K slave processing circuits are: n slave processing circuits of the 1 st row, n slave processing circuits of the m th row, and m slave processing circuits of the 1 st column;

Forwarding of data and instructions between the master processing circuit and the plurality of slave processing circuits by the K slave processing circuits;

the main processing circuit determines that the input neuron is broadcast data, the weight is distribution data, the distribution data is distributed into a plurality of data blocks, and at least one data block in the plurality of data blocks and at least one operation instruction in a plurality of operation instructions are sent to the K slave processing circuits;

the K slave processing circuits convert data between the master processing circuit and the plurality of slave processing circuits;

the plurality of slave processing circuits execute operation on the received data blocks according to the operation instruction to obtain an intermediate result, and the operation result is transmitted to the K slave processing circuits;

and the main processing circuit performs subsequent processing on the intermediate results sent by the K slave processing circuits to obtain a first operation result of the calculation instruction, and sends the first operation result of the calculation instruction to the controller unit.

2. The method according to claim 1, wherein the computing, by the first computing device, the first operation data to obtain a first operation result includes:

Analyzing the calculation instructions through a controller unit in the first calculation device to obtain a plurality of calculation instructions, and sending the plurality of calculation instructions and the first calculation data to a main processing circuit in the first calculation device by the controller unit in the first calculation device;

performing, by a master processing circuit in the first computing device, preamble processing on the first operational data, and transmission data and operational instructions with a plurality of slave processing circuits in the first computing device;

a plurality of slave processing circuits in the first computing device execute intermediate operations in parallel according to operation data and operation instructions transmitted from a master processing circuit in the first computing device to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit in the first computing device;

and a main processing circuit in the first computing device executes subsequent processing on the plurality of intermediate results to obtain a first operation result of the computing instruction.

3. The method of claim 1, wherein the sending the first operation result to a second computing device in the network-on-chip processing system comprises:

A controller unit in the first computing device sends the first operation result to a second computing device in the network-on-chip processing system.

4. The method of claim 1, wherein the machine learning calculation comprises: an artificial neural network operation, the first operation data comprising: inputting neuron data and weight data; the first operation result is output neuron data.

5. The method of claim 3, wherein the computing device further comprises: a storage unit and a direct memory access unit, the storage unit comprising any combination of a register and a cache;

the cache is used for storing the first operation data;

the register is used for storing scalar quantities in the first operation data.

6. A method according to claim 3, wherein the controller unit comprises: an instruction storage unit, an instruction processing unit and a storage queue unit;

the instruction storage unit stores calculation instructions related to the artificial neural network operation;

the instruction processing unit analyzes the calculation instructions to obtain a plurality of operation instructions;

the storage queue unit stores an instruction queue, the instruction queue comprising: a plurality of operation instructions and/or calculation instructions to be executed according to the sequence of the instruction queue.

7. The method of claim 1, wherein the main processing circuit comprises: a dependency relationship processing unit;

the dependency relation processing unit determines whether a first operation instruction and a zeroth operation instruction before the first operation instruction have an association relation, if so, the first operation instruction is cached in the instruction storage unit, and after the execution of the zeroth operation instruction is finished, the first operation instruction is extracted from the instruction storage unit and transmitted to the operation unit;

extracting a first storage address interval of required data in the first operation instruction according to the first operation instruction, extracting a zeroth storage address interval of required data in the zeroth operation instruction according to the zeroth operation instruction, determining that the first operation instruction and the zeroth operation instruction have an association relation if the first storage address interval and the zeroth storage address interval have overlapping areas, and determining that the first operation instruction and the zeroth operation instruction do not have an association relation if the first storage address interval and the zeroth storage address interval do not have overlapping areas.

8. The method according to claim 1, wherein the arithmetic unit comprises: a tree module, the tree module comprising: a root port and a plurality of branch ports, wherein the root port of the tree module is connected with the main processing circuit, and the plurality of branch ports of the tree module are each connected with one of the plurality of slave processing circuits;

the tree module forwards data blocks, weights and operation instructions between the master processing circuit and the plurality of slave processing circuits.

9. The method of claim 1, wherein the arithmetic unit further comprises one or more branch processing circuits, each branch processing circuit being coupled to at least one slave processing circuit;

the main processing circuit determines that the input neuron is broadcast data, the weight is distribution data, the distribution data is distributed into a plurality of data blocks, and at least one data block in the plurality of data blocks, the broadcast data and at least one operation instruction in a plurality of operation instructions are sent to the branch processing circuit;

the branch processing circuit forwards data blocks, broadcast data and operation instructions between the master processing circuit and the plurality of slave processing circuits;

The plurality of slave processing circuits execute operation on the received data blocks and broadcast data according to the operation instruction to obtain intermediate results, and the intermediate results are transmitted to the branch processing circuits;

and the main processing circuit performs subsequent processing on the intermediate result sent by the branch processing circuit to obtain a first operation result of the calculation instruction, and sends the first operation result of the calculation instruction to the controller unit.

10. The method according to any one of claims 8-9, wherein,

the main processing circuit performs combined sequencing on intermediate results sent by the processing circuits to obtain a first operation result of the calculation instruction;

or the main processing circuit performs combined sequencing on the intermediate results sent by the processing circuits and obtains a first operation result of the calculation instruction after activation processing.

11. The method of any of claims 8-9, wherein the main processing circuit comprises: one or any combination of a conversion processing circuit, an activation processing circuit and an addition processing circuit;

the conversion processing circuit performs preamble processing on the first operation data, specifically: executing interchange between the first data structure and the second data structure on the data or intermediate result received by the main processing circuit; or the data or intermediate result received by the main processing circuit is exchanged between the first data type and the second data type;

The activation processing circuit executes the subsequent processing, specifically, executes the activation operation of the data in the main processing circuit;

the addition processing circuit performs the subsequent processing, specifically, performs an addition operation or an accumulation operation.

12. The method of claim 9, wherein the slave processing circuit comprises: a multiplication processing circuit;

and the multiplication processing circuit performs product operation on the received data blocks to obtain a product result.

13. The method of claim 12, wherein the slave processing circuit further comprises: an accumulation processing circuit, which performs an accumulation operation on the product result to obtain the intermediate result.

14. The method of claim 8, wherein the tree module is an N-ary tree structure, and N is an integer greater than or equal to 2.

15. The method according to claim 1, wherein the method further comprises: accessing a storage device in the network-on-chip processing system through a second computing device in the network-on-chip processing system to acquire second operation data.

16. The method of claim 15, wherein the method further comprises: performing an operation on the second operation data and the first operation result through the second computing device in the network-on-chip processing system to obtain a second operation result.

17. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-16.

18. A network-on-chip data processing apparatus for performing machine learning calculations, comprising:

a first operation data acquisition module, configured to access a storage device in the network-on-chip processing system through a first computing device in the network-on-chip processing system to acquire first operation data;

an operation module, configured to perform an operation on the first operation data through the first computing device to obtain a first operation result;

a first operation result sending module, configured to send the first operation result to a second computing device in the network-on-chip processing system;

wherein the computing device comprises an arithmetic unit and a controller unit;

the accessing, by the first computing device in the network-on-chip processing system, the storage device in the network-on-chip processing system to obtain first operation data includes: a controller unit in the first computing device acquires the first operation data and a computing instruction from the storage device;

establishing a connection between each computing device in each network-on-chip processing system and one or more computing devices in other network-on-chip processing systems;

the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits;

the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with other adjacent slave processing circuits, the master processing circuit is connected with K slave processing circuits in the plurality of slave processing circuits, and the K slave processing circuits are: n slave processing circuits of the 1st row, n slave processing circuits of the m-th row, and m slave processing circuits of the 1st column;

forwarding of data and instructions between the master processing circuit and the plurality of slave processing circuits by the K slave processing circuits;

the main processing circuit determines that the input neuron is broadcast data, the weight is distribution data, the distribution data is distributed into a plurality of data blocks, and at least one data block in the plurality of data blocks and at least one operation instruction in a plurality of operation instructions are sent to the K slave processing circuits;

the K slave processing circuits convert data between the master processing circuit and the plurality of slave processing circuits;

the plurality of slave processing circuits execute operation on the received data blocks according to the operation instruction to obtain an intermediate result, and the operation result is transmitted to the K slave processing circuits;

and the main processing circuit performs subsequent processing on the intermediate results sent by the K slave processing circuits to obtain a first operation result of the calculation instruction, and sends the first operation result of the calculation instruction to the controller unit.
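The array arrangement and the broadcast/distribute flow recited above can be sketched as follows. Everything in this block is an illustrative assumption: the 0-indexed grid coordinates, the helper names `border_slaves` and `master_slave_forward`, the choice of one weight row per data block, and the use of ReLU as the subsequent processing. None of these is stated in the claim.

```python
def border_slaves(m, n):
    """Positions of the slave circuits wired to the master in an m x n array:
    the first row, the last (m-th) row, and the first column (0-indexed)."""
    first_row = {(0, c) for c in range(n)}
    last_row = {(m - 1, c) for c in range(n)}
    first_col = {(r, 0) for r in range(m)}
    return first_row | last_row | first_col


def master_slave_forward(weights, input_neurons):
    """Weights are distribution data (one row per data block, an assumption);
    input neurons are broadcast data seen by every slave."""
    intermediate_results = []
    for data_block in weights:  # each block goes to one slave
        # Slave: multiply the block by the broadcast data, then accumulate.
        acc = sum(w * x for w, x in zip(data_block, input_neurons))
        intermediate_results.append(acc)
    # Master: subsequent processing, here an assumed ReLU activation.
    return [max(0, r) for r in intermediate_results]


# Assumed example: a 3 x 4 slave array and a 3 x 2 weight matrix.
connected = border_slaves(3, 4)
result = master_slave_forward([[1, -2], [3, 4], [-5, 6]], [1, 2])
```

Under this reading the two corner positions of the first column are shared with the first and last rows, so a set of grid positions holds each of them once. Only the border circuits need a direct link to the master; interior slave circuits receive their data blocks and return their results by way of adjacent circuits.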