CN111079908A - Network-on-chip data processing method, storage medium, computer device and apparatus - Google Patents


Info

Publication number
CN111079908A
CN111079908A (application CN201811216857.1A)
Authority
CN
China
Prior art keywords
computing device
network
data
chip
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811216857.1A
Other languages
Chinese (zh)
Other versions
CN111079908B (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to CN201811216857.1A priority Critical patent/CN111079908B/en
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to KR1020207033053A priority patent/KR20200139829A/en
Priority to JP2020569113A priority patent/JP7060720B2/en
Priority to KR1020207034126A priority patent/KR102539571B1/en
Priority to EP21217802.4A priority patent/EP4009185A1/en
Priority to EP21217804.0A priority patent/EP4009186A1/en
Priority to EP21217809.9A priority patent/EP4009183A1/en
Priority to EP21217811.5A priority patent/EP4009184A1/en
Priority to EP19873122.6A priority patent/EP3869352A4/en
Priority to PCT/CN2019/111977 priority patent/WO2020078470A1/en
Priority to US17/278,812 priority patent/US20220035762A1/en
Publication of CN111079908A publication Critical patent/CN111079908A/en
Priority to JP2020206306A priority patent/JP7074833B2/en
Priority to JP2020206281A priority patent/JP7074831B2/en
Priority to JP2020206272A priority patent/JP7053775B2/en
Priority to JP2020206293A priority patent/JP7074832B2/en
Priority to US17/564,431 priority patent/US11880329B2/en
Priority to US17/564,492 priority patent/US11880330B2/en
Priority to US17/564,398 priority patent/US11880328B2/en
Priority to US17/564,411 priority patent/US11809360B2/en
Priority to US17/564,366 priority patent/US20220156215A1/en
Priority to US17/564,389 priority patent/US11841816B2/en
Priority to US17/564,579 priority patent/US11960431B2/en
Priority to US17/564,560 priority patent/US20220121603A1/en
Priority to US17/564,529 priority patent/US11868299B2/en
Priority to US17/564,509 priority patent/US11797467B2/en
Application granted
Publication of CN111079908B publication Critical patent/CN111079908B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7825Globally asynchronous, locally synchronous, e.g. network on chip
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The present application relates to a network-on-chip data processing method applied to a network-on-chip processing system. The network-on-chip processing system is configured to perform machine learning computation and includes a storage device and a computing device. The method comprises: accessing a storage device in the network-on-chip processing system through a first computing device in the network-on-chip processing system to acquire first operation data; operating on the first operation data through the first computing device to obtain a first operation result; and sending the first operation result to a second computing device in the network-on-chip processing system. The method can reduce operation overhead and improve data read-write efficiency.

Description

Network-on-chip data processing method, storage medium, computer device and apparatus
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a network-on-chip data processing method, a storage medium, a computer device, and an apparatus.
Background
With the development of semiconductor process technology, integrating hundreds of millions of transistors on a single chip has become a reality. A Network on Chip (NoC) can integrate a large amount of computing resources on a single chip and enable on-chip communication.
Neural networks require a large amount of computation, part of which must be processed in parallel, such as forward operations, backward operations, and weight updates. In a chip architecture with a vast number of transistors, chip design faces problems such as high memory-access overhead, bandwidth congestion, and low data read-write efficiency.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a network-on-chip data processing method, a storage medium, a computer device, and an apparatus, which can reduce computation overhead and improve data read-write efficiency.
In a first aspect, a method for processing network-on-chip data is provided, where the method is applied to a network-on-chip processing system, where the network-on-chip processing system is configured to perform machine learning computation, and the network-on-chip processing system includes: a storage device and a computing device; the method comprises the following steps:
accessing a storage device in the network-on-chip processing system through a first computing device in the network-on-chip processing system to acquire first operation data;
operating the first operation data through the first computing device to obtain a first operation result;
and sending the first operation result to a second computing device in the network-on-chip processing system.
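The three steps of the first aspect can be sketched as a small simulation. This is an illustrative sketch only: the class names, the `process` helper, and the element-wise doubling that stands in for an arbitrary machine-learning operation are assumptions for demonstration, not part of the patent.

```python
# Hypothetical sketch of the three-step method: a first computing device
# reads operation data from on-chip storage, operates on it, and sends
# the result to a second computing device. All names are illustrative.

class StorageDevice:
    """Stands in for the on-chip storage device."""
    def __init__(self, data):
        self._data = dict(data)

    def read(self, key):
        return self._data[key]

class ComputingDevice:
    """Stands in for a computing device with an inbox for received results."""
    def __init__(self, name):
        self.name = name
        self.inbox = []

    def receive(self, result):
        self.inbox.append(result)

def process(storage, first, second, key):
    # Step 1: access the storage device to acquire the first operation data.
    operand = storage.read(key)
    # Step 2: operate on the data to obtain a first operation result
    # (doubling stands in for an arbitrary machine-learning operation).
    result = [2 * x for x in operand]
    # Step 3: send the first operation result to the second computing device.
    second.receive(result)
    return result

storage = StorageDevice({"weights": [1, 2, 3]})
dev1, dev2 = ComputingDevice("dev1"), ComputingDevice("dev2")
print(process(storage, dev1, dev2, "weights"))  # → [2, 4, 6]
```

Note that only the first device touches storage here; the second device obtains the result over the device-to-device link, which is the point of the claimed method.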
In a second aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps mentioned in the above method.
In a third aspect, an embodiment of the present application provides a network-on-chip data processing system, including a memory, a multi-core processor, and a computer program stored on the memory and operable on the multi-core processor, where the multi-core processor implements the steps mentioned in the above method when executing the computer program.
In a fourth aspect, an embodiment of the present application provides an on-chip network data processing apparatus, where the on-chip network data processing apparatus is configured to perform machine learning computation, and the apparatus includes:
the acquisition module is used for accessing a storage device in the network-on-chip processing system through a first computing device in the network-on-chip processing system to acquire first operation data;
the operation module is used for operating the first operation data through the first computing device to obtain a first operation result;
and the sending module is used for sending the first operation result to a second computing device in the network-on-chip processing system.
According to the network-on-chip data processing method, the storage medium, the computer device, and the apparatus, the plurality of computing devices arranged on the same chip are interconnected so that data can be transmitted among them. During computation, the input data and the intermediate results generated are time-shared and multiplexed, which reduces the energy overhead of memory access, relieves storage-bandwidth congestion, and improves data read-write efficiency.
Drawings
FIG. 1 is a block diagram of a network-on-chip processing system 1100 in one embodiment;
FIG. 2 is a block diagram of a network-on-chip processing system 1200 in one embodiment;
FIG. 3 is a block diagram of a network-on-chip processing system 1300 in one embodiment;
FIG. 4 is a block diagram of a network-on-chip processing system 1400 in one embodiment;
FIG. 5a is a block diagram of a network-on-chip processing system 1500 in one embodiment;
FIG. 5b is a schematic diagram of a network-on-chip processing system 15000 in an embodiment;
FIG. 6 is a block diagram of a network-on-chip processing system 1600 in one embodiment;
FIG. 7 is a block diagram of a network-on-chip processing system 1700 in one embodiment;
FIG. 8 is a block diagram of a network-on-chip processing system 1800 in one embodiment;
FIG. 9 is a block diagram of a network-on-chip processing system 1900 in one embodiment;
FIG. 10a is a diagram illustrating an architecture of a network-on-chip processing system 1910 in one embodiment;
FIG. 10b is a block diagram of a network-on-chip processing system 19100, according to an embodiment;
FIG. 11 is a block diagram of a network-on-chip processing system 1920 in one embodiment;
FIG. 12 is a block diagram of a network-on-chip processing system 1930 in one embodiment;
FIG. 13 is a schematic diagram of a computing device in one embodiment;
FIG. 14 is a schematic diagram of a computing device in accordance with another embodiment;
FIG. 15 is a schematic diagram of the main processing circuitry in one embodiment;
FIG. 16 is a schematic diagram of a computing device in accordance with another embodiment;
FIG. 17 is a schematic diagram of a computing device in accordance with another embodiment;
FIG. 18 is a block diagram of a tree module in accordance with one embodiment;
FIG. 19 is a schematic diagram of a computing device in accordance with another embodiment;
FIG. 20 is a schematic diagram of a computing device in accordance with another embodiment;
FIG. 21 is a schematic diagram of a computing device in accordance with another embodiment;
FIG. 22 is a schematic structural diagram of a combined processing apparatus in one embodiment;
FIG. 23 is a schematic structural diagram of a combined processing apparatus according to another embodiment;
FIG. 24 is a schematic structural diagram of a board card in one embodiment;
FIG. 25 is a flowchart illustrating a method of network-on-chip data processing according to an embodiment;
FIG. 26 is a flowchart illustrating a method for network-on-chip data processing according to another embodiment;
FIG. 27 is a flowchart illustrating a method for network-on-chip data processing according to another embodiment;
FIG. 28 is a flowchart illustrating a method for network-on-chip data processing according to another embodiment;
FIG. 29 is a flowchart illustrating a method for network-on-chip data processing according to another embodiment;
FIG. 30 is a flowchart illustrating a network-on-chip data processing method according to another embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In one embodiment, there is provided an on-chip network processing system, comprising: the storage device and the plurality of computing devices are arranged on the same chip, wherein at least one computing device is connected with the storage device, and at least two computing devices are connected with each other.
A Network on Chip (NoC) is an on-chip communication network that integrates a large number of computing resources on a single chip and interconnects those resources. Each module on the chip is connected to the network through a standardized network interface and communicates with destination modules over shared network resources. Specifically, the storage device and the plurality of computing devices being disposed on the same chip means that they are integrated on the same chip. The processor cores and the off-chip storage devices are connected through the NoC, which also supports communication among the processor's multiple cores.
The network-on-chip processing system in the embodiments of the present application implements on-chip communication based on the NoC. In addition, the system supports both on-chip and off-chip storage: the operation data processed by the neural network processor can be stored in either an on-chip or an off-chip storage device. Since the on-chip memory capacity of the network-on-chip processing system is limited, the operation data and the intermediate results generated during computation can be temporarily stored in the off-chip storage device and read back on chip when needed. In the embodiments of the present application, the storage devices in the network-on-chip processing system all refer to on-chip storage devices, and a computing device in the network-on-chip processing system includes a neural network processor.
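The on-chip/off-chip split described above can be illustrated with a minimal two-level store: data lives in a small on-chip area and spills to off-chip memory when capacity runs out, being read back on demand. The class name, capacity policy, and oldest-first eviction are assumptions for the sketch, not specified by the patent.

```python
# Minimal sketch (names and eviction policy assumed) of limited on-chip
# storage backed by off-chip memory, as described in the text above.

class TwoLevelStorage:
    def __init__(self, on_chip_capacity):
        self.on_chip = {}        # limited-capacity on-chip store
        self.off_chip = {}       # unbounded off-chip store
        self.capacity = on_chip_capacity

    def write(self, key, value):
        if len(self.on_chip) >= self.capacity:
            # spill the oldest on-chip entry to off-chip memory
            old_key, old_val = next(iter(self.on_chip.items()))
            del self.on_chip[old_key]
            self.off_chip[old_key] = old_val
        self.on_chip[key] = value

    def read(self, key):
        if key in self.on_chip:
            return self.on_chip[key]
        # read back from off-chip storage when needed
        value = self.off_chip.pop(key)
        self.write(key, value)
        return value

s = TwoLevelStorage(on_chip_capacity=2)
s.write("a", 1); s.write("b", 2); s.write("c", 3)   # "a" spills off chip
print(sorted(s.on_chip), sorted(s.off_chip))  # → ['b', 'c'] ['a']
print(s.read("a"))  # → 1  ("a" returns on chip; "b" spills)
```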
In one embodiment, there is provided an on-chip network processing system, comprising: the storage device and the plurality of computing devices are arranged on the same chip, wherein the first computing device is connected with the storage device, and at least one second computing device in the plurality of second computing devices is connected with the first computing device.
In one embodiment, a neural network chip is provided, the chip comprising: the device comprises a storage device, a plurality of computing devices, a first interconnection device and a second interconnection device, wherein at least one computing device is connected with the storage device through the first interconnection device, and the computing devices are connected with each other through the second interconnection device. Furthermore, the computing devices can realize read-write operation on the storage device through the first interconnection device, and data transmission can be performed among the plurality of computing devices through the second interconnection device.
As shown in fig. 1, a network-on-chip processing system 1100 is provided for an embodiment. The network-on-chip processing system 1100 includes a storage device 1101, a first computing device 1102, a second computing device 1103, and a second computing device 1104, all of which are arranged on the same chip of the network-on-chip processing system 1100. The first computing device 1102 is connected to the storage device 1101, the second computing device 1103 is connected to the first computing device 1102, and the second computing device 1103 is also connected to the second computing device 1104. Only the first computing device 1102 can access the storage device 1101, that is, only the first computing device 1102 can read data from and write data to the storage device 1101, while the first computing device 1102, the second computing device 1103, and the second computing device 1104 can transfer data among one another.
Specifically, when the second computing device 1104 needs to read data, the first computing device 1102 accesses the storage device 1101 and reads the data needed by the second computing device 1104 from the storage device 1101; the first computing device 1102 then sends the data to the second computing device 1103, which forwards it to the second computing device 1104. Alternatively, the first computing device 1102, the second computing device 1103, and the second computing device 1104 may all be connected to the storage device 1101; it suffices that at least one of them is connected to the storage device 1101, and no specific limitation is imposed here. Likewise, the second computing device 1103 may be connected to the second computing device 1104 or to the first computing device 1102; it suffices that at least two of the three computing devices are connected to each other, and no specific limitation is imposed here.
As shown in fig. 2, a network-on-chip processing system 1200 is provided for an embodiment. The network-on-chip processing system 1200 includes a storage device 1201, a first computing device 1202, a second computing device 1203, and a second computing device 1204, all of which are arranged on the same chip of the network-on-chip processing system 1200. The first computing device 1202 is connected to the storage device 1201, and the second computing device 1203 and the second computing device 1204 are both directly connected to the first computing device 1202; that is, the second computing device 1204 is connected to both the second computing device 1203 and the first computing device 1202 and does not need to reach the first computing device 1202 through the second computing device 1203. Only the first computing device 1202 has access to the storage device 1201, that is, only the first computing device 1202 can read data from and write data to the storage device 1201, while the first computing device 1202, the second computing device 1203, and the second computing device 1204 can transfer data among one another.
Specifically, when the second computing device 1204 needs to read data, the first computing device 1202 accesses the storage device 1201, reads the data needed by the second computing device 1204 from the storage device 1201, and sends the data directly to the second computing device 1204 without forwarding through the second computing device 1203. Alternatively, the first computing device 1202, the second computing device 1203, and the second computing device 1204 may all be connected to the storage device 1201; it suffices that at least one of them is connected to the storage device 1201, and no specific limitation is imposed here. Likewise, the second computing device 1203 may be connected to the second computing device 1204 or to the first computing device 1202; it suffices that at least two of the three computing devices are connected to each other, and no specific limitation is imposed here.
In the above network-on-chip processing system, connections are established among the plurality of computing devices arranged on the same chip, so that data can be transmitted among them. This avoids the excessive connection-bandwidth overhead caused by multiple computing devices reading data from the storage device at once, and improves data read-write efficiency.
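The Fig. 1 arrangement can be sketched as hop-by-hop forwarding along a chain in which only the head device is wired to storage. The function name, the dictionary standing in for storage, and the device labels are assumptions for illustration.

```python
# Hedged sketch of the Fig. 1 topology: only the first computing device
# touches storage; a read for a downstream device is forwarded along the
# chain dev1102 -> dev1103 -> dev1104.

def fetch_via_chain(storage, chain, target, key):
    """Read `key` through the head of `chain`, then forward hop by hop
    until `target` is reached; returns the data and the hops taken."""
    data = storage[key]          # only chain[0] accesses the storage device
    hops = []
    for dev in chain:            # data travels down the chain of devices
        hops.append(dev)
        if dev == target:
            break
    return data, hops

storage = {"x": 42}
data, hops = fetch_via_chain(storage, ["dev1102", "dev1103", "dev1104"],
                             "dev1104", "x")
print(data, hops)  # → 42 ['dev1102', 'dev1103', 'dev1104']
```

In the Fig. 2 variant the same call would take a chain of length two, since the second computing device is wired directly to the first.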
In one embodiment, an on-chip network processing system is provided, the system comprising: the storage device and the plurality of computing devices are arranged on the same chip, wherein each computing device in the plurality of computing devices is connected with the storage device, and at least two computing devices are connected with each other.
As shown in fig. 3, a network-on-chip processing system 1300 is provided for an embodiment. The network-on-chip processing system 1300 includes a storage device 1301, a computing device 1302, a computing device 1303, and a computing device 1304, all of which are arranged on the same chip of the network-on-chip processing system 1300. The computing device 1302, the computing device 1303, and the computing device 1304 are all connected to the storage device 1301; the computing device 1302 and the computing device 1303 are connected to each other, and the computing device 1303 and the computing device 1304 are connected to each other. The computing device 1302, the computing device 1303, and the computing device 1304 each have access to the storage device 1301; data can be transferred between the computing device 1302 and the computing device 1303, and between the computing device 1303 and the computing device 1304.
Specifically, when the computing device 1304 needs to read data, it may directly access the storage device 1301; alternatively, the computing device 1303 may access the storage device 1301, read the data needed by the computing device 1304, and send it to the computing device 1304; or the computing device 1302 may access the storage device 1301, read the data needed by the computing device 1304, and send it to the computing device 1303, which then forwards it to the computing device 1304. Optionally, it suffices that at least one of the computing device 1302, the computing device 1303, and the computing device 1304 is connected to the storage device 1301, and no specific limitation is imposed here. Likewise, it suffices that at least two of the computing devices are connected to each other, and no specific limitation is imposed here.
In the above network-on-chip processing system, connections are established among the plurality of computing devices arranged on the same chip, so that data required by any computing device can be transmitted among them. This reduces the number of computing devices reading from the storage device's interface at the same time and relieves bandwidth congestion.
As shown in fig. 4, a network-on-chip processing system 1400 is provided for an embodiment. The network-on-chip processing system 1400 includes a storage device 1401, a computing device 1402, a computing device 1403, and a computing device 1404, all of which are arranged on the same chip of the network-on-chip processing system 1400. The computing device 1402, the computing device 1403, and the computing device 1404 are all connected to the storage device 1401, and the computing device 1402, the computing device 1403, and the computing device 1404 are connected to one another. The computing device 1402, the computing device 1403, and the computing device 1404 each have access to the storage device 1401, and the three computing devices can transfer data with one another.
Specifically, when the computing device 1404 needs to read data, it may directly access the storage device 1401; alternatively, the computing device 1403 may access the storage device 1401, read the data needed by the computing device 1404, and send it to the computing device 1404; the computing device 1402 may also access the storage device 1401, read the data needed by the computing device 1404, and send it directly to the computing device 1404 without forwarding through the computing device 1403. Optionally, it suffices that at least one of the computing device 1402, the computing device 1403, and the computing device 1404 is connected to the storage device 1401, and no specific limitation is imposed here. Likewise, it suffices that at least two of the computing devices are connected to each other, and no specific limitation is imposed here.
In the above network-on-chip processing system, direct connections are established among the plurality of computing devices arranged on the same chip, which improves data read-write efficiency.
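The efficiency gain of the fully connected Fig. 4 variant over the chained Fig. 3 variant can be illustrated by counting forwarding hops between devices. The breadth-first search and the adjacency dictionaries below are an illustrative sketch, not anything specified by the patent.

```python
# Illustrative hop-count comparison: in a chain (Fig. 3) a transfer
# between end devices pays one hop per intermediate device, while in a
# fully connected group (Fig. 4) any device reaches any other in one hop.
from collections import deque

def hop_count(adjacency, src, dst):
    """Breadth-first shortest hop count between two devices."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # unreachable

# Chain topology (Fig. 3 style) vs. fully connected topology (Fig. 4 style).
chain = {"d1": ["d2"], "d2": ["d1", "d3"], "d3": ["d2"]}
full = {"d1": ["d2", "d3"], "d2": ["d1", "d3"], "d3": ["d1", "d2"]}
print(hop_count(chain, "d1", "d3"), hop_count(full, "d1", "d3"))  # → 2 1
```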
In one embodiment, there is provided an on-chip network processing system, comprising: the storage device and the plurality of computing device groups are arranged on the same chip, each computing device group comprises a plurality of computing devices, at least one computing device group in the plurality of computing device groups is connected with the storage device, and at least two computing device groups are connected with each other.
In one embodiment, a neural network chip is provided, the chip comprising: the storage device comprises a storage device, a plurality of computing device groups, a first interconnection device and a second interconnection device, wherein at least one computing device group in the computing device groups is connected with the storage device through the first interconnection device, and the computing device groups are connected through the second interconnection device. Furthermore, the computing device groups can realize read-write operation on the storage device through the first interconnection device, and data transmission can be carried out among the plurality of computing device groups through the second interconnection device.
As shown in fig. 5a, a network-on-chip processing system 1500 is provided for an embodiment. The network-on-chip processing system 1500 includes a storage device 1501 and six computing devices (computing device 1502 to computing device 1507), all arranged on the same chip of the network-on-chip processing system 1500. The six computing devices are divided into three groups: the computing device 1502 and the computing device 1503 form a first computing device group (cluster1), the computing device 1504 and the computing device 1505 form a second computing device group (cluster2), and the computing device 1506 and the computing device 1507 form a third computing device group (cluster3); cluster1 is the main computing device group, and cluster2 and cluster3 are sub-computing device groups. Only cluster1 is connected to the storage device 1501, while cluster1, cluster2, and cluster3 are connected to one another. The computing device 1502 in cluster1 is connected to the storage device 1501, the computing device 1503 in cluster1 is connected to the computing device 1504 in cluster2, and the computing device 1505 in cluster2 is connected to the computing device 1507 in cluster3.
Specifically, when cluster3 needs to read data, cluster1 accesses the storage device 1501, reads the data needed by cluster3 from the storage device 1501, and sends it to cluster2, which forwards it to cluster3. The plurality of computing devices may be divided into any number of groups, and the number of computing devices in each group is not specifically limited; preferably, one group includes four computing devices.
Alternatively, not all of the plurality of computing devices need to be connected to the storage device 1501; it suffices that at least one of the computing device groups is connected to the storage device 1501, and no specific limitation is imposed here. Likewise, cluster1 may be connected to either cluster2 or cluster3; it suffices that at least two of the three computing device groups are connected to each other, and no specific limitation is imposed here. Optionally, in each computing device group at least one computing device is connected to at least one computing device in another computing device group; that is, any computing device of cluster1 may establish a connection with the second computing device group, provided that at least one computing device in cluster1 is connected to at least one computing device in cluster2, and no specific limitation is imposed here. Optionally, the plurality of computing device groups are connected to each other through any computing device within them; that is, any computing device in cluster1 may be connected to any computing device in cluster2, and no specific limitation is imposed here.
As shown in fig. 5b, an on-chip network processing system 15000 is provided in an embodiment of the present invention. The on-chip network processing system 15000 includes a storage device 15010 and six computing devices (computing device 15020 to computing device 15070), all provided on the same chip of the network-on-chip processing system 15000. The six computing devices are divided into three groups: computing device 15020 and computing device 15030 form a first computing device group (cluster1), computing device 15040 and computing device 15050 form a second computing device group (cluster2), and computing device 15060 and computing device 15070 form a third computing device group (cluster3). cluster1 is the main computing device group, and cluster2 and cluster3 are sub-computing device groups. Only cluster1 is connected with the storage device 15010, while cluster1, cluster2 and cluster3 are connected with each other: computing device 15020 in cluster1 is connected to the storage device 15010, computing device 15030 in cluster1 is connected to computing device 15040 in cluster2, computing device 15050 in cluster2 is connected to computing device 15070 in cluster3, and computing device 15060 in cluster3 is connected to computing device 15020 in cluster1.
Specifically, when cluster3 needs to read data, cluster1 accesses the storage device 15010, reads the data required by cluster3 from the storage device 15010, and sends the data directly to cluster3 over the connection between computing device 15020 and computing device 15060. The plurality of computing devices may be divided into any number of groups, and the number of computing devices in each group is not specifically limited; preferably, one group includes four computing devices.
Optionally, not all of the plurality of computing devices are required to be connected to the storage device 15010; it is sufficient that at least one of the plurality of computing device groups is connected to the storage device 15010, which is not specifically limited herein. Optionally, cluster1 may be connected to either cluster2 or cluster3, as long as at least two of the three computing device groups are connected to each other, which is not specifically limited herein. Optionally, at least one computing device in each computing device group is connected to at least one computing device in another computing device group; that is, any computing device of cluster1 may establish a connection with cluster2, as long as at least one computing device in cluster1 is connected to at least one computing device in cluster2, which is not limited herein. Optionally, the plurality of computing device groups are connected to each other through any of their computing devices; that is, any computing device in cluster1 may be connected to any computing device in cluster2, which is not specifically limited herein.
In this network-on-chip processing system, the plurality of computing device groups disposed on the same chip establish connections among themselves to realize inter-group communication. Through inter-group data transmission, the system reduces the number of computing devices that read the storage device interface simultaneously, lowering the energy consumption overhead of memory access. Moreover, because the computing device groups on the same chip are interconnected in multiple ways, multiple communication channels exist among the computing devices, and the optimal channel can be selected for data transmission according to the current network congestion, thereby saving energy and improving data processing efficiency.
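The channel selection mentioned above can be sketched as a minimal policy: among the available channels between two devices, pick the one with the lowest current congestion. The channel list, device names, and congestion scores below are hypothetical; the patent does not specify a congestion metric.

```python
# Illustrative channel selection: choose the least congested of several
# communication channels between two computing devices.

def select_channel(channels):
    """channels: list of (path, congestion) pairs, where `path` is the
    sequence of devices the data would traverse and `congestion` is the
    current load estimate on that path (lower is better)."""
    return min(channels, key=lambda c: c[1])[0]

channels = [
    (["dev1502", "dev1503", "dev1504"], 0.7),  # multi-hop route, busy
    (["dev1502", "dev1506"],            0.2),  # direct route, lightly loaded
]
print(select_channel(channels))  # -> ['dev1502', 'dev1506']
```

In practice the congestion estimate would come from the interconnect hardware; here it is just a number supplied by the caller.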
In one embodiment, an on-chip network processing system is provided, the system comprising: the storage device and the plurality of computing device groups are arranged on the same chip, each computing device group comprises a plurality of computing devices, at least one computing device group in the plurality of computing device groups is connected with the storage device, and the computing device groups are connected with each other.
As shown in fig. 6, an on-chip network processing system 1600 is provided in one embodiment. The on-chip network processing system 1600 includes a storage device 1601 and six computing devices (computing device 1602 to computing device 1607), all disposed on the same chip of the network-on-chip processing system 1600. The six computing devices are divided into three groups: computing device 1602 and computing device 1603 form a first computing device group (cluster1), computing device 1604 and computing device 1605 form a second computing device group (cluster2), and computing device 1606 and computing device 1607 form a third computing device group (cluster3). cluster1, cluster2 and cluster3 are all connected with the storage device 1601; cluster1 and cluster2 are connected with each other, and cluster2 and cluster3 are connected with each other. Specifically, computing devices 1602 to 1607 are each connected to the storage device 1601, computing device 1603 in cluster1 is connected to computing device 1604 in cluster2, and computing device 1604 in cluster2 is connected to computing device 1607 in cluster3.
Specifically, when cluster3 needs to read data, either cluster2 accesses the storage device 1601, reads the data required by cluster3, and sends it to cluster3; or cluster1 accesses the storage device 1601, reads the data required by cluster3, sends it to cluster2, and cluster2 forwards it to cluster3. The plurality of computing devices may be divided into any number of groups, and the number of computing devices in each group is not specifically limited; preferably, one group includes four computing devices.
Optionally, not all of the plurality of computing devices are required to be connected to the storage device 1601; it is sufficient that at least one of the plurality of computing device groups is connected to the storage device 1601, which is not specifically limited herein. Optionally, any computing device of cluster1 may establish a connection with cluster2 and/or cluster3, as long as at least one computing device of cluster1 is connected with at least one computing device of cluster2 and/or cluster3, which is not limited herein. Optionally, any computing device in cluster1 may be interconnected with any computing device in cluster2 and/or cluster3, which is not specifically limited herein.
In this network-on-chip processing system, connections are established among the plurality of computing device groups disposed on the same chip, so that the data required by any computing device group can be transmitted between the groups. The system thus reduces the number of computing devices reading the storage device interface simultaneously, alleviating bandwidth congestion.
In one embodiment, an on-chip network processing system is provided, the system comprising: the storage device and the plurality of computing device groups are arranged on the same chip, each computing device group comprises a plurality of computing devices, at least one computing device group in the plurality of computing device groups is connected with the storage device, and any two computing device groups in the plurality of computing device groups are directly connected.
As shown in fig. 7, an on-chip network processing system 1700 is provided in an embodiment. The on-chip network processing system 1700 includes a storage device 1701 and six computing devices (computing device 1702 to computing device 1707), all disposed on the same chip of the network-on-chip processing system 1700. The six computing devices are divided into three groups: computing device 1702 and computing device 1703 form a first computing device group (cluster1), computing device 1704 and computing device 1705 form a second computing device group (cluster2), and computing device 1706 and computing device 1707 form a third computing device group (cluster3). cluster1, cluster2 and cluster3 are all connected with the storage device 1701, and any two of cluster1, cluster2 and cluster3 are directly connected. Specifically, computing devices 1702 to 1707 are all connected to the storage device 1701, computing device 1703 in cluster1 is connected to computing device 1704 in cluster2, computing device 1704 in cluster2 is connected to computing device 1707 in cluster3, and computing device 1702 in cluster1 is connected to computing device 1706 in cluster3.
Specifically, when cluster3 needs to read data, either cluster2 accesses the storage device 1701, reads the data required by cluster3, and sends it to cluster3; or cluster1 accesses the storage device 1701, reads the data required by cluster3, and sends the data directly to cluster3. The plurality of computing devices may be divided into any number of groups, and the number of computing devices in each group is not specifically limited; preferably, one group includes four computing devices.
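The benefit of the direct inter-group connections above can be illustrated with a short breadth-first search over the group topology: when every pair of groups is directly connected, as in fig. 7, data read by any group reaches any other group in a single hop. The link table is a hypothetical model of the figure, not data from the patent.

```python
# BFS over the inter-group links to find the shortest delivery route.
from collections import deque

def shortest_route(links, src, dst):
    """links: dict mapping each group to its directly connected groups.
    Returns the shortest path of groups from src to dst, or None."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in links[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Fully connected groups, as in fig. 7.
links = {
    "cluster1": ["cluster2", "cluster3"],
    "cluster2": ["cluster1", "cluster3"],
    "cluster3": ["cluster1", "cluster2"],
}
print(shortest_route(links, "cluster1", "cluster3"))  # -> ['cluster1', 'cluster3']
```

With the chain topology of fig. 5a, the same search would return a two-hop route via cluster2, which is what the direct connections of this embodiment avoid.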
Optionally, not all of the plurality of computing devices are required to be connected to the storage device 1701; it is sufficient that at least one of the plurality of computing device groups is connected to the storage device 1701, which is not specifically limited herein. Optionally, any computing device of cluster1 may establish a connection with cluster2 and cluster3, as long as at least one computing device of cluster1 is connected with at least one computing device of cluster2 and cluster3, which is not limited herein. Optionally, any computing device of cluster1 may be interconnected with any computing device of cluster2 and cluster3, which is not specifically limited herein.
In this network-on-chip processing system, direct connections are established among the plurality of computing device groups disposed on the same chip, which improves data read/write efficiency.
In one embodiment, there is provided an on-chip network processing system, comprising: the storage device and the computing device groups are arranged on the same chip, each computing device group comprises a plurality of computing devices, at least one computing device group in the computing device groups is connected with the storage device, at least two computing device groups are connected with each other, and the computing devices in each computing device group are connected with each other.
As shown in fig. 8, a network-on-chip processing system 1800 is provided for one embodiment. The network-on-chip processing system 1800 includes a storage device 1801 and six computing devices (computing device 1802 to computing device 1807), all disposed on the same chip of the network-on-chip processing system 1800. The six computing devices are divided into two groups: computing device 1802, computing device 1803 and computing device 1804 form a first computing device group (cluster1), and computing device 1805, computing device 1806 and computing device 1807 form a second computing device group (cluster2). cluster1 and cluster2 are each connected to the storage device 1801 and are connected to each other, the three computing devices in cluster1 are connected with each other, and the three computing devices in cluster2 are connected with each other. Specifically, computing devices 1802 to 1807 are each connected to the storage device 1801, computing device 1802 in cluster1 is connected to computing device 1805 in cluster2, computing device 1803 is connected to computing device 1802 and computing device 1804, and computing device 1806 is connected to computing device 1805 and computing device 1807. For the connections between the computing devices within each computing device group, reference may be made to the network-on-chip processing systems 1100 to 1400, which are not repeated herein.
Specifically, when cluster2 needs to read data, it can access the storage device 1801 directly; alternatively, cluster1 can access the storage device 1801, read the data required by cluster2, and send it to cluster2. The second computing device group can also transfer data within the group: when cluster2 needs to read data, computing device 1805, computing device 1806 and computing device 1807 in cluster2 can access the storage device 1801 simultaneously, each reading a portion of the data required by cluster2, and the portions can then be exchanged within cluster2. The plurality of computing devices may be divided into any number of groups, and the number of computing devices in each group is not specifically limited; preferably, one group includes four computing devices.
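The partitioned read described above can be sketched as follows, assuming a simple contiguous split: the three devices of cluster2 each fetch one slice of the required data from storage in parallel, then exchange slices over the intra-group links so every device ends up with the full data (an all-gather within the cluster). The device names and helper functions are illustrative.

```python
# Illustrative partitioned read: each device reads one slice from the
# storage interface, then slices are exchanged inside the group.

def split_slices(data, n):
    """Partition `data` into n contiguous slices of near-equal size."""
    k, r = divmod(len(data), n)
    out, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        out.append(data[start:end])
        start = end
    return out

def clustered_read(storage_data, devices):
    # Each device reads only its own slice from the storage interface...
    slices = split_slices(storage_data, len(devices))
    local = dict(zip(devices, slices))
    # ...then the slices are exchanged over the intra-group links.
    full = [x for dev in devices for x in local[dev]]
    return {dev: full for dev in devices}

result = clustered_read(list(range(7)), ["dev1805", "dev1806", "dev1807"])
print(result["dev1807"])  # every device holds the complete data
```

The point of the scheme is that the storage interface serves three smaller reads instead of three copies of the whole block, shifting traffic onto the intra-group links.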
Optionally, not all of the plurality of computing devices are required to be connected to the storage device 1801; it is sufficient that at least one of the two computing device groups is connected to the storage device 1801, which is not specifically limited herein. Optionally, any computing device of cluster1 may establish a connection with cluster2, as long as at least one computing device in cluster1 is connected with at least one computing device in cluster2, which is not limited herein. Optionally, any computing device in cluster1 may be interconnected with any computing device in cluster2, which is not specifically limited herein.
In this network-on-chip processing system, the computing device groups disposed on the same chip are connected with each other, and the computing devices within each group are also connected with each other, so that the computing devices can realize both intra-group and inter-group communication.
In one embodiment, an on-chip network processing system is provided, the system comprising: a plurality of interconnected network-on-chip processing modules disposed on the same chip, each network-on-chip processing module comprising at least one storage device and a plurality of computing devices, wherein in each network-on-chip processing module, at least one computing device is connected with at least one storage device inside that module, and at least two of the plurality of computing devices are connected with each other.
In one embodiment, a neural network chip is provided, the chip comprising a plurality of interconnected network-on-chip processing modules, each network-on-chip processing module comprising: at least one storage device, a plurality of computing devices, a first interconnection device and a second interconnection device, wherein in each network-on-chip processing module, at least one computing device is connected with at least one storage device inside that module through the first interconnection device, and the computing devices are connected with each other through the second interconnection device. In this way, the computing devices can read from and write to the storage device inside the network-on-chip processing module through the first interconnection device, and data can be transmitted among the plurality of computing devices through the second interconnection device.
As shown in fig. 9, for an on-chip network processing system 1900 provided in an embodiment of the present invention, the on-chip network processing system 1900 includes four on-chip network processing modules connected to each other, the four on-chip network processing modules are disposed on a same chip of the on-chip network processing system 1900, and each on-chip network processing module includes: one storage device 1901 and four computing devices (computing device 1902 to computing device 1905), wherein, in each network-on-chip processing module, the computing device 1902 is connected with the storage device 1901 inside its network-on-chip processing module, and the four computing devices inside each network-on-chip processing module are connected with each other.
Specifically, data to be processed by each network-on-chip processing module is stored in a storage device inside the network-on-chip processing module, that is, a plurality of computing devices in each network-on-chip processing module can only access the storage device inside the network-on-chip processing module, and can only read and write data from the storage device inside the network-on-chip processing module.
Optionally, the number of storage devices in each network-on-chip processing module is not limited to one; it may be two, three or more, preferably four, and is not specifically limited herein. Optionally, in each network-on-chip processing module, the multiple computing devices are connected to each other to form a network of computing devices; for the connection manner between the computing devices in each module, reference may be made to the network-on-chip processing systems 1100 to 1400, which is not repeated herein. Optionally, not all of the computing devices in each network-on-chip processing module are required to be connected to the storage device 1901, as long as at least one computing device in each module is connected to the storage device 1901, which is not limited herein.
Optionally, each computing device in each network-on-chip processing module may establish a connection with another network-on-chip processing module, and only at least one computing device in each network-on-chip processing module needs to be connected with at least one computing device in another network-on-chip processing module, which is not limited specifically herein. Optionally, the network-on-chip processing modules are connected to each other through any computing device in each network-on-chip processing module, that is, any computing device in each network-on-chip processing module may be connected to any computing device in another network-on-chip processing module, which is not limited herein.
In this network-on-chip processing system, connections are established among the network-on-chip processing modules disposed on the same chip, and also among the computing devices within each module, so that the computing devices can realize both intra-module and inter-module communication; the system thereby reduces the energy consumption overhead of memory access and improves the efficiency of data reading. Moreover, because the network-on-chip processing modules on the same chip are interconnected in multiple ways, multiple communication channels exist among the computing devices, and the optimal channel can be selected for data transmission according to the current network congestion, thereby saving energy and improving data processing efficiency.
In one embodiment, an on-chip network processing system is provided, the system comprising: a plurality of interconnected network-on-chip processing modules disposed on the same chip, each network-on-chip processing module comprising a plurality of storage devices and a plurality of computing devices, wherein in each network-on-chip processing module, at least one computing device is connected with the storage devices inside that module, and at least two of the computing devices are connected with each other.
As shown in fig. 10a, for an on-chip network processing system 1910 provided in an embodiment of the present invention, the on-chip network processing system 1910 includes four on-chip network processing modules connected to each other, where the four on-chip network processing modules are disposed on a same chip of the on-chip network processing system 1910, and each of the on-chip network processing modules includes: storage 1911, storage 1916, and four computing devices (computing devices 1912-1915), where in each network-on-chip processing module, the computing device 1912 is connected with the storage 1911 and the storage 1916 inside its network-on-chip processing module, and the four computing devices inside each network-on-chip processing module are connected with each other.
Specifically, data to be processed by each network-on-chip processing module is stored in a storage device inside the network-on-chip processing module, that is, a plurality of computing devices in each network-on-chip processing module can only access the storage device inside the network-on-chip processing module, and can only read and write data from the storage device inside the network-on-chip processing module. At least one computing device in each network-on-chip processing module establishes a connection with all storage devices in the network-on-chip processing module, that is, the computing device in each network-on-chip processing module has access to all storage devices in the network-on-chip processing module. The number of the storage devices in each network-on-chip processing module is not limited to two, and may be three, four or more, and is not specifically limited herein, and is preferably four.
Specifically, the computing devices in each network-on-chip processing module access adjacent storage devices with priority. An adjacent storage device refers to the storage device with the shortest communication distance among the plurality of storage devices connected to a computing device; that is, the storage device with the shortest communication distance has a higher access priority than the other storage devices.
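The access priority described above amounts to ordering a device's connected storage devices by communication distance and trying the nearest first. The distances below are hypothetical hop counts; the patent does not define the distance metric, and the storage names are taken from fig. 10a for illustration only.

```python
# Illustrative nearest-storage priority: sort a computing device's
# connected storage devices by ascending communication distance.

def access_order(connected_storages):
    """connected_storages: dict of storage name -> communication distance.
    Returns storage names sorted by ascending distance (highest priority first)."""
    return sorted(connected_storages, key=connected_storages.get)

# e.g. computing device 1912 is connected to two storage devices in its module
distances = {"storage1916": 2, "storage1911": 1}
print(access_order(distances))  # -> ['storage1911', 'storage1916']
```

A farther storage device remains reachable; it is simply consulted only when the nearer one cannot serve the request, which is the "certain flexibility" noted below.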
Optionally, in each network-on-chip processing module, the multiple computing devices are connected to form a computing device network, and a connection manner between the multiple computing devices in each network-on-chip processing module may refer to a connection manner from the network-on-chip processing system 1100 to the network-on-chip processing system 1400, which is not described herein again. Optionally, not all of the computing devices in each network-on-chip processing module are required to be connected to the storage device 1911, as long as at least one computing device in each network-on-chip processing module is connected to the storage device 1911, which is not specifically limited herein.
Optionally, each computing device in each network-on-chip processing module may establish a connection with another network-on-chip processing module, and only at least one computing device in each network-on-chip processing module needs to be ensured to be connected with at least one computing device in another network-on-chip processing module, which is not limited specifically herein. Optionally, the network-on-chip processing modules are connected to each other through any computing device in each network-on-chip processing module, that is, any computing device in each network-on-chip processing module may be connected to any computing device in another network-on-chip processing module, which is not limited herein.
In this network-on-chip processing system, each computing device can access all of the storage devices in its network-on-chip processing module, and multiple communication channels are available for data transmission, which improves data read/write efficiency; because each computing device preferentially accesses its adjacent storage device, memory access overhead is saved while a certain degree of flexibility is preserved.
In one embodiment, as shown in fig. 10b, the network-on-chip processing system 19100 is configured such that data required to be processed by each network-on-chip processing module is stored in a storage device inside the network-on-chip processing module, that is, a plurality of computing devices in each network-on-chip processing module can only access the storage device inside the network-on-chip processing module, and can only read and write data from the storage device inside the network-on-chip processing module. At least one computing device in each network-on-chip processing module establishes a connection with all storage devices in the network-on-chip processing module, that is, the computing device in each network-on-chip processing module has access to all storage devices in the network-on-chip processing module. The number of the storage devices in each network processing module is not limited to two, and may be three, four or more, and is not specifically limited herein, and is preferably four.
Specifically, in each network-on-chip processing module, each computing device is connected only to the storage device at the first communication distance, where the first communication distance refers to the shortest communication distance; that is, the computing device in each network-on-chip processing module can only access its adjacent storage device, the one with the shortest communication distance. For example, computing device 19120 can only access the adjacent storage device 19110 and cannot access the storage device 19160; when data that computing device 19120 needs to read is stored in the storage device 19160, the data must be read from the storage device 19160 by computing device 19130 and then transmitted to computing device 19120.
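The restricted-access case above can be sketched as follows: a device may read only from its nearest storage, so data held elsewhere is fetched by the device adjacent to that storage and relayed over the intra-module link. All identifiers (`fetch`, `nearest`, the data keys) are illustrative, not from the patent.

```python
# Illustrative relay read under nearest-storage-only access.

storages = {"storage19110": {"a": 1}, "storage19160": {"b": 2}}
nearest = {"dev19120": "storage19110", "dev19130": "storage19160"}

def read(device, storage, key):
    """Direct read, allowed only from the device's nearest storage."""
    if nearest[device] != storage:
        raise PermissionError(f"{device} cannot access {storage}")
    return storages[storage][key]

def fetch(device, storage, key, peers):
    """Read `key` from `storage` for `device`, relaying through a peer
    device whose nearest storage is `storage` when direct access is denied."""
    try:
        return read(device, storage, key)
    except PermissionError:
        relay = next(p for p in peers if nearest[p] == storage)
        return read(relay, storage, key)  # the relay reads, then forwards

print(fetch("dev19120", "storage19160", "b", peers=["dev19130"]))  # -> 2
```

This mirrors the example in the text: computing device 19120 obtains data from storage device 19160 only via computing device 19130.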
Optionally, in each network-on-chip processing module, the multiple computing devices are connected to form a computing device network, and a connection manner between the multiple computing devices in each network-on-chip processing module may refer to a connection manner from the network-on-chip processing system 1100 to the network-on-chip processing system 1400, which is not described herein again. Optionally, not all of the computing devices in each network-on-chip processing module are required to be connected to the storage device 19110, as long as at least one computing device in each network-on-chip processing module is connected to the storage device 19110, which is not specifically limited herein.
Optionally, each computing device in each network-on-chip processing module may establish a connection with another network-on-chip processing module, and only at least one computing device in each network-on-chip processing module needs to be ensured to be connected with at least one computing device in another network-on-chip processing module, which is not limited specifically herein. Optionally, the network-on-chip processing modules are connected to each other through any computing device in each network-on-chip processing module, that is, any computing device in each network-on-chip processing module may be connected to any computing device in another network-on-chip processing module, which is not limited herein.
In the network-on-chip processing system, each computing device can access all storage devices in the network-on-chip processing module and can provide a plurality of communication channels for data transmission, so that the data reading and writing efficiency is improved; each computing device in the system can only access the adjacent storage device, so that the access and storage overhead can be saved to the maximum extent.
In one embodiment, an on-chip network processing system is provided, the system comprising: a plurality of network-on-chip processing modules disposed on the same chip, any two of which are directly connected, each network-on-chip processing module comprising: at least one storage device and a plurality of computing devices, wherein in each network-on-chip processing module, at least one computing device is connected with at least one storage device inside that module, and at least two of the plurality of computing devices are connected with each other.
As shown in fig. 11, for an on-chip network processing system 1920 provided in an embodiment of the present invention, the on-chip network processing system 1920 includes four network-on-chip processing modules connected to each other, where the four network-on-chip processing modules are disposed on a same chip of the network-on-chip processing system 1920, any two network-on-chip processing modules of the four network-on-chip processing modules are directly connected to each other, and each network-on-chip processing module includes: one storage device 1921 and four computing devices (computing device 1922 to computing device 1925), wherein in each network processing module on chip, the computing device 1922 is connected with the storage device 1921 inside the network processing module on chip, and the four computing devices inside each network processing module on chip are connected with each other.
Specifically, data to be processed by each network-on-chip processing module is stored in a storage device inside the network-on-chip processing module, that is, a plurality of computing devices in each network-on-chip processing module can only access the storage device inside the network-on-chip processing module, and can only read and write data from the storage device inside the network-on-chip processing module.
Optionally, the number of the storage devices in each network-on-chip processing module is not limited to one, and may be two, three or more, and is not specifically limited herein, and is preferably four. Optionally, in each network-on-chip processing module, the multiple computing devices are connected to each other to form a network of computing devices, and a connection manner between the multiple computing devices in each network-on-chip processing module may refer to a connection manner from the network-on-chip processing system 1100 to the network-on-chip processing system 1400, which is not described herein again. Optionally, not all of the computing devices in each network-on-chip processing module are required to be connected to the storage device 1921, as long as at least one computing device in each network-on-chip processing module is connected to the storage device 1921, which is not limited herein.
Optionally, each computing device in each network-on-chip processing module may establish a connection with another network-on-chip processing module, and only at least one computing device in each network-on-chip processing module needs to be ensured to be connected with at least one computing device in another network-on-chip processing module, which is not limited specifically herein. Optionally, the network-on-chip processing modules are connected to each other through any computing device in each network-on-chip processing module, that is, any computing device in each network-on-chip processing module may be connected to any computing device in another network-on-chip processing module, which is not limited herein.
In this network-on-chip processing system, connections are established among the network-on-chip processing modules disposed on the same chip, and among the computing devices within each module, so that the computing devices can realize intra-module communication and any two network-on-chip processing modules can communicate directly with each other.
In one embodiment, there is provided a network-on-chip processing system in which each network-on-chip processing module comprises a plurality of storage devices, at least one computing device is connected with the storage devices inside its network-on-chip processing module, and at least two computing devices are connected with each other.
As shown in fig. 12, a network-on-chip processing system 1930 is provided in an embodiment of the present invention. The network-on-chip processing system 1930 includes four network-on-chip processing modules that are connected to each other and disposed on the same chip of the network-on-chip processing system 1930; any two of the four network-on-chip processing modules are directly connected to each other. Each network-on-chip processing module includes a storage device 1931, a storage device 1936, and four computing devices (computing devices 1932 to 1935), wherein, in each network-on-chip processing module, the computing device 1932 is connected with the storage devices 1931 and 1936 inside its module, and the four computing devices inside each module are connected with each other.
Specifically, data to be processed by each network-on-chip processing module is stored in the storage devices inside that module; that is, the multiple computing devices in each network-on-chip processing module can only access, and can only read and write data from, the storage devices inside their own module. The computing device in each network-on-chip processing module has priority access to adjacent storage devices.
Optionally, the number of storage devices in each network-on-chip processing module is not limited to two, and may be three, four or more; this is not specifically limited herein, and four is preferred. Specifically, at least one computing device in each network-on-chip processing module establishes a connection with all storage devices in its module; that is, that computing device has access to all storage devices in the module.
Optionally, in each network-on-chip processing module, the multiple computing devices are connected to form a computing device network; for the connection manner between the multiple computing devices in each network-on-chip processing module, reference may be made to the connection manners of the network-on-chip processing systems 1100 to 1400, which are not described herein again. Optionally, not all of the computing devices in each network-on-chip processing module are required to be connected to the storage device 1931; it suffices that at least one computing device in each network-on-chip processing module is connected to the storage device 1931, which is not limited herein.
Optionally, each computing device in each network-on-chip processing module may establish a connection with another network-on-chip processing module; it only needs to be ensured that at least one computing device in each network-on-chip processing module is connected with at least one computing device in another network-on-chip processing module, which is not specifically limited herein. Optionally, the network-on-chip processing modules are connected to each other through any computing device in each module; that is, any computing device in one network-on-chip processing module may be connected to any computing device in another network-on-chip processing module, which is not limited herein.
In the above network-on-chip processing system, each computing device can access all storage devices in its network-on-chip processing module, and direct communication between any two network-on-chip processing modules can be realized. The system can thus provide multiple communication channels for data transmission, which improves the read-write efficiency of data. In addition, each computing device in the system preferentially accesses its adjacent storage device, which saves memory-access overhead while preserving a certain flexibility.
In one embodiment, as shown in fig. 13, a computing device of a network-on-chip processing system may be configured to perform a machine learning computation. The computing device comprises: a controller unit 11 and an arithmetic unit 12, wherein the controller unit 11 is connected with the arithmetic unit 12, and the arithmetic unit 12 comprises: a master processing circuit and a plurality of slave processing circuits;
a controller unit 11 for acquiring input data and a calculation instruction; in an alternative, the input data and the calculation instruction may be obtained through a data input/output unit, and the data input/output unit may be one or more data I/O interfaces or I/O pins.
The above calculation instructions include, but are not limited to: a convolution operation instruction, a forward training instruction, or another neural network operation instruction; the present invention does not limit the specific expression of the above computation instruction.
The controller unit 11 is further configured to analyze the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;
a master processing circuit 101 configured to perform a preamble process on the input data and transmit data and an operation instruction with the plurality of slave processing circuits;
a plurality of slave processing circuits 102 configured to perform an intermediate operation in parallel according to the data and the operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;
and the main processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
In the technical solution provided by the present application, the arithmetic unit is arranged as a one-master multi-slave structure. For a calculation instruction of a forward operation, the data can be split according to that calculation instruction, so that the part with the larger amount of computation can be operated on in parallel by the plurality of slave processing circuits, thereby increasing the operation speed, saving operation time, and in turn reducing power consumption.
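As an illustration only, the one-master/multi-slave split described above can be sketched in plain Python. All names here are hypothetical, and threads stand in for the slave processing circuits of the actual hardware:

```python
# Sketch of the one-master/multi-slave structure: the master splits the input
# into blocks, the slaves operate on the blocks in parallel, and the master
# performs the subsequent processing (here, concatenating the partial results).
from concurrent.futures import ThreadPoolExecutor

def slave_compute(block, weight):
    # each "slave" handles the computation-heavy part on its own data block
    return [x * weight for x in block]

def master_run(input_data, weight, num_slaves=4):
    # master: split the input into one block per slave
    n = len(input_data)
    size = (n + num_slaves - 1) // num_slaves
    blocks = [input_data[i:i + size] for i in range(0, n, size)]
    # slaves: intermediate operations executed in parallel
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        partials = list(pool.map(lambda b: slave_compute(b, weight), blocks))
    # master: combine the intermediate results into the final result
    return [y for part in partials for y in part]

print(master_run([1, 2, 3, 4, 5, 6, 7, 8], 2))  # [2, 4, 6, 8, 10, 12, 14, 16]
```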
Optionally, the computing device may further include: the storage unit 10 and the direct memory access unit 150, the storage unit 10 may include: one or any combination of a register and a cache, specifically, the cache is used for storing the calculation instruction; the register is used for storing the input data and a scalar; the cache is a scratch pad cache. The direct memory access unit 150 is used to read or store data from the storage unit 10.
Optionally, the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;
an instruction storage unit 110, configured to store a calculation instruction associated with the artificial neural network operation;
the instruction processing unit 111 is configured to analyze the calculation instruction to obtain a plurality of operation instructions;
a store queue unit 113 for storing an instruction queue, the instruction queue comprising: a plurality of operation instructions and/or calculation instructions to be executed in the front-to-back order of the queue.
For example, in an alternative embodiment, the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, specifically configured to decode instructions into microinstructions. Of course, in another alternative, the slave arithmetic processing circuit may also include another controller unit that includes a slave instruction processing unit, specifically for receiving and processing microinstructions. The micro instruction may be a next-stage instruction of the instruction, and the micro instruction may be obtained by splitting or decoding the instruction, and may be further decoded into control signals of each component, each unit, or each processing circuit.
In one alternative, the structure of the calculation instruction may be as shown in Table 1.
TABLE 1
Operation code | Register or immediate | Register/immediate | ...
The ellipsis in the above table indicates that multiple registers or immediates may be included.
In another alternative, the computing instructions may include: one or more operation domains and an opcode. The computation instruction may comprise a neural network operation instruction. Taking the neural network operation instruction as an example, as shown in table 2, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation domains. Each of register number 0, register number 1, register number 2, register number 3, and register number 4 may be a number of one or more registers.
TABLE 2
(Table 2 appears as an image in the original publication; it shows a neural network operation instruction whose opcode is followed by the operation domains register number 0 through register number 4.)
The register may be an off-chip memory; in practical applications, it may also be an on-chip memory for storing data. The data may specifically be n-dimensional data, where n is an integer greater than or equal to 1: when n is 1 the data is 1-dimensional (a vector), when n is 2 it is 2-dimensional (a matrix), and when n is 3 or more it is a multidimensional tensor.
Optionally, the controller unit may further include:
the dependency processing unit 112 is configured to determine whether a first operation instruction has an association relationship with a zeroth operation instruction before the first operation instruction when there are multiple operation instructions, cache the first operation instruction in the instruction storage unit if the first operation instruction has an association relationship with the zeroth operation instruction, and extract the first operation instruction from the instruction storage unit and transmit the first operation instruction to the operation unit after the zeroth operation instruction is executed;
the determining whether the first operation instruction has an association relationship with a zeroth operation instruction before the first operation instruction comprises:
extracting a first storage address interval of required data (such as a matrix) in the first operation instruction according to the first operation instruction, extracting a zeroth storage address interval of the required matrix in the zeroth operation instruction according to the zeroth operation instruction, determining that the first operation instruction and the zeroth operation instruction have an association relation if the first storage address interval and the zeroth storage address interval have an overlapped area, and determining that the first operation instruction and the zeroth operation instruction do not have an association relation if the first storage address interval and the zeroth storage address interval do not have an overlapped area.
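The association test above is an interval-overlap check on storage addresses; a minimal sketch (instruction representation and names are hypothetical) could look like:

```python
# Sketch of the dependency processing unit's test: a first operation
# instruction depends on a zeroth one iff the storage address intervals of
# their required data overlap.
def intervals_overlap(first, zeroth):
    # each interval is (start_address, end_address), end exclusive
    s1, e1 = first
    s0, e0 = zeroth
    return s1 < e0 and s0 < e1

def has_dependency(first_instr, zeroth_instr):
    # an instruction here is just a dict carrying its data's address interval
    return intervals_overlap(first_instr["addr"], zeroth_instr["addr"])

# overlapping intervals -> the first instruction must wait for the zeroth
print(has_dependency({"addr": (100, 200)}, {"addr": (150, 250)}))  # True
# disjoint intervals -> no association, the instructions may proceed
print(has_dependency({"addr": (100, 200)}, {"addr": (200, 300)}))  # False
```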
In another alternative embodiment, the arithmetic unit 12 may include a master processing circuit 101 and a plurality of slave processing circuits 102, as shown in fig. 14. In one embodiment, as shown in fig. 14, the plurality of slave processing circuits are distributed in an array: each slave processing circuit is connected with the other adjacent slave processing circuits, and the master processing circuit is connected with k slave processing circuits among the plurality of slave processing circuits. It should be noted that, as shown in fig. 14, the k slave processing circuits include only the n slave processing circuits in the 1st row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the 1st column; that is, the k slave processing circuits are the slave processing circuits directly connected to the master processing circuit among the plurality of slave processing circuits.
The k slave processing circuits are used for forwarding data and instructions between the master processing circuit and the remaining slave processing circuits.
Optionally, as shown in fig. 15, the main processing circuit may further include: one or any combination of a conversion processing circuit 110, an activation processing circuit 111, and an addition processing circuit 112;
a conversion processing circuit 110 for performing an interchange between the first data structure and the second data structure (e.g., conversion of continuous data and discrete data) on the data block or intermediate result received by the main processing circuit; or performing an interchange between the first data type and the second data type (e.g., a fixed point type to floating point type conversion) on the data block or intermediate result received by the main processing circuitry;
an activation processing circuit 111 for performing an activation operation of data in the main processing circuit;
and an addition processing circuit 112 for performing addition operation or accumulation operation.
The master processing circuit is configured to determine that the input neuron is broadcast data, determine that a weight is distribution data, distribute the distribution data into a plurality of data blocks, and send at least one data block of the plurality of data blocks and at least one operation instruction of the plurality of operation instructions to the slave processing circuit;
the plurality of slave processing circuits are used for executing operations on the received data blocks according to the operation instruction to obtain intermediate results, and transmitting the intermediate results to the master processing circuit;
and the main processing circuit is used for processing the intermediate results sent by the plurality of slave processing circuits to obtain the result of the calculation instruction and sending the result of the calculation instruction to the controller unit.
The slave processing circuit includes: a multiplication processing circuit for performing a multiplication operation on the received data block to obtain a product result; an optional forwarding processing circuit for forwarding the received data block or the product result; and an accumulation processing circuit for performing an accumulation operation on the product results to obtain the intermediate result.
In another embodiment, the operation instruction is a matrix by matrix instruction, an accumulation instruction, an activate instruction, or the like.
The following describes a specific calculation method of the computing device shown in fig. 13 by means of a neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed may be: s = s(∑ w·x_i + b), wherein the weight w is multiplied by the input data x_i, the products are summed, the bias b is added, and the activation operation s(h) is performed to obtain the final output result s.
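A worked instance of s = s(∑ w·x_i + b) follows; the source does not fix a particular activation s(h), so a sigmoid is assumed purely for illustration:

```python
# Worked example of the neuron formula s = s(sum_i w_i * x_i + b),
# with a sigmoid assumed as the activation function s(h).
import math

def neuron_output(weights, inputs, bias):
    h = sum(w * x for w, x in zip(weights, inputs)) + bias  # weighted sum + bias
    return 1.0 / (1.0 + math.exp(-h))                       # activation s(h)

# h = 0.5*2.0 + (-0.25)*4.0 + 1.0 = 1.0, so s = sigmoid(1.0) ~= 0.7311
out = neuron_output([0.5, -0.25], [2.0, 4.0], 1.0)
print(round(out, 4))  # 0.7311
```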
In an alternative embodiment, as shown in fig. 16, the arithmetic unit includes: a tree module 40, the tree module comprising: a root port 401 and a plurality of branch ports 404, wherein the root port of the tree module is connected with the main processing circuit, and the branch ports of the tree module are respectively connected with one of the plurality of slave processing circuits;
the tree module has a transceiving function, for example, as shown in fig. 16, the tree module is a transmitting function, and as shown in fig. 17, the tree module is a receiving function.
The tree module is used for forwarding data blocks, weights and operation instructions between the main processing circuit and the plurality of slave processing circuits.
Optionally, the tree module is an optional component of the computing device and may include at least one layer of nodes. A node is a line structure with a forwarding function, and the node itself may have no computing function. If the tree module has zero layers of nodes, the tree module is not needed.
Alternatively, the tree module may have an n-ary tree structure, for example, a binary tree structure as shown in fig. 18, or may have a ternary tree structure, where n may be an integer greater than or equal to 2. The present embodiment is not limited to the specific value of n, the number of layers may be 2, and the slave processing circuit may be connected to nodes of other layers than the node of the penultimate layer, for example, the node of the penultimate layer shown in fig. 18.
Optionally, the arithmetic unit may carry a separate buffer, as shown in fig. 19, and may include: a neuron buffer unit, the neuron buffer unit 63 buffers the input neuron vector data and the output neuron value data of the slave processing circuit.
As shown in fig. 20, the arithmetic unit may further include: and a weight buffer unit 64, configured to buffer weight data required by the slave processing circuit in the calculation process.
In an alternative embodiment, the arithmetic unit 12, as shown in fig. 21, may include a branch processing circuit 103; the specific connection structure is shown in fig. 21, wherein,
the main processing circuit 101 is connected to branch processing circuit(s) 103, the branch processing circuit 103 being connected to one or more slave processing circuits 102;
a branch processing circuit 103 for forwarding data or instructions between the main processing circuit 101 and the slave processing circuits 102.
The application also discloses a neural network operation device, which includes one or more of the computing devices mentioned in this application and is used for acquiring data to be operated on and control information from other processing devices, executing a specified machine learning operation, and transmitting the execution result to peripheral equipment through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, wifi interfaces and servers. When more than one computing device is included, the computing devices can be linked and transmit data through a specific structure, for example interconnected through a PCIE bus, so as to support larger-scale machine learning operations. In this case, the computing devices may share the same control system or have separate control systems; they may share memory or each accelerator may have its own memory. In addition, the interconnection mode can be any interconnection topology.
The neural network arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
The application also discloses a combined processing device which comprises the neural network arithmetic device, the universal interconnection interface and other processing devices. The neural network arithmetic device interacts with other processing devices to jointly complete the operation designated by the user. Fig. 22 is a schematic view of a combined treatment apparatus.
The other processing devices include one or more types of general-purpose/special-purpose processors such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors; the number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the neural network operation device and external data and control: they perform data transportation and complete basic control of the neural network operation device, such as starting and stopping; the other processing devices can also cooperate with the neural network operation device to complete operation tasks.
And the universal interconnection interface is used for transmitting data and control instructions between the neural network arithmetic device and other processing devices. The neural network arithmetic device acquires required input data from other processing devices and writes the input data into a storage device on a neural network arithmetic device chip; control instructions can be obtained from other processing devices and written into a control cache on a neural network arithmetic device chip; the data in the storage module of the neural network arithmetic device can also be read and transmitted to other processing devices.
Optionally, as shown in fig. 23, the structure may further include a storage device, and the storage device is connected to the neural network operation device and the other processing device, respectively. The storage device is used for storing data in the neural network arithmetic device and the other processing devices, and is particularly suitable for data which needs to be calculated and cannot be stored in the internal storage of the neural network arithmetic device or the other processing devices.
The combined processing device can be used as the SOC (system on chip) of equipment such as mobile phones, robots, unmanned aerial vehicles and video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, display, mouse, keyboard, network card or wifi interface.
In some embodiments, a chip including the above neural network operation device or the combined processing device is also provided.
In some embodiments, a chip package structure is provided, which includes the above chip.
In some embodiments, a board card is provided, which includes the above chip package structure. Referring to fig. 24, fig. 24 provides a board card that may include, in addition to the chip 389, other components including but not limited to: a memory device 390, an interface device 391 and a control device 392;
the memory device 390 is connected to the chip in the chip package structure through a bus for storing data. The memory device may include a plurality of groups of memory cells 393. Each group of the storage units is connected with the chip through a bus. It is understood that each group of the memory cells may be a DDR SDRAM (Double Data Rate SDRAM).
DDR can double the speed of SDRAM without increasing the clock frequency, because DDR allows data to be read out on both the rising and falling edges of the clock pulse; DDR is thus twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of the storage units. Each group of the storage units may include a plurality of DDR4 particles (chips). In one embodiment, the chip may internally include 4 72-bit DDR4 controllers, where 64 bits of each 72-bit DDR4 controller are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 particles are used in each group of the storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
In one embodiment, each group of the memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. And a controller for controlling DDR is arranged in the chip and is used for controlling data transmission and data storage of each memory unit.
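The quoted bandwidth figure follows from a back-of-envelope calculation, assuming DDR4-3200 particles (3200 million transfers per second) and the 64-bit data path per controller described above:

```python
# Theoretical bandwidth of one group of memory cells:
# transfers/second * bytes per transfer, assuming DDR4-3200 and a 64-bit
# data path (the remaining 8 of the 72 controller bits carry ECC).
transfers_per_second = 3200e6   # DDR4-3200: 3200 MT/s (both clock edges)
bus_width_bytes = 64 // 8       # 64 data bits = 8 bytes per transfer
bandwidth_mb_s = transfers_per_second * bus_width_bytes / 1e6
print(bandwidth_mb_s)  # 25600.0 (MB/s)
```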
The interface device is electrically connected with the chip in the chip package structure. The interface device is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface, and the data to be processed is transmitted to the chip by the server through the standard PCIE interface, so that data transfer is implemented. Preferably, when a PCIE 3.0 x16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present application does not limit the specific form of the other interface, as long as the interface unit can implement the transfer function. In addition, the calculation results of the chip are still transmitted back to the external device (e.g., the server) by the interface device.
The control device is electrically connected with the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (MCU). The chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads; the chip can therefore be in different working states such as heavy load and light load. The control device can regulate and control the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
In some embodiments, an electronic device is provided that includes the above board card.
The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
In one embodiment, as shown in fig. 25, there is provided a network-on-chip data processing method, including the steps of:
step 202, accessing the storage device through the first computing device to obtain first operation data.
Wherein the first computing device comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in the first computing device acquires the first operation data and the computing instruction from the storage device.
Step 204, the first calculation device calculates the first calculation data to obtain a first calculation result.
And the first operation data read from the storage device is operated in the first computing device according to the corresponding computing instruction to obtain a first operation result.
Step 206, sending the first operation result to a second computing device.
The first computing device sends the first operation result to the second computing device via its controller unit, through a communication channel established between the first computing device and the second computing device. Optionally, the first operation result may be sent to the second computing device or to the storage device.
Further, the network-on-chip data processing method provided by this embodiment may be applied to any one of the network-on-chip processing systems shown in fig. 1 to 5.
According to the network-on-chip data processing method, the first operation result in the first computing device is sent to the second computing device, so that data transmission among a plurality of computing devices can be realized; meanwhile, the operation data are multiplexed, so that the condition that the bandwidth overhead is too large due to the fact that the computing device accesses the storage device for multiple times can be avoided, the operation data and the intermediate operation result can be reasonably utilized, and the data processing efficiency is improved.
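Steps 202 to 206 above can be sketched in plain Python; the classes and names below are hypothetical stand-ins for the hardware, meant only to show that the result travels device-to-device instead of going back through storage:

```python
# Sketch of steps 202-206: a first device reads operands from the shared
# storage device, operates on them, and forwards the result directly to a
# second device over an inter-device channel.
class Storage:
    def __init__(self, data):
        self.data = data        # simulated storage device contents

class ComputingDevice:
    def __init__(self, name):
        self.name = name
        self.inbox = []          # results received from peer devices

    def compute_from_storage(self, storage, key_a, key_b):
        # step 202: access the storage device for the first operation data;
        # step 204: operate on it (a multiply stands in for the instruction)
        return storage.data[key_a] * storage.data[key_b]

    def send(self, result, peer):
        # step 206: send the first operation result to the second device
        peer.inbox.append(result)

storage = Storage({"a": 3, "b": 4})
dev1, dev2 = ComputingDevice("dev1"), ComputingDevice("dev2")
r1 = dev1.compute_from_storage(storage, "a", "b")
dev1.send(r1, dev2)
print(dev2.inbox)  # [12] - dev2 can reuse r1 without re-reading storage
```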
In one embodiment, as shown in fig. 26, a method for processing network-on-chip data is provided, which includes the following steps:
step 302, accessing the storage device through the first computing device to obtain first operation data.
Wherein the computing device comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in the first computing device acquires the first operation data and the computing instruction from the storage device.
Step 304, the first calculation device calculates the first calculation data to obtain a first calculation result.
And the first operation data read from the storage device is operated in the first computing device according to the corresponding computing instruction to obtain a first operation result.
Step 306, sending the first operation result to a second computing device.
The first computing device sends the first operation result to the second computing device via its controller unit, through a communication channel established between the first computing device and the second computing device.
Step 308, accessing the storage device through the second computing device to obtain second operation data.
Wherein the second computing device comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in the second calculation device acquires the second operation data and the calculation instruction from the storage device.
Step 310, the second calculation device calculates the second calculation data and the first calculation result to obtain a second calculation result.
The second operation data read from the storage device and the first operation result received from the first computing device are operated on in the second computing device according to the corresponding calculation instruction to obtain a second operation result.
Further, the network-on-chip data processing method provided by this embodiment may be applied to any one of the network-on-chip processing systems shown in fig. 1 to 5.
According to the network-on-chip data processing method, the first operation result in the first computing device is sent to the second computing device, and the second computing device performs operation again by using the first operation result, so that the multiplexing of operation data can be realized.
In one embodiment, the network-on-chip data processing method shown in fig. 26 is applied to the network-on-chip processing system 1900 shown in fig. 9, wherein each of the computing devices 1902 to 1905 is connected to the storage device 1901 in the network-on-chip processing module, and any two of the computing devices 1902 to 1905 are directly connected to each other.
For example, consider computing the matrix multiplication C = A × B, with matrices
A = [[a00, a01], [a10, a11]] and B = [[b00, b01], [b10, b11]],
so that the result matrix C = [[c00, c01], [c10, c11]] is given by:
c00 = a00*b00 + a01*b10
c01 = a00*b01 + a01*b11
c10 = a10*b00 + a11*b10
c11 = a10*b01 + a11*b11
First, time is divided into three time periods.
Then, during a first period of time, the computing devices 1902 through 1905 simultaneously access the storage devices 1901 in the network-on-chip processing module in which they reside.
Specifically, the computing device 1902 reads the first operation data a00 and b00 from the storage device 1901; the computing device 1903 reads the first operation data a01 and b11 from the storage device 1901; the computing device 1904 reads the first operation data a11 and b10 from the storage device 1901; the computing device 1905 reads the first operation data a10 and b01 from the storage device 1901.
Further, the computing device 1902 operates on the read first operation data a00 and b00 to obtain a first operation result a00*b00; the computing device 1903 operates on the read first operation data a01 and b11 to obtain a first operation result a01*b11; the computing device 1904 operates on the read first operation data a11 and b10 to obtain a first operation result a11*b10; the computing device 1905 operates on the read first operation data a10 and b01 to obtain a first operation result a10*b01.
Then, in a second time period, the computing device 1902 reads the first operation data a01 from the computing device 1903 and the first operation data b10 from the computing device 1904, and operates on them to obtain a second operation result a01*b10; the computing device 1903 reads the first operation data a00 from the computing device 1902 and the first operation data b01 from the computing device 1905, and operates on them to obtain a second operation result a00*b01; the computing device 1904 reads the first operation data a10 from the computing device 1905 and the first operation data b00 from the computing device 1902, and operates on them to obtain a second operation result a10*b00; the computing device 1905 reads the first operation data a11 from the computing device 1904 and the first operation data b11 from the computing device 1903, and operates on them to obtain a second operation result a11*b11.
Then, in a third time period, the computing device 1902 operates on the first operation result a00*b00 and the second operation result a01*b10 to obtain a third operation result c00 = a00*b00 + a01*b10, and sends the third operation result c00 to the storage device 1901; the computing device 1903 operates on the first operation result a01*b11 and the second operation result a00*b01 to obtain a third operation result c01 = a00*b01 + a01*b11, and sends the third operation result c01 to the storage device 1901; the computing device 1904 operates on the first operation result a11*b10 and the second operation result a10*b00 to obtain a third operation result c10 = a10*b00 + a11*b10, and sends the third operation result c10 to the storage device 1901; the computing device 1905 operates on the first operation result a10*b01 and the second operation result a11*b11 to obtain a third operation result c11 = a10*b01 + a11*b11, and sends the third operation result c11 to the storage device 1901.
In one embodiment, as shown in fig. 27, a method for processing network-on-chip data is provided, which includes the following steps:
step 402, accessing a storage device through a first computing device group to obtain first operation data, wherein the first computing device group comprises a plurality of first computing devices.
Wherein each first computing device of the first computing device group cluster1 comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in cluster1 retrieves the first operation data and the calculation instruction from the storage device.
Optionally, a plurality of first computing devices in cluster1 access the storage device simultaneously, each first computing device reads part of the data required by cluster1 from the storage device, and the data is then transferred within cluster1. Alternatively, only one or more designated first computing devices in cluster1 may access the storage device, with the remaining first computing devices capable only of intra-group communication.
Step 404, operating the plurality of first operation data by the first computing device group to obtain a first operation result.
The plurality of first operation data are operated and forwarded among the plurality of first computing devices according to the corresponding computing instructions to obtain first operation results.
Step 406, sending the first operation result to a second computing device group.
Wherein cluster1 sends the first operation result, via its controller unit, to the second computing device group cluster2 through the communication channel established between cluster1 and cluster2.
Optionally, the first operation result may be sent to cluster2, or may be sent to the storage device. Optionally, the first operation result is sent to cluster2 by any first computing device in cluster1 that has established a communication channel with cluster2. Optionally, cluster1 may send the first operation result to any second computing device in cluster2 that has established a communication channel with cluster1.
Further, the network-on-chip data processing method provided by this embodiment may be applied to any one of the network-on-chip processing systems shown in fig. 6 to 8.
According to the network-on-chip data processing method, not only can the intra-group communication be realized among a plurality of computing device groups, but also the inter-group data transmission can be realized.
In one embodiment, as shown in fig. 28, a method for processing network-on-chip data is provided, which includes the following steps:
step 502, accessing a storage device through a first computing device group to obtain first operation data, wherein the first computing device group comprises a plurality of first computing devices.
Wherein each first computing device of the first computing device group cluster1 comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in cluster1 retrieves the first operation data and the calculation instruction from the storage device.
Optionally, a plurality of first computing devices in cluster1 access the storage device simultaneously, each first computing device reads part of the data required by cluster1 from the storage device, and the data is then transferred within cluster1. Alternatively, only one or more designated first computing devices in cluster1 may access the storage device, with the remaining first computing devices capable only of intra-group communication.
Step 504, operating the plurality of first operation data by the first computing device group to obtain a first operation result.
The plurality of first operation data are operated and forwarded among the plurality of first computing devices according to the corresponding computing instructions to obtain first operation results.
Step 506, sending the first operation result to a second computing device group.
Wherein cluster1 sends the first operation result, via its controller unit, to the second computing device group cluster2 through the communication channel established between cluster1 and cluster2.
Optionally, the first operation result is sent to cluster2 by any first computing device in cluster1 that has established a communication channel with cluster2. Optionally, cluster1 may send the first operation result to any second computing device in cluster2 that has established a communication channel with cluster1.
Step 508, accessing the storage device through the second computing device group to obtain second operation data, wherein the second computing device group includes a plurality of second computing devices.
Wherein each second computing device in cluster2 comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in cluster2 acquires the second operation data and the calculation instruction from the storage device.
Optionally, a plurality of second computing devices in cluster2 access the storage device simultaneously, each second computing device reads part of the data required by cluster2 from the storage device, and the data is then transferred within cluster2. Alternatively, only one or more designated second computing devices in cluster2 may access the storage device, with the remaining second computing devices capable only of intra-group communication.
And 510, operating the second operation data and the first operation result through the second computing device group to obtain a second operation result.
And the second operation data read from the storage device and the first operation result received from the first computing device group are operated and forwarded among the plurality of second computing devices according to the corresponding computing instruction to obtain a second operation result.
Further, the network-on-chip data processing method provided by this embodiment may be applied to any one of the network-on-chip processing systems shown in fig. 6 to 8.
According to the network-on-chip data processing method, the first operation result in the first computing device group is sent to the second computing device group, and the second computing device group performs re-operation by using the first operation result, so that the multiplexing of operation data can be realized.
In one embodiment, as shown in fig. 29, a method for processing network-on-chip data is provided, which includes the following steps:
step 602, a first network-on-chip processing module is used to obtain first operation data, where the first network-on-chip processing module includes a first storage device and a plurality of first computing devices, and the first operation data is stored in the first storage device.
Wherein each first computing device in the first network-on-chip processing module comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, a controller unit in the first network-on-chip processing module obtains first operation data and a calculation instruction from a first storage device.
Optionally, a plurality of first computing devices in the first network-on-chip processing module access the first storage device at the same time, each first computing device reads a part of data required by the first network-on-chip processing module from the first storage device, and the data is transmitted in the first network-on-chip processing module.
Optionally, one or more first computing devices in the first network-on-chip processing module are assigned to have access to the first storage device, and the remaining first computing devices are only capable of performing intra-group communication. Specifically, the operation data to be processed by the first network-on-chip processing module is stored in the first storage device.
Step 604, computing the first operation data through a plurality of first computing devices in the first network-on-chip processing module to obtain a first operation result.
The plurality of first operation data are operated and forwarded among the plurality of first computing devices according to the corresponding computing instructions to obtain first operation results.
Step 606, sending the first operation result to a second network-on-chip processing module.
The first network-on-chip processing module sends the first operation result, via its controller unit, to the second network-on-chip processing module through the communication channel established between the two modules.
Optionally, the first operation result may be sent to the second network-on-chip processing module, or may be sent to the first storage device. Optionally, the first operation result is sent to the second network-on-chip processing module by any first computing device in the first network-on-chip processing module that has established a communication channel with the second network-on-chip processing module. Optionally, the first network-on-chip processing module may send the first operation result to any second computing device in the second network-on-chip processing module that has established a communication channel with the first network-on-chip processing module.
Further, the network-on-chip data processing method provided by this embodiment may be applied to any one of the network-on-chip processing systems shown in fig. 9 to 12.
According to the network-on-chip data processing method, intra-module communication and inter-module data transmission can be realized among the network-on-chip processing modules, the method can reasonably utilize operation data and intermediate operation results, and the data processing efficiency is improved.
In one embodiment, as shown in fig. 30, a method for processing network-on-chip data is provided, which includes the following steps:
step 702, a first operation data is obtained through a first network-on-chip processing module, where the first network-on-chip processing module includes a first storage device and a plurality of first computing devices, and the first operation data is stored in the first storage device.
Wherein each first computing device in the first network-on-chip processing module comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, a controller unit in the first network-on-chip processing module obtains first operation data and a calculation instruction from a first storage device.
Optionally, a plurality of first computing devices in the first network-on-chip processing module access the first storage device at the same time, each first computing device reads a part of data required by the first network-on-chip processing module from the first storage device, and the data is transmitted in the first network-on-chip processing module.
Optionally, one or more first computing devices in the first network-on-chip processing module are assigned to have access to the first storage device, and the remaining first computing devices are only capable of performing intra-group communication. Specifically, the operation data to be processed by the first network-on-chip processing module is stored in the first storage device.
Step 704, computing the first operation data through a plurality of first computing devices in the first network-on-chip processing module to obtain a first operation result.
The plurality of first operation data are operated and forwarded among the plurality of first computing devices according to the corresponding computing instructions to obtain first operation results.
Step 706, sending the first operation result to a second network-on-chip processing module.
The first network-on-chip processing module sends the first operation result, via its controller unit, to the second network-on-chip processing module through the communication channel established between the two modules.
Optionally, the first operation result is sent to the second network-on-chip processing module by any first computing device in the first network-on-chip processing module that has established a communication channel with the second network-on-chip processing module. Optionally, the first network-on-chip processing module may send the first operation result to any second computing device in the second network-on-chip processing module that has established a communication channel with the first network-on-chip processing module.
Step 708, obtaining second operation data by the second network-on-chip processing module, where the second network-on-chip processing module includes a second storage device and a plurality of second computing devices, and the second operation data is stored in the second storage device.
Wherein each second computing device in the second network-on-chip processing module comprises: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits. Specifically, the controller unit in the second network-on-chip processing module acquires the second operation data and the calculation instruction from the second storage device.
Optionally, a plurality of second computing devices in the second network-on-chip processing module access the second storage device at the same time, each second computing device reads a part of data required by the second network-on-chip processing module from the second storage device, and the data is transmitted in the second network-on-chip processing module.
Optionally, one or more second computing devices in the second network-on-chip processing module are assigned to have access to the second storage device, and the remaining second computing devices are only capable of performing intra-group communication. Specifically, the operation data required to be processed by the second network-on-chip processing module is stored in the second storage device.
Step 710, computing the second operation data and the first operation result through a plurality of second computing devices in the second network-on-chip processing module to obtain a second operation result.
Wherein step 710 specifically comprises the following steps:
step 7102, computing the second operation data and the first operation result among the plurality of second computing devices to obtain the second operation result.
Specifically, each second computing device may perform an operation on the second operation data and the first operation result according to the corresponding computing instruction to obtain a plurality of intermediate results, and then perform an operation on the plurality of intermediate results according to the corresponding computing instruction to obtain a second operation result.
Step 7104, storing the second operation result to the second storage device.
Further, the network-on-chip data processing method provided by this embodiment may be applied to any one of the network-on-chip processing systems shown in fig. 9 to 12.
According to the network-on-chip data processing method, the first operation result in the first network-on-chip processing system is sent to the second network-on-chip processing system, and the second network-on-chip processing system performs operation again by using the first operation result, so that the multiplexing of operation data can be realized.
The network-on-chip processing method in the embodiments of the present application can be used for machine learning calculation, and specifically for artificial neural network operation. The operation data in the network-on-chip processing system may specifically include input neuron data and weight data, and the operation result in the network-on-chip processing system may specifically be output neuron data, i.e., the result of the artificial neural network operation.
In the forward operation, after the execution of the artificial neural network of the upper layer is completed, the operation instruction of the next layer takes the output neuron calculated in the operation unit as the input neuron of the next layer to perform operation (or performs some operation on the output neuron and then takes the output neuron as the input neuron of the next layer), and at the same time, the weight value is replaced by the weight value of the next layer; in the reverse operation, after the reverse operation of the artificial neural network of the previous layer is completed, the operation instruction of the next layer takes the input neuron gradient calculated in the operation unit as the output neuron gradient of the next layer to perform operation (or performs some operation on the input neuron gradient and then takes the input neuron gradient as the output neuron gradient of the next layer), and at the same time, the weight value is replaced by the weight value of the next layer.
The above-described machine learning calculations may also include support vector machine operations, k-nearest neighbor (k-nn) operations, k-means (k-means) operations, principal component analysis operations, and the like. For convenience of description, the following describes a specific scheme of machine learning calculation by taking an artificial neural network operation as an example.
For the artificial neural network operation, if the artificial neural network operation has multiple layers of operation, the input neurons and output neurons of the multilayer operation do not refer to the neurons in the input layer and the output layer of the whole neural network. Rather, for any two adjacent layers in the network, the neurons in the lower layer of the forward operation are the input neurons, and the neurons in the upper layer of the forward operation are the output neurons. Taking a convolutional neural network as an example, let the convolutional neural network have L layers, and let K = 1, 2, ..., L-1. For the K-th layer and the (K+1)-th layer, the K-th layer is referred to as the input layer, in which the neurons are the input neurons, and the (K+1)-th layer is referred to as the output layer, in which the neurons are the output neurons. That is, each layer except the topmost layer can serve as an input layer, and the next layer is the corresponding output layer.
In an alternative embodiment, taking the fully-connected operation in the neural network operation as an example, the process may be y = f(wx + b), where x is the input neuron matrix, w is the weight matrix, b is the bias scalar, and f is the activation function, which may specifically be a sigmoid, tanh, relu, or softmax function. Assuming a binary tree structure with 8 slave processing circuits, the implementation method may be:
the controller unit acquires an input neuron matrix x, a weight matrix w and a full-connection operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the full-connection operation instruction to the main processing circuit;
the main processing circuit determines the input neuron matrix x as broadcast data, determines the weight matrix w as distribution data, divides the weight matrix w into 8 sub-matrixes, then distributes the 8 sub-matrixes to 8 slave processing circuits through a tree module, broadcasts the input neuron matrix x to the 8 slave processing circuits,
the slave processing circuit executes multiplication and accumulation operation of the 8 sub-matrixes and the input neuron matrix x in parallel to obtain 8 intermediate results, and the 8 intermediate results are sent to the master processing circuit;
and the master processing circuit arranges the 8 intermediate results in order to obtain the wx operation result, performs the bias b operation on this result, performs the activation operation to obtain the final result y, and sends the final result y to the controller unit, which outputs it or stores it in the storage unit.
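The master/slave split described above can be sketched in software. This is a hedged model only: `fc_with_8_slaves` and its block-partitioning scheme are illustrative assumptions (a row-wise split of w, relu as f), while the real device performs these steps in parallel hardware.

```python
# Sketch of the fully-connected flow: the master splits the weight matrix
# w into 8 row blocks for 8 slave circuits, broadcasts x to all of them,
# collects the partial products, concatenates them into wx, adds bias b,
# and applies the activation (relu here).
def fc_with_8_slaves(w, x, b):
    n_slaves = 8
    rows_per_slave = max(1, len(w) // n_slaves)
    # distribute: slave i receives a contiguous block of rows of w
    blocks = [w[i:i + rows_per_slave]
              for i in range(0, len(w), rows_per_slave)]

    # each slave multiplies its block by the broadcast x
    # (done in parallel on chip; sequentially in this sketch)
    intermediates = [[sum(wi * xi for wi, xi in zip(row, x)) for row in block]
                     for block in blocks]

    # master: arrange the intermediates into wx, add bias, activate
    wx = [v for part in intermediates for v in part]
    return [max(0.0, v + b) for v in wx]  # relu(wx + b)

w = [[1.0] * 4 for _ in range(8)]  # 8x4 weights: one row per slave
x = [1.0, 2.0, 3.0, 4.0]
print(fc_with_8_slaves(w, x, 0.5))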
The method for executing the neural network forward operation instruction by the computing device shown in fig. 1 may specifically be:
the controller unit extracts the neural network forward operation instruction, the operation domain corresponding to the neural network operation instruction and at least one operation code from the instruction storage unit, transmits the operation domain to the data access unit, and sends the at least one operation code to the operation unit.
The controller unit extracts the weight w and the offset b corresponding to the operation domain from the storage unit (when b is 0, the offset b does not need to be extracted), transmits the weight w and the offset b to the main processing circuit of the arithmetic unit, extracts the input data Xi from the storage unit, and transmits the input data Xi to the main processing circuit.
The main processing circuit determines multiplication operation according to the at least one operation code, determines input data Xi as broadcast data, determines weight data as distribution data, and splits the weight w into n data blocks;
the instruction processing unit of the controller unit determines a multiplication instruction, a bias instruction, and an accumulation instruction according to the at least one operation code, and sends them to the master processing circuit. The master processing circuit sends the multiplication instruction and the input data Xi to the plurality of slave processing circuits in a broadcast manner, and distributes the n data blocks to the plurality of slave processing circuits (for example, if there are n slave processing circuits, each slave processing circuit receives one data block). The plurality of slave processing circuits multiply the input data Xi and the received data blocks according to the multiplication instruction to obtain intermediate results and send these intermediate results to the master processing circuit. The master processing circuit accumulates the intermediate results sent by the plurality of slave processing circuits according to the accumulation instruction to obtain an accumulation result, applies the bias b to the accumulation result according to the bias instruction to obtain the final result, and sends the final result to the controller unit.
In addition, the order of addition and multiplication may be reversed.
According to the above technical solution, the multiplication and bias operations of the neural network are achieved through a single instruction, namely the neural network operation instruction, so the intermediate results of the neural network calculation need not be stored or fetched from off-chip memory. This reduces the storing and fetching of intermediate data, and therefore has the advantages of reducing the corresponding operation steps and improving the computational efficiency of the neural network.
It should be understood that although the various steps in the flow charts of fig. 25-30 are shown in an order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not performed in a strict order and may be performed in other orders. Moreover, at least some of the steps in fig. 25-30 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; the order of performance of these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of the logic functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by hardware associated with instructions of a program, which may be stored in a computer readable memory, the memory may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the methods and their core ideas of the present application; meanwhile, for a person skilled in the art, based on the idea of the present application, the detailed description and the application scope may be changed, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (20)

1. A network-on-chip data processing method is applied to a network-on-chip processing system, wherein the network-on-chip processing system is used for executing machine learning calculation, and the network-on-chip processing system comprises: a storage device and a computing device; the method comprises the following steps:
accessing a storage device in the network-on-chip processing system through a first computing device in the network-on-chip processing system to acquire first operation data;
performing, by the first computing device, an operation on the first operation data to obtain a first operation result;
and sending the first operation result to a second computing device in the network-on-chip processing system.
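As an informal illustration only (not part of the claims), the three steps of claim 1, that is, fetching operation data from the shared storage device, operating on it, and sending the result to a second computing device, can be sketched in Python. All class and method names below are hypothetical, and the placeholder operation stands in for the machine learning computation the claims describe:

```python
class ComputingDevice:
    """Hypothetical model of one computing device in a network-on-chip system."""

    def __init__(self, storage):
        self.storage = storage  # shared storage device in the processing system
        self.inbox = []         # operation results received from other devices

    def fetch(self, key):
        # Step 1: access the storage device to acquire the first operation data
        return self.storage[key]

    def operate(self, data):
        # Step 2: placeholder operation; a real device runs neural-network ops
        return sum(data)

    def send(self, result, other):
        # Step 3: send the first operation result to a second computing device
        other.inbox.append(result)


storage = {"first_operation_data": [1.0, 2.0, 3.0]}
first, second = ComputingDevice(storage), ComputingDevice(storage)

result = first.operate(first.fetch("first_operation_data"))
first.send(result, second)
# second.inbox now holds the first operation result: [6.0]
```

In this sketch, the second device could continue the pipeline by combining the received result with its own operation data, as claims 17 and 18 describe.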
2. The method of claim 1, wherein the computing device comprises: an arithmetic unit and a controller unit;
the accessing, by a first computing device in the network-on-chip processing system, a storage device in the network-on-chip processing system to obtain first operation data includes:
the controller unit in the first computing device obtains the first operation data and the computing instruction from the storage device.
3. The method of claim 2, wherein the arithmetic unit comprises: a master processing circuit and a plurality of slave processing circuits;
the performing, by the first computing device, an operation on the first operation data to obtain a first operation result includes:
parsing, by the controller unit in the first computing device, the calculation instruction to obtain a plurality of operation instructions, and sending, by the controller unit in the first computing device, the plurality of operation instructions and the first operation data to the master processing circuit in the first computing device;
performing, by the master processing circuit in the first computing device, preamble processing on the first operation data, and exchanging data and operation instructions with the plurality of slave processing circuits in the first computing device;
the plurality of slave processing circuits in the first computing device execute intermediate operations in parallel according to the operation data and the operation instruction transmitted from the master processing circuit in the first computing device to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit in the first computing device;
and the master processing circuit in the first computing device performs subsequent processing on the plurality of intermediate results to obtain a first operation result of the calculation instruction.
4. The method of claim 3, wherein sending the first operation result to a second computing device in the network-on-chip processing system comprises:
and the controller unit in the first computing device sends the first operation result to a second computing device in the network-on-chip processing system.
5. The method of claim 3, wherein the machine learning computation comprises: artificial neural network operations, the first operational data comprising: inputting neuron data and weight data; the first operation result is output neuron data.
6. The method of claim 5, wherein the computing device further comprises: a storage unit and a direct memory access unit, the storage unit comprising: any combination of a register and a cache;
the cache is used for storing the first operation data;
the register is used for storing a scalar in the first operation data.
7. The method of claim 5, wherein the controller unit comprises: the device comprises an instruction storage unit, an instruction processing unit and a storage queue unit;
the instruction storage unit stores a calculation instruction associated with the artificial neural network operation;
the instruction processing unit analyzes the calculation instruction to obtain a plurality of operation instructions;
the storage queue unit stores an instruction queue, the instruction queue comprising: a plurality of operation instructions and/or calculation instructions to be executed in the sequential order of the queue.
8. The method of claim 7, wherein the main processing circuit comprises: a dependency processing unit;
the dependency relationship processing unit determines whether a first operation instruction has an association relationship with a zeroth operation instruction preceding the first operation instruction; if the first operation instruction and the zeroth operation instruction have the association relationship, the first operation instruction is cached in the instruction storage unit, and after the zeroth operation instruction has been executed, the first operation instruction is extracted from the instruction storage unit and transmitted to the operation unit;
the determining whether the first operation instruction has an association relationship with a zeroth operation instruction before the first operation instruction comprises:
extracting, according to the first operation instruction, a first storage address interval of the data required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the data required by the zeroth operation instruction; if the first storage address interval overlaps the zeroth storage address interval, determining that the first operation instruction and the zeroth operation instruction have an association relationship; and if the first storage address interval does not overlap the zeroth storage address interval, determining that the first operation instruction and the zeroth operation instruction have no association relationship.
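The overlap test in claim 8 reduces to comparing two address intervals. A minimal sketch of this check follows; the function name is illustrative, and the interval endpoints are assumed inclusive:

```python
def has_dependency(first_interval, zeroth_interval):
    """Return True if the two storage address intervals overlap, i.e. the
    first operation instruction has an association relationship with the
    zeroth operation instruction and must wait for it to finish."""
    f_start, f_end = first_interval
    z_start, z_end = zeroth_interval
    # Two inclusive intervals overlap exactly when each starts
    # no later than the other ends.
    return f_start <= z_end and z_start <= f_end


# Overlapping address intervals: association relationship, defer execution
print(has_dependency((0x100, 0x1FF), (0x180, 0x27F)))  # True
# Disjoint address intervals: no association relationship, may issue at once
print(has_dependency((0x100, 0x1FF), (0x200, 0x2FF)))  # False
```

This is the classic read-after-write hazard check; in hardware it would be done with comparators on the instructions' address fields rather than in software.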
9. The method of claim 2, wherein the arithmetic unit comprises: a tree module, the tree module comprising a root port and a plurality of branch ports; the root port of the tree module is connected to the main processing circuit, and each branch port of the tree module is connected to one of the plurality of slave processing circuits;
and the tree module forwards data blocks, weights, and operation instructions between the main processing circuit and the plurality of slave processing circuits.
10. The method of claim 5, wherein the arithmetic unit further comprises one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit;
the main processing circuit determines the input neurons to be broadcast data and the weights to be distribution data, divides the distribution data into a plurality of data blocks, and sends at least one of the plurality of data blocks, the broadcast data, and at least one of the plurality of operation instructions to the branch processing circuits;
the branch processing circuits forward the data blocks, the broadcast data, and the operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits perform operations on the received data blocks and broadcast data according to the operation instruction to obtain intermediate results, and transmit the intermediate results to the branch processing circuits;
and the main processing circuit carries out subsequent processing on the intermediate result sent by the branch processing circuit to obtain a first operation result of the calculation instruction, and sends the first operation result of the calculation instruction to the controller unit.
11. The method of claim 5, wherein the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, the master processing circuit is connected to k slave processing circuits among the plurality of slave processing circuits, and the k slave processing circuits are: the n slave processing circuits in row 1, the n slave processing circuits in row m, and the m slave processing circuits in column 1;
the k slave processing circuits forward data and instructions between the master processing circuit and the plurality of slave processing circuits;
the master processing circuit determines the input neurons to be broadcast data and the weights to be distribution data, divides the distribution data into a plurality of data blocks, and sends at least one of the plurality of data blocks and at least one of the plurality of operation instructions to the k slave processing circuits;
the k slave processing circuits convert data between the master processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits perform operations on the received data blocks according to the operation instruction to obtain intermediate results, and transmit the intermediate results to the k slave processing circuits;
and the master processing circuit performs subsequent processing on the intermediate results sent by the k slave processing circuits to obtain a first operation result of the calculation instruction, and sends the first operation result of the calculation instruction to the controller unit.
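Claims 10 and 11 describe a scatter/broadcast/gather pattern: the master broadcasts the input neurons, distributes weight blocks among the slave processing circuits, and then combines and orders their intermediate results. A schematic software emulation is given below; the round-robin row partitioning and the multiply-accumulate performed per slave are simplifying assumptions, not the claimed hardware behavior:

```python
def master_slave_matvec(weights, neurons, num_slaves):
    """Emulate the claimed pattern for a matrix-vector product: the weight
    rows are the distribution data split into blocks, the input neurons are
    the broadcast data, each slave computes intermediate results, and the
    master gathers, sorts, and combines them."""
    # Master: divide the distribution data (weight rows) into blocks,
    # assigning rows round-robin to the slave processing circuits.
    blocks = [weights[i::num_slaves] for i in range(num_slaves)]
    rows = [list(range(i, len(weights), num_slaves)) for i in range(num_slaves)]

    # Slaves: multiply each received row by the broadcast neurons, accumulate.
    intermediates = {}
    for block, idx in zip(blocks, rows):
        for row, r in zip(block, idx):
            intermediates[r] = sum(w * x for w, x in zip(row, neurons))

    # Master: subsequent processing = combine and sort the intermediate
    # results into the first operation result (the output neuron vector).
    return [intermediates[r] for r in sorted(intermediates)]


w = [[1, 0], [0, 1], [1, 1]]
print(master_slave_matvec(w, [2.0, 3.0], num_slaves=2))  # [2.0, 3.0, 5.0]
```

The "combine and sort" step in claim 12 corresponds to the final gather: because slaves finish out of order, the master must reorder intermediate results by output-neuron index before any activation processing.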
12. The method according to any one of claims 9 to 11,
the main processing circuit combines and sorts the intermediate results sent by the plurality of slave processing circuits to obtain a first operation result of the calculation instruction;
or the main processing circuit combines, sorts, and activates the intermediate results sent by the plurality of slave processing circuits to obtain a first operation result of the calculation instruction.
13. The method of any of claims 9-11, wherein the main processing circuit comprises: one or any combination of a conversion processing circuit, an activation processing circuit and an addition processing circuit;
the conversion processing circuit performs the preamble processing on the first operation data, specifically: performing an interchange between a first data structure and a second data structure on the data or intermediate results received by the main processing circuit; or performing an interchange between a first data type and a second data type on the data or intermediate results received by the main processing circuit;
the activation processing circuit performs the subsequent processing, specifically an activation operation on the data in the main processing circuit;
the addition processing circuit performs the subsequent processing, specifically an addition operation or an accumulation operation.
14. The method of claim 10 or 11, wherein the slave processing circuit comprises: a multiplication processing circuit;
the multiplication processing circuit performs multiplication operation on the received data block to obtain a product result.
15. The method of claim 14, wherein the slave processing circuit further comprises an accumulation processing circuit; the accumulation processing circuit performs an accumulation operation on the product result to obtain the intermediate result.
16. The method of claim 9, wherein the tree module is an n-way tree structure, and wherein n is an integer greater than or equal to 2.
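For claims 9 and 16, an n-ary tree module with fan-out n reaches up to n^d slave processing circuits at depth d, so the forwarding latency between the master and any slave grows logarithmically with the number of slaves. A small sketch of this relationship follows (the function is illustrative, not part of the claimed apparatus):

```python
def tree_depth(num_slaves, n):
    """Number of tree levels an n-ary tree module needs so that its leaves
    can reach num_slaves slave processing circuits (n >= 2 per claim 16).
    Uses integer arithmetic to avoid floating-point logarithm error."""
    if n < 2:
        raise ValueError("claim 16 requires the tree to be n-ary with n >= 2")
    depth, leaves = 0, 1
    while leaves < num_slaves:
        leaves *= n   # each extra level multiplies the reachable leaves by n
        depth += 1
    return depth


print(tree_depth(8, 2))  # 3: a binary tree module reaches 8 slaves in 3 levels
print(tree_depth(9, 3))  # 2: a ternary tree module reaches 9 slaves in 2 levels
```

Every data block, weight, or operation instruction forwarded between the main processing circuit (root port) and a slave (branch port) therefore traverses at most this many tree nodes.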
17. The method of claim 1, further comprising: and accessing a storage device in the network-on-chip processing system through a second computing device in the network-on-chip processing system to acquire second operation data.
18. The method of claim 17, further comprising: and operating the second operation data and the first operation result through a second computing device in the network-on-chip processing system to obtain a second operation result.
19. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 18.
20. A network-on-chip data processing apparatus for performing machine learning computations, comprising:
a first operation data obtaining module, configured to access a storage device in the network-on-chip processing system through a first computing device in the network-on-chip processing system, and obtain first operation data;
an operation module, configured to perform an operation on the first operation data through the first computing device to obtain a first operation result;
and a first operation result sending module, configured to send the first operation result to a second computing device in the network-on-chip processing system.
CN201811216857.1A 2018-10-18 2018-10-18 Network-on-chip data processing method, storage medium, computer device and apparatus Active CN111079908B (en)

Priority Applications (25)

Application Number Priority Date Filing Date Title
CN201811216857.1A CN111079908B (en) 2018-10-18 2018-10-18 Network-on-chip data processing method, storage medium, computer device and apparatus
JP2020569113A JP7060720B2 (en) 2018-10-18 2019-10-18 Network-on-chip data processing methods and equipment
KR1020207034126A KR102539571B1 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
EP21217802.4A EP4009185A1 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
KR1020207033053A KR20200139829A (en) 2018-10-18 2019-10-18 Network on-chip data processing method and device
EP21217804.0A EP4009186A1 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
EP21217809.9A EP4009183A1 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
EP21217811.5A EP4009184A1 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
EP19873122.6A EP3869352A4 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
PCT/CN2019/111977 WO2020078470A1 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
US17/278,812 US20220035762A1 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
JP2020206281A JP7074831B2 (en) 2018-10-18 2020-12-11 Network-on-chip data processing methods and equipment
JP2020206306A JP7074833B2 (en) 2018-10-18 2020-12-11 Network-on-chip data processing methods and equipment
JP2020206272A JP7053775B2 (en) 2018-10-18 2020-12-11 Network-on-chip data processing methods and equipment
JP2020206293A JP7074832B2 (en) 2018-10-18 2020-12-11 Network-on-chip data processing methods and equipment
US17/564,389 US11841816B2 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,560 US20220121603A1 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,398 US11880328B2 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,411 US11809360B2 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,366 US20220156215A1 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,431 US11880329B2 (en) 2018-10-18 2021-12-29 Arbitration based machine learning data processor
US17/564,579 US11960431B2 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,492 US11880330B2 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,529 US11868299B2 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,509 US11797467B2 (en) 2018-10-18 2021-12-29 Data processing device with transmission circuit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811216857.1A CN111079908B (en) 2018-10-18 2018-10-18 Network-on-chip data processing method, storage medium, computer device and apparatus

Publications (2)

Publication Number Publication Date
CN111079908A true CN111079908A (en) 2020-04-28
CN111079908B CN111079908B (en) 2024-02-13

Family

ID=70308031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811216857.1A Active CN111079908B (en) 2018-10-18 2018-10-18 Network-on-chip data processing method, storage medium, computer device and apparatus

Country Status (2)

Country Link
KR (1) KR102539571B1 (en)
CN (1) CN111079908B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112416855A (en) * 2020-11-20 2021-02-26 北京京航计算通讯研究所 Data acquisition processing system on chip based on tree network on chip
CN112686379A (en) * 2020-12-30 2021-04-20 上海寒武纪信息科技有限公司 Integrated circuit device, electronic equipment, board card and calculation method
CN112712456A (en) * 2021-02-23 2021-04-27 中天恒星(上海)科技有限公司 GPU processing circuit structure

Citations (13)

Publication number Priority date Publication date Assignee Title
CN102591759A (en) * 2011-12-29 2012-07-18 中国科学技术大学苏州研究院 Clock precision parallel simulation system for on-chip multi-core processor
US20120303933A1 (en) * 2010-02-01 2012-11-29 Philippe Manet tile-based processor architecture model for high-efficiency embedded homogeneous multicore platforms
CN102868644A (en) * 2012-09-03 2013-01-09 盛科网络(苏州)有限公司 Method and system for dynamically switching switchboard data transmission mode
CN103218208A (en) * 2011-12-06 2013-07-24 辉达公司 System and method for performing shaped memory access operations
CN105183662A (en) * 2015-07-30 2015-12-23 复旦大学 Cache consistency protocol-free distributed sharing on-chip storage framework
US20170083338A1 (en) * 2015-09-19 2017-03-23 Microsoft Technology Licensing, Llc Prefetching associated with predicated load instructions
WO2017185418A1 (en) * 2016-04-29 2017-11-02 北京中科寒武纪科技有限公司 Device and method for performing neural network computation and matrix/vector computation
CN107316078A (en) * 2016-04-27 2017-11-03 北京中科寒武纪科技有限公司 Apparatus and method for performing artificial neural network self study computing
CN107578095A (en) * 2017-09-01 2018-01-12 中国科学院计算技术研究所 Neural computing device and the processor comprising the computing device
CN107992329A (en) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 A kind of computational methods and Related product
CN108431770A (en) * 2015-12-23 2018-08-21 英特尔公司 Hardware aspects associated data structures for accelerating set operation
CN108427990A (en) * 2016-01-20 2018-08-21 北京中科寒武纪科技有限公司 Neural computing system and method
CN108470009A (en) * 2018-03-19 2018-08-31 上海兆芯集成电路有限公司 Processing circuit and its neural network computing method

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
KR101419668B1 (en) * 2006-09-06 2014-07-15 실리콘 하이브 비.브이. Data processing circuit and data processing method
EP4160449A1 (en) * 2016-12-30 2023-04-05 Intel Corporation Deep learning hardware

Patent Citations (13)

Publication number Priority date Publication date Assignee Title
US20120303933A1 (en) * 2010-02-01 2012-11-29 Philippe Manet tile-based processor architecture model for high-efficiency embedded homogeneous multicore platforms
CN103218208A (en) * 2011-12-06 2013-07-24 辉达公司 System and method for performing shaped memory access operations
CN102591759A (en) * 2011-12-29 2012-07-18 中国科学技术大学苏州研究院 Clock precision parallel simulation system for on-chip multi-core processor
CN102868644A (en) * 2012-09-03 2013-01-09 盛科网络(苏州)有限公司 Method and system for dynamically switching switchboard data transmission mode
CN105183662A (en) * 2015-07-30 2015-12-23 复旦大学 Cache consistency protocol-free distributed sharing on-chip storage framework
US20170083338A1 (en) * 2015-09-19 2017-03-23 Microsoft Technology Licensing, Llc Prefetching associated with predicated load instructions
CN108431770A (en) * 2015-12-23 2018-08-21 英特尔公司 Hardware aspects associated data structures for accelerating set operation
CN108427990A (en) * 2016-01-20 2018-08-21 北京中科寒武纪科技有限公司 Neural computing system and method
CN107316078A (en) * 2016-04-27 2017-11-03 北京中科寒武纪科技有限公司 Apparatus and method for performing artificial neural network self study computing
WO2017185418A1 (en) * 2016-04-29 2017-11-02 北京中科寒武纪科技有限公司 Device and method for performing neural network computation and matrix/vector computation
CN107992329A (en) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 A kind of computational methods and Related product
CN107578095A (en) * 2017-09-01 2018-01-12 中国科学院计算技术研究所 Neural computing device and the processor comprising the computing device
CN108470009A (en) * 2018-03-19 2018-08-31 上海兆芯集成电路有限公司 Processing circuit and its neural network computing method

Non-Patent Citations (3)

Title
MASOUMEH EBRAHIMI et al.: "Cluster-based topologies for 3D Networks-on-Chip using advanced inter-layer bus architecture", Journal of Computer and System Sciences *
He Kaicheng: "Research and Implementation of an Embedded System Based on a Stacked Multi-core Processor", China Master's Theses Full-text Database, Information Science and Technology *
Zhang Jiajie et al.: "Design of an Operation Array Based on Extended Registers and Network-on-Chip", Computer Engineering *

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN112416855A (en) * 2020-11-20 2021-02-26 北京京航计算通讯研究所 Data acquisition processing system on chip based on tree network on chip
CN112416855B (en) * 2020-11-20 2021-06-15 北京京航计算通讯研究所 Data acquisition processing system on chip based on tree network on chip
CN112686379A (en) * 2020-12-30 2021-04-20 上海寒武纪信息科技有限公司 Integrated circuit device, electronic equipment, board card and calculation method
CN112686379B (en) * 2020-12-30 2024-03-19 上海寒武纪信息科技有限公司 Integrated circuit device, electronic apparatus, board and computing method
CN112712456A (en) * 2021-02-23 2021-04-27 中天恒星(上海)科技有限公司 GPU processing circuit structure

Also Published As

Publication number Publication date
KR20200138411A (en) 2020-12-09
CN111079908B (en) 2024-02-13
KR102539571B1 (en) 2023-06-01

Similar Documents

Publication Publication Date Title
CN109522052B (en) Computing device and board card
CN109543832B (en) Computing device and board card
CN109740739B (en) Neural network computing device, neural network computing method and related products
US11880328B2 (en) Network-on-chip data processing method and device
CN110163363B (en) Computing device and method
CN111047022A (en) Computing device and related product
CN111079908B (en) Network-on-chip data processing method, storage medium, computer device and apparatus
CN110059809B (en) Computing device and related product
CN111353591A (en) Computing device and related product
CN110059797B (en) Computing device and related product
CN111930681B (en) Computing device and related product
CN109753319B (en) Device for releasing dynamic link library and related product
CN111488976A (en) Neural network computing device, neural network computing method and related products
CN111488963A (en) Neural network computing device and method
CN111368967B (en) Neural network computing device and method
CN111078624B (en) Network-on-chip processing system and network-on-chip data processing method
CN111078623B (en) Network-on-chip processing system and network-on-chip data processing method
CN111078625B (en) Network-on-chip processing system and network-on-chip data processing method
CN109740730B (en) Operation method, device and related product
CN109711538B (en) Operation method, device and related product
CN111047021A (en) Computing device and related product
CN111368990B (en) Neural network computing device and method
CN111368987B (en) Neural network computing device and method
CN111368986B (en) Neural network computing device and method
CN111367567B (en) Neural network computing device and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant