WO2020063940A1

WO2020063940A1 - Computing apparatus and related product

Info

Publication number: WO2020063940A1
Application number: PCT/CN2019/108842
Authority: WO
Inventors: 杜子东; 周诗怡; 刘少礼; 王秉睿; 张尧; 周徐达; 兰慧盈
Original assignee: 上海寒武纪信息科技有限公司
Priority date: 2018-09-29
Filing date: 2019-09-29
Publication date: 2020-04-02

Abstract

A computing apparatus and a related product. The computing apparatus comprises: X groups of neural network chips, each group of neural network chips in the X groups of neural network chips comprise one master chip and at least one slave chip, the master chip is connected to the slave chip, the master chips in the X groups of neural network chips are connected, and the value range of X is integers greater than or equal to 2. The computing apparatus divides multiple groups of neural network chips into the master chips and the slave chips, and then performs data scheduling in the chips according to a connection relationship between the master chips, so as to improve a training speed of the neural network chips and save training duration.

Description

Computing devices and related products

Technical field

The present application relates to the field of information processing technology, and in particular, to a computing device and related products.

Background technique

Artificial neural network (Artificial Neural Network, ANN) has been a research hotspot in the field of artificial intelligence since the 1980s. It abstracts the human brain neuron network from the perspective of information processing, establishes some simple model, and forms different networks according to different connection methods. In engineering and academia, it is often referred to as neural network or neural network. A neural network is a computing model that consists of a large number of nodes (or neurons) connected to each other.

Existing neural network operations are based on CPU (Central Processing Unit) or GPU (English: Graphics Processing Unit) to implement neural network operations. Existing training equipment has slow training speed and takes a long time .

Summary of the Invention

The embodiments of the present application provide a computing device and related products, which can improve the training speed and efficiency of the training device.

According to a first aspect, a computing device is provided. The computing device includes:

A group X neural network chip, each group of the neural network chip in the group X includes a master chip and at least one slave chip, the master chip is connected to the slave chip, and the group X neural network chip The connection between the main chips in the X, the value of X is an integer greater than or equal to 2;

Each neural network chip in the X group of neural network chips is configured to obtain input data and weights, and perform calculations on the weights and input data corresponding to each of the neural network chips to obtain an operation result. It is said that the input data obtained by each neural network chip is different, and the obtained weights are the same;

A first master chip in a first group of neural network chips in the X group of neural network chips, configured to receive an operation result of a slave chip connected to the first master chip;

The first master chip is configured to share the operation result of the first master chip and the received operation result of the slave chip to the master chips in other groups of neural network chips, and receive the master chip in other groups of neural network chips. The calculation result shared by the main chip.

According to a second aspect, a neural network chip is provided. The neural network chip includes: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits;

The controller unit is configured to obtain input data and calculation instructions;

The controller unit is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;

The master processing circuit is configured to perform pre-processing on the input data and transmit data and operation instructions with the plurality of slave processing circuits;

The multiple slave processing circuits are configured to perform multiple intermediate operations in parallel according to data transmitted from the master processing circuit and operation instructions to obtain multiple intermediate results, and transmit the multiple intermediate results to the master processing circuit;

The main processing circuit is configured to perform subsequent processing on the multiple intermediate results to obtain an operation result of the calculation instruction.

According to a third aspect, a combined computing device is provided. The combined computing device includes: M computing devices according to claim 1, the M computing devices according to claim 1 being connected, the M The value is an integer greater than or equal to 2.

According to a fourth aspect, a calculation method for executing a machine learning model is provided, and the calculation method is applied to the calculation device according to the first aspect.

In a fifth aspect, a calculation method for executing a machine learning model is provided, and the calculation method is applied to the combination calculation device according to the third aspect.

According to a sixth aspect, an embodiment of the present application provides a computing device, where the computing device includes multiple computing carriers, an on-chip storage data path control circuit connected to an on-chip cache circuit of each computing carrier in the multiple computing carriers, And an on-chip storage data path connected to the on-chip storage data path control circuit, wherein:

The on-chip storage data path control circuit is configured to receive a data transmission instruction sent by a first on-chip cache circuit of a first computing carrier of the plurality of computing carriers; and decode the data transmission instruction to obtain a transmission Data address and receiving data address;

The on-chip cache circuit data path is configured to obtain target data according to the sending data address and transmit the target data to the receiving data address, where the receiving data address is the second of the plurality of computing carriers. Calculate an address in the carrier's second on-chip cache circuit.

In a seventh aspect, an embodiment of the present application provides a combined processing device, where the combined processing device includes the computing device described in the first aspect, a universal interconnection interface, and other processing devices;

The computing device interacts with the other processing devices to jointly complete a computing operation designated by the user.

In an eighth aspect, an embodiment of the present application provides a system-on-chip including the computing device according to the first aspect or the combined processing device according to the second aspect.

In a ninth aspect, an embodiment of the present application provides a data transmission method, which is applied to a computing device according to the first aspect, and the method includes:

Receiving, via the on-chip storage data path control circuit, a data transmission instruction sent by a first on-chip cache circuit of a first computing carrier of the plurality of computing carriers;

And decoding the data transmission instruction through the on-chip storage data path control circuit to obtain a sending data address and a receiving data address, where the receiving data address is a second of a second computing carrier of the plurality of computing carriers. An address in the on-chip cache circuit;

Acquiring target data according to the sending data address through the on-chip buffer circuit data path, and transmitting the target data to the receiving data address.

In a tenth aspect, an embodiment of the present application provides another computing device, including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and are stored in the memory. Configuration is performed by the processor, and the program includes instructions for some or all of the steps as described in the fourth aspect.

According to an eleventh aspect, an embodiment of the present application provides a computer-readable storage medium. The computer storage medium stores a computer program, where the computer program includes program instructions, and the program instructions, when executed by a processor, cause the computer program to The processor executes the method of the fourth aspect described above.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings in the following description are some embodiments of the present application. Those of ordinary skill in the art can also obtain other drawings according to these drawings without paying creative labor.

FIG. 1-1a is a schematic diagram of a neural network training device according to an embodiment of the present application.

FIG. 1-1b is a schematic diagram of a chip connection structure of a computing device according to an embodiment of the present application.

FIG. 1-1c is a schematic diagram of a chip connection structure of another computing device according to an embodiment of the present application.

FIG. 1-1d is a schematic diagram of a chip connection structure of another computing device according to an embodiment of the present application.

FIG. 1-1e is a schematic structural diagram of a neural network chip according to an embodiment of the present application.

FIG. 1-1f is a schematic diagram of a scheduling strategy for a computing result of a main chip according to an embodiment of the present application.

FIG. 1-1g is a schematic structural diagram of a combined computing device according to an embodiment of the present application.

FIG. 1-2 is a schematic diagram of a combination processing device provided by an embodiment of the present application.

FIG. 1-3 is a structural diagram of another combination processing device provided by an embodiment of the present application.

FIG. 1-3a is a schematic structural diagram of a board card according to an embodiment of the present application.

FIG. 2-1 is a schematic structural diagram of a computing device according to an embodiment of the present application; FIG.

FIG. 2-1a is a schematic structural diagram of a computing unit according to an embodiment of the present application; FIG.

FIG. 2-1b is a schematic structural diagram of a main processing circuit according to an embodiment of the present application; FIG.

FIG. 2-1c is a schematic diagram of data distribution of a computing unit according to an embodiment of the present application; FIG.

FIG. 2-1d is a schematic diagram of data return of a computing unit according to an embodiment of the present application; FIG.

FIG. 2-1e is a schematic structural diagram of an on-chip storage data path control circuit according to an embodiment of the present application;

FIG. 2-1f is a schematic structural diagram of a memory management unit according to an embodiment of the present application; FIG.

2-2 is a schematic flowchart of a data transmission method according to an embodiment of the present application;

FIG. 2-3 is a schematic structural diagram of a combination processing device according to an embodiment of the present application;

FIG. 2-4 is a schematic structural diagram of a board card according to an embodiment of the present application.

detailed description

In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely in combination with the drawings in the embodiments of the present application. Obviously, the described embodiments are only These are part of the embodiments of the present application, but not all the embodiments. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

Reference to "an embodiment" herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are they independent or alternative embodiments that are mutually exclusive with other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.

First introduce the neural network training device involved in this application. As shown in Figure 1-1a, the neural network training device consists of multiple neural network chips, multiple neural network chips perform multiple tasks, or divide a single task into segments based on depth. It learns the characteristics of algorithm to schedule and cooperate to complete the training task. The arrangement and cooperation of multiple neural network chips in the neural network training device are specifically described in the following embodiments.

A training device according to an embodiment of the present application includes: a group X neural network chip, each group of the neural network chip in the group X includes a master chip and at least one slave chip, and the master chip is connected to the slave chip, The connection between the main chips in the X group of neural network chips. The value of X is an integer greater than or equal to 2.

Each neural network chip in the X group of neural network chips is used to obtain input data and weights, and the weights are calculated with the input data corresponding to each neural network chip to obtain the operation result. The input data is different and the obtained weights are the same; the first master chip in the first group of neural network chips in the X group of neural network chips is used to receive the operation result of the slave chip connected to the first master chip; the first master chip It is used to share the operation result of the first master chip and the received operation result of the slave chip with the master chips in other groups of neural network chips, and receive the operation results shared by the master chips in other groups of neural network chips.

Specifically, X can be any integer greater than or equal to 2 such as 2, 3, 5, 8 and so on. In the group X of neural network chips, each group of neural network chips includes a master chip and at least one slave chip, wherein different groups of neural networks The number of slave chips in the chip can be the same or different. For example, when X is 3, a total of 10 slave chips are included, then the master chip in the first two sets of neural network chips can be connected with 3 slave chips, and the last set of neural network chips The master chip is connected to 4 slave chips. Preferably, the slave chips are equally divided and connected to the master chip, so that the master chip receives the operation results of the slave chips and quickly schedules the operation results between the master chips.

Please refer to FIG. 1-1b. FIG. 1-1b is a chip connection structure of a computing device according to an embodiment of the present application. As shown in FIG. 1-1b, X is 4, among which chip 4, chip 8, chip 13, and chip. 10 is the master chip, and 3 slave chips are connected to each master chip. Chips 1 to 16 all get input data and weights, where each chip gets different input data and the weights are the same, so each chip will use the same training model to train different input data . The input data of each chip can be for data corresponding to multiple tasks, or for data sets segmented for the same task. The segmentation of the data set can be completed in other external devices, or in other modules in the computing device. It can be completed in the main chip of a certain group of neural network chips in the computing device.

Since the input data of each chip in the computing device is different and the weights are the same, the obtained operation results are different. After all the chips have completed training to obtain the operation results, the first master chip is used to receive the operation results of the slave chips connected to the first master chip. The first master chip may be the master chip 4, the master chip 8, the master chip 10, and the master chip. Any one of the master chips in chip 13 respectively obtains the operation results of the slave chips connected to itself, and finally all operation results included in the master chip are its own operation results and the operation results of the slave chips connected to it.

After the first master chip obtains the operation results of the slave chip, all the operation results included in it are shared among the X group of master chips. When sharing, the operation results are transmitted cyclically in the same direction, for example, in a clockwise direction. That is: chip 4 → chip 8 → chip 13 → chip 10 → chip 4 or pass in a counterclockwise direction, that is: chip 4 → chip 10 → chip 13 → chip 8 → chip 4. When sharing, all the operation results included in the main chip can be transferred to the next adjacent main chip at one time, or it can be transferred in multiple steps.

It can be seen that this connection structure can improve data training efficiency through multiple chips on the one hand, and on the other hand, can schedule the calculation results of each slave chip through the master chip, so that only the performance of the master chip needs to be improved without the need to improve. From the performance of the chip, the cost is saved.

Optionally, the first master chip is further configured to: transmit all operation results in the first master chip to a slave chip connected to the first master chip.

After the main chip 4, the main chip 8, the main chip 10, and the main chip 13 have shared the transfer, they obtain the calculation results of all the chips, and then each main chip passes the calculation results it contains to the respective connected slave chips, so that each Each slave chip contains the operation results of all chips.

Optionally, the master chip is connected to the slave chip through a tree structure, the tree structure is an n-tree structure, the master chip is a root node of the n-tree structure, and the slave chip is a child node of the n-tree structure. The child node may be One-level child nodes can also be multi-level child nodes.

Specifically, the master chip in the group X neural network chip can be connected to the slave chip through a tree structure, where the master chip is the root node of the tree structure, the slave chip is a child node, and the child node can be a first-level child node or Are multi-level child nodes. When the master chip obtains the operation results of the slave chips, the operation results of each slave chip can be directly obtained, or the operation results of other slave chips can be obtained by the slave chip directly connected to the master chip, and then passed to the master chip.

It can be seen that this connection structure can improve data training efficiency through multiple chips on the one hand, and on the other hand, can schedule the calculation results of each slave chip through the master chip, so that only the performance of the master chip needs to be improved without the need to improve. From the performance of the chip, the cost is saved. The slave chip is connected to the master chip through a tree structure, and the operation results of the slave chip can be integrated before the operation result is sent to the master chip, which reduces the operation pressure of the master chip and further reduces the loss to the master chip.

Please refer to FIG. 1-1c. FIG. 1-1c is another chip connection structure of a computing device provided by an embodiment of the present application. As shown in FIG. 1-1c, X is 4, and among the 4 groups of neural network chips, The master chip is the master chip 31, the master chip 32, the master chip 33, and the master chip 34. Each master chip is connected to the slave chip through a tree structure. For example, the master chip 31 is the root node, and the slave chips connected to it include the chip 311. The chip 312 and the chip 313 are first-level child nodes, and the slave chips connected to the slave chip 311 include a chip 3111, a chip 3112, and a chip 3113, which are second-level child nodes. The other slave chips are also primary child nodes or secondary child nodes.

Alternatively, please refer to FIG. 1-1d. FIG. 1-1d is a schematic diagram of a chip connection structure of another computing device according to an embodiment of the present application. As shown in FIG. It is connected to the slave chip, and the tree structure includes three levels of sub-nodes, and the operation results of the leaf nodes at the lowest level can be directly transferred to the master chip, or can be transferred to the master chip through the integration of the slave chip of the upper-level sub-node.

The neural network computing device involved in the embodiment of the present application includes a neural network chip. Please refer to FIG. 1-1e. FIG. 1-1e is a schematic structural diagram of a neural network chip provided by an embodiment of the present application, as shown in FIG. 1-1e. The neural network chip includes: an arithmetic unit 12 and a controller unit 11; the arithmetic unit 12 includes: a master processing circuit 101 and a plurality of slave processing circuits 102;

The controller unit 11 is configured to obtain input data and calculation instructions. In an optional solution, specifically, the method of obtaining input data and calculation instructions may be obtained through a data input and output unit. The data input and output unit may be one or Multiple data I / O interfaces or I / O pins.

The above calculation instructions include, but are not limited to, forward operation instructions or backward training instructions, or other neural network operation instructions, such as convolution operation instructions. The specific implementation manner of this application does not limit the specific expressions of the above calculation instructions.

The controller unit 11 is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and input data to a main processing circuit; the main processing circuit 101 is configured to perform preprocessing on the input data and perform multiple operations Data and operation instructions are transmitted between two slave processing circuits; multiple slave processing circuits 102 are used to perform intermediate operations in parallel according to the data transmitted from the main processing circuit and the operation instructions to obtain multiple intermediate results, and transmit the multiple intermediate results to A main processing circuit; a main processing circuit 101, configured to perform subsequent processing on a plurality of intermediate results to obtain an operation result of a calculation instruction.

The technical solution provided in this application sets the operation unit into a master-slave structure. For forward operation calculation instructions, the operation unit can split the data according to the forward operation calculation instructions, so that multiple slave processing circuits can The part with a large amount of calculation is performed in parallel, thereby increasing the operation speed, saving operation time, and further reducing power consumption.

Optionally, the aforementioned neural network chip is specifically used for an artificial neural network operation, and the aforementioned input data may specifically include input neuron data and weight data. The above operation result may be specifically: the result of the operation of the artificial neural network is the output neuron data.

The operation in the neural network can be a layer of the neural network. For a multilayer neural network, the implementation process is that in the forward operation, after the execution of the artificial neural network in the previous layer is completed, the operation instructions in the next layer are completed. The output neuron calculated by the arithmetic unit will be used as the input neuron of the next layer (or perform some operations on the output neuron and then be used as the input neuron of the next layer), and the weight will also be replaced. Is the weight of the next layer; in the reverse operation, when the reverse operation of the artificial neural network of the previous layer is completed, the operation instructions of the next layer will use the input neuron gradient calculated in the operation unit as the next layer The output neuron gradient is calculated (or some operation is performed on the input neuron gradient and then used as the output neuron gradient of the next layer), and the weight is replaced with the weight of the next layer.

For an artificial neural network operation, if the artificial neural network operation has a multi-layer operation, the input neuron and output neuron of the multi-layer operation do not refer to the neurons in the input layer and the output layer of the entire neural network, but to the For any two adjacent layers in the network, the neuron in the lower layer of the network forward operation is the input neuron, and the neuron in the upper layer of the network forward operation is the output neuron. Taking a convolutional neural network as an example, suppose a convolutional neural network has L layers, K = 1, 2, ..., L-1, and for the Kth layer and the K + 1th layer, we will use the It is called the input layer, where the neuron is the input neuron, the K + 1th layer is called the output layer, and the neuron is the output neuron. That is, except for the top layer, each layer can be used as the input layer, and the next layer is the corresponding output layer.

Optionally, the aforementioned neural network chip may further include a storage unit 10 and a direct memory access unit 50. The storage unit 10 may include one or any combination of a register 201 and a cache 202. Specifically, the cache is used for storing The calculation instruction; the register is used to store the input data and a scalar; and the cache is a high-speed temporary cache. The direct memory access unit 50 is used to read or store data from the storage unit 10.

Optionally, the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;

An instruction storage unit 110, configured to store calculation instructions associated with the artificial neural network operation;

The instruction processing unit 111 is configured to parse the calculation instruction to obtain multiple operation instructions;

The storage queue unit 113 is configured to store an instruction queue, where the instruction queue includes a plurality of operation instructions or calculation instructions to be executed according to a sequence of the queue.

For example, in an optional technical solution, the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, which is specifically configured to decode instructions into micro instructions. Of course, in another alternative, the slave operation processing circuit may also include another controller unit, and the other controller unit includes a slave instruction processing unit, which is specifically configured to receive and process micro instructions. The above micro-instruction may be an instruction next to the instruction. The micro-instruction may be obtained by splitting or decoding the instruction, and may be further decoded into a control signal of each component, each unit, or each processing circuit.

Optionally, the controller unit 11 may further include:

The dependency relationship processing unit 112 is configured to determine whether there is an association relationship between a first operation instruction and a zeroth operation instruction before the first operation instruction when there are multiple operation instructions, such as the first operation instruction and all If the zeroth operation instruction is related, the first operation instruction is cached in the instruction storage unit, and after the zeroth operation instruction is executed, the first operation instruction is extracted from the instruction storage unit. Transmitted to the arithmetic unit;

The determining whether there is an association between the first operation instruction and a zeroth operation instruction before the first operation instruction includes:

Extracting a first storage address range of required data (such as a matrix) in the first operation instruction according to the first operation instruction, and extracting a zeroth position of the required matrix in the zeroth operation instruction according to the zeroth operation instruction Storage address range, if the first storage address range and the zeroth storage address range have overlapping areas, determining that the first operation instruction and the zeroth operation instruction have an associated relationship, such as the first storage If there is no overlapping area between the address interval and the zeroth storage address interval, it is determined that the first operation instruction and the zeroth operation instruction have no correlation.

In another optional embodiment, when the neural network chip is the main chip, the controller unit 11 further includes a scheduling unit 114 for scheduling the operation results in the main chip.

Specifically, the main chip in each group of neural network chips needs to schedule operation results, so that all the main chips share all the operation results included in each main chip. When scheduling, you need to follow a certain scheduling strategy. First, the operation results of the master neural network chip in the X group of neural network chips can be integrated, including the operation results of the master neural network chip itself and the received operation results of the slave chip to obtain X integrated operation results, and then X Each integration operation result is scheduled in the same direction according to the connection order of the main chip. Each integration operation result is dispatched once, and after X ² dispatches, all the main chips obtain X integration operation results; or after obtaining X integration After the operation results, then the X integrated operation results are scheduled in the same direction according to the connection order of the main chip. After the next main chip receives the operation results transmitted by the previous main chip, it will compare the received operation results with its own operation results. Integrate to form a new calculation result, and then pass it to the next main chip. After 2 * (X-1) scheduling, all the main chips have obtained X integrated operation results; the X main chips can also be The operation results are partially integrated or not integrated, and then multiple partial scheduling is performed between the main chips.

In an optional embodiment, scheduling the operation results in the main chip includes: scheduling the main chip in the X group of neural network chips connected to the same direction to schedule the operation content of 1 / Y + 1, where: The same direction includes a clockwise direction or a counterclockwise direction, and Y is the number of slave chips connected to the master chip in the X-group neural network chip.

Please refer to FIG. 1-1f. FIG. 1-1f is an operation result scheduling strategy between main chips provided in the embodiment of the present application. As shown in FIG. 1-1f, corresponding to FIG. 1-1b, there are 4 groups of nerves. Network chip, the main chip of which is chip 4, chip 8, chip 13 and chip 10, the operation result in main chip 4 includes its own operation result, and the operation results of chip 1, chip 2 and chip 3 received, Correspond to these four calculation results as a1, b1, c1, d1. Correspondingly, the operation result of chip 8 corresponds to four parts a2, b2, c2, and d2. The operation result of chip 13 corresponds to a3, b3. , c3, d3. The operation result of chip 10 corresponds to four parts: a4, b4, c4, and d4. Scheduling is clockwise. For the first scheduling, chip 4 dispatches part a1 to chip 8, chip 8 dispatches part b2 to chip 13, chip 13 dispatches part c3 to chip 10, and chip 13 dispatches part d4 to chip 4. This schedule The process can be performed at the same time or at different times. Each time the operation content of the 1 / Y + 1 part of each main chip is scheduled, after (X-1) ² times of scheduling, all the main chips obtain all the operation results and complete the scheduling. This scheduling method can save the integration time of each chip and improve scheduling efficiency.

In an optional embodiment, the main processing circuit 101 is specifically configured to combine and sort multiple intermediate results sent from the processing circuit 102 to obtain a result of the calculation instruction;

Or the main processing circuit 101 is specifically configured to combine and sort the intermediate results sent by the multiple slave processing circuits 102 and obtain the result of the calculation instruction after the activation processing.

In an optional embodiment, the main processing circuit includes one or any combination of a conversion processing circuit, an activation processing circuit, and an addition processing circuit;

The conversion processing circuit is configured to perform pre-processing on the data. Specifically, the conversion processing circuit is configured to perform an interchange between a first data structure and a second data structure on data received by the main processing circuit or an intermediate result; or The data received by the circuit or the intermediate result performs an interchange between the first data type and the second data type;

The activation processing circuit is configured to perform the subsequent processing, and is specifically to perform an activation operation of data in the main processing circuit;

The addition processing circuit is configured to perform the subsequent processing, and is specifically to perform an addition operation or an accumulation operation.

The slave processing circuit includes: a multiplication processing circuit;

The multiplication processing circuit is configured to perform a multiplication operation on a received data block to obtain a multiplication result.

Optionally, the slave processing circuit further includes: an accumulation processing circuit configured to perform an accumulation operation on the product result to obtain the intermediate result.

The embodiment of the present application also relates to another combined computing device, where the combined computing device includes: M computing devices according to the first embodiment, and the M computing devices according to the first embodiment are connected to each other. The value range of M is an integer greater than or equal to two.

Please refer to FIG. 1-1g. FIG. 1-1g is a schematic structural diagram of a combination computing device provided by an embodiment of the present application. As shown in FIG. 1-1g, the combination computing device includes four calculations as shown in FIG. 1-1b. The four computing devices are connected to each other, and can be bridged through circuits, can be connected by setting a special connection module, and can also be connected by the main chip in the four computing devices. This connection structure can improve data training efficiency through multiple chips' cooperative operation on the one hand, and can schedule the operation results of each slave chip through the master chip, so that only the performance of the master chip needs to be improved without the need to improve the slave chip. Performance and cost savings. And selecting a main chip from multiple sets of main chips to connect with an external main chip reduces the loss of the main chip and improves the service life of the main chip.

In an optional embodiment, the connections between the M computing devices as in the first embodiment include: each of the M computing devices as in the first embodiment, which includes X groups of neural network chips The main chip of a group of neural network chips is used to connect with the main chip of a group of neural network chips in a group X neural network in another computing device.

As shown in Figure 1-1g, each of the four computing devices as in the first embodiment includes four sets of neural network chips, and the main chip of one set of neural network chips is used to communicate with four of the other computing devices. The main chips of a group of neural network chips in the group of neural network chips are connected, for example, the main chip 502, the main chip 507, the main chip 512, and the main chip 510 are connected. When selecting the master chip in one of the X group of neural network chips, it can be randomly selected or selected using a selection strategy, such as selecting the master chip with the most slave chips, or selecting it with other computing devices. Among the 4 groups of neural network chips, the closest physical chip is the closest.

It can be seen that, in the embodiment of the present application, multiple groups of neural network chips are divided into a master chip and a slave chip, and then the master chip obtains the operation results of the slave chips, and schedules the calculation results between different sets of master chips, so that each The master chip of the group contains all the calculation results, and the master chip distributes all the calculation results to the slave chips, which improves the training speed of the neural network chip and saves training time.

This application also discloses a combined processing device, which includes the above-mentioned computing device, a universal interconnection interface, and other processing devices. The computing device interacts with other processing devices to complete a user-specified operation. Figure 1-2 is a schematic diagram of a combined processing device.

Other processing devices include one or more types of processors such as a central processing unit CPU, a graphics processor GPU, and a neural network processor. The number of processors included in other processing devices is not limited. Other processing devices serve as the interface between the computing device and external data and control, including data handling, to complete basic control of the computing device such as start and stop; other processing devices can also cooperate with the computing device to complete computing tasks.

A universal interconnection interface for transmitting data and control instructions between the computing device and other processing devices. The computing device obtains required input data from other processing devices and writes it to a storage device on the computing device chip; it can obtain control instructions from other processing devices and write it to the control cache on the computing device chip; it can also read the computing device's The data in the module is stored and transmitted to other processing devices.

Optionally, the structure is shown in FIG. 1-3, and may further include a storage device, and the storage device is connected to the computing device and the other processing devices, respectively. The storage device is used to store data in the computing device and the other processing devices, and is particularly suitable for data that cannot be completely stored in the internal storage of the computing device or other processing devices.

The combined processing device can be used as an SOC system-on-chip for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control section, increasing processing speed, and reducing overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to some parts of the equipment. Some components such as camera, monitor, mouse, keyboard, network card, wifi interface.

In some embodiments, a chip is also applied, which includes the above computing device or combined processing device.

In some embodiments, a chip packaging structure is applied, which includes the above chip.

In some embodiments, a board card is applied, which includes the chip package structure described above. Referring to FIG. 1-3a, FIG. 1-3a provides a board card. In addition to the above chip 389, the board card may also include other supporting components, which include, but are not limited to, a storage device 390 and an interface device 391 And control device 392;

The memory device 390 is connected to a chip in the chip package structure through a bus, and is used to store data. The memory device may include a plurality of sets of memory cells 393. Each group of the storage units is connected to the chip through a bus. It can be understood that the memory cells in each group may be DDR SDRAM (English: Double Data Rate SDRAM, double-rate synchronous dynamic random access memory).

DDR does not need to increase the clock frequency to double the speed of SDRAM. DDR allows data to be read on the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of the storage units. Each group of the memory cells may include a plurality of DDR4 particles (chips). In one embodiment, the chip may include four 72-bit DDR4 controllers. Among the 72-bit DDR4 controllers, 64 bits are used for data transmission and 8 bits are used for ECC check. It can be understood that when DDR4-3200 particles are used in each group of the storage units, the theoretical bandwidth of data transmission can reach 25600MB / s.

In one embodiment, each group of the storage units includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling DDR is provided in the chip, and is used for controlling data transmission and data storage of each of the storage units.

The interface device is electrically connected to a chip in the chip package structure. The interface device is used to implement data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface. For example, the data to be processed is transferred from the server to the chip through a standard PCIE interface to implement data transfer. Preferably, when the PCIE 3.0 X 16 interface is used for transmission, the theoretical bandwidth can reach 16000MB / s. In another embodiment, the interface device may also be other interfaces. The present application does not limit the specific expressions of the other interfaces described above, and the interface unit can implement the transfer function. In addition, the operation result of the chip is still transmitted by the interface device to an external device (such as a server).

The control device is electrically connected to the chip. The control device is configured to monitor a state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller (Micro Controller Unit). For example, the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, which may drive multiple loads. Therefore, the chip can be in different working states such as multiple loads and light loads. The control device can realize the regulation of the working states of multiple processing chips, multiple processes, and / or multiple processing circuits in the chip.

In the field of information processing technology, in terms of data transmission, neural networks are the basis of many current artificial intelligence applications. With the further expansion of the application scope of neural networks, many neural network models and large batches of requests have appeared. In the prior art, the calculation of the neural network can be performed in parallel using a heterogeneous computing carrier. Therefore, how to improve the data transmission efficiency between heterogeneous computing devices is a technical problem to be solved by those skilled in the art.

In order to solve the above problems, we propose the following scheme.

In this application, the computing device may include various handheld devices with wireless communication functions, vehicle-mounted devices, wearable devices, computing devices, or other processing devices connected to a wireless modem, and various forms of user equipment (user equipment, UE ), A mobile station (MS), a terminal device (terminal), etc., the computing device may also include a system-on-chip (SOC).

In this application, the computing carrier may be a central processing unit (CPU), a graphics processing unit (GPU), an application specific integrated circuit (ASIC), a field programmable logic gate array (ASIC) Field-Programmable Gate Array (FPGA), Coarse-Grained Re-configurable Array (CGRA), Digital Signal Processing (DSP), etc.

The embodiments of the present application provide a data transmission method and related products, which can improve the data transmission efficiency between different computing carriers and facilitate the improvement of the neural network operation efficiency. The present application will be further described in detail below with reference to specific embodiments and with reference to the drawings.

Please refer to FIG. 2-1, which is a schematic structural diagram of a computing device according to an embodiment of the present application. As shown in FIG. 2-1, the computing device 100 includes a plurality of computing carriers such as a first computing carrier 101, a second computing carrier 102, and an N-th computing carrier 103. Wherein N is a positive integer greater than 2, the multiple computing carriers may include at least two of the above-mentioned CPUs, GPUs, ASICs, FPGAs, CGRAs, or DSPs, and may also include the above-mentioned two same-type carriers, for example, 2 CPUs, 2 GPUs, 1 ASIC, or 1 FPGA.

In a possible implementation manner, each computing carrier may include at least one computing unit for a neural network operation, such as a processing chip and the like. The specific structure of the computing unit is not limited. Please refer to FIG. 2-1a. FIG. 2-1a is a schematic structural diagram of a computing unit. As shown in Figure 2-1a, the calculation unit includes: a main processing circuit, a basic processing circuit, and a branch processing circuit. Specifically, the main processing circuit is connected to the branch processing circuit, and the branch processing circuit is connected to at least one basic processing circuit.

The branch processing circuit is used to send and receive data from the main processing circuit or the basic processing circuit.

Referring to FIG. 2-1b, FIG. 2-1b is a schematic structural diagram of a main processing circuit. As shown in FIG. 2-1b, the main processing circuit may include a register and / or an on-chip buffer circuit. The main processing circuit may further include: a control Circuit, vector operator circuit, ALU (arithmetic and logic unit) circuit, accumulator circuit, DMA (Direct Memory Access) circuit and other circuits, of course, in actual applications, the above main processing circuit also It may include a conversion circuit (such as a matrix transposition circuit), a data rearrangement circuit, or an activation circuit and the like.

The main processing circuit also includes a data sending circuit, a data receiving circuit, or an interface. The data sending circuit can integrate a data distribution circuit and a data broadcasting circuit. Of course, in practical applications, the data distribution circuit and the data broadcasting circuit can also be set separately; in actual applications The above-mentioned data transmitting circuit and data receiving circuit may also be integrated together to form a data transmitting and receiving circuit. For broadcast data, that is, data that needs to be sent to each basic processing circuit. For the distribution data, that is, the data that needs to be selectively sent to part of the basic processing circuit, the specific selection method can be specifically determined by the main processing circuit according to the load and the calculation method. The broadcast transmission method is to broadcast data to each basic processing circuit in a broadcast form. (In practical applications, broadcast data can be sent to each basic processing circuit by one broadcast, and broadcast data can be sent to each basic processing circuit by multiple broadcasts. The specific implementation of this application is not limited. The number of broadcasts mentioned above), and the distribution and transmission method is to selectively send the distribution data to some basic processing circuits.

When the data is distributed, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits (the data may be the same or different. Specifically, if the data is sent in a distributed manner, each basic processing circuit that receives the data receives The data received can be different, of course, the data received by some basic processing circuits can also be the same;

Specifically, when broadcasting data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits, and each basic processing circuit that receives the data can receive the same data, that is, the broadcast data can include all the basic processing circuits that need to receive The data. Distributing the data may include: part of the data that the basic processing circuit needs to receive. The main processing circuit may send the broadcast data to all the branch processing circuits through one or more broadcasts, and the branch processing circuits forward the broadcast data to all the basic processing circuits.

Optionally, the vector operator circuit of the main processing circuit described above can perform vector operations, including but not limited to: addition, subtraction, multiplication, and division of two vectors, addition and subtraction of vectors and constants, or operations on each element of a vector Perform arbitrary operations. Among them, the continuous operation may specifically be addition and subtraction of vectors and constants, multiplication, division operations, activation operations, accumulation operations, and the like.

Each basic processing circuit may include a basic register and / or a basic on-chip cache circuit; each basic processing circuit may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like. The inner product operator circuit, the vector operator circuit, and the accumulator circuit may all be integrated circuits, and the inner product operator circuit, the vector operator circuit, and the accumulator circuit may also be separately provided circuits.

The connection structure of the branch processing circuit and the basic circuit may be arbitrary, and is not limited to the H-shaped structure in FIG. 2-1b. Optionally, the main processing circuit to the basic circuit is a broadcast or distribution structure, and the basic circuit to the main processing circuit is a gather structure. Broadcasting, distribution and collection are defined as follows:

The data transmission mode from the main processing circuit to the basic circuit may include:

The main processing circuit is respectively connected to a plurality of branch processing circuits, and each branch processing circuit is respectively connected to a plurality of basic circuits.

The main processing circuit is connected to a branch processing circuit, and the branch processing circuit is further connected to a branch processing circuit, and so on, a plurality of branch processing circuits are connected in series, and then each branch processing circuit is respectively connected to a plurality of basic circuits.

The main processing circuit is respectively connected to a plurality of branch processing circuits, and each branch processing circuit is further connected in series with a plurality of basic circuits.

The main processing circuit is connected to a branch processing circuit. The branch processing circuit is further connected to a branch processing circuit, and so on, and a plurality of branch processing circuits are connected in series. Then, each branch processing circuit is connected in series to a plurality of basic circuits.

When distributing data, the main processing circuit transmits data to some or all of the basic circuits, and the data received by each basic circuit that receives the data may be different;

When broadcasting data, the main processing circuit transmits data to some or all of the basic circuits, and each basic circuit that receives the data receives the same data.

When data is collected, some or all of the underlying circuits transmit data to the main processing circuit. It should be noted that the computing unit shown in Figure 2-1a may be a separate physical chip. Of course, in practical applications, the computing unit may also be integrated in other chips (such as CPU, GPU). This application is specifically implemented. The manner does not limit the physical expression of the chip device.

Referring to Figure 2-1c, Figure 2-1c is a schematic diagram of data distribution of a computing unit, as shown by the arrow in Figure 2-1c, and this arrow is the data distribution direction. As shown in Figure 2-1c, the main processing circuit receives After the external data, the external data is split and distributed to multiple branch processing circuits, and the branch processing circuit sends the split data to the basic processing circuit.

Refer to Figure 2-1d, Figure 2-1d is a schematic diagram of data return of a computing unit, as shown by the arrow in Figure 2-1d, the arrow is the direction of data return, as shown in Figure 2-1d, basic processing The circuit returns the data (such as the result of the inner product operation) to the branch processing circuit, and the branch processing circuit is returning to the main processing circuit.

For the input data, the specific data may be vector, matrix, multi-dimensional (three-dimensional or four-dimensional or more) data, and for a specific value of the input data, it may be called an element of the input data.

The embodiment of the present disclosure also provides a calculation method of a calculation unit shown in FIG. 2-1a. The calculation method is applied to the calculation of a neural network. Specifically, the calculation unit may be used for a layer or a multi-layer neural network. Multiple layers of input data and weight data perform operations.

Specifically, the calculation unit is configured to perform an operation on one or more input data and weight data of the trained multi-layer neural network;

Or, the calculation unit is configured to perform an operation on one or more layers of input data and weight data in a multi-layer neural network in a forward operation.

The above operations include, but are not limited to, one or any combination of convolution operations, matrix multiplication matrix operations, matrix multiplication vector operations, offset operations, fully connected operations, GEMM operations, GEMV operations, and activation operations.

GEMM calculation refers to the matrix-matrix multiplication operation in the BLAS library. The usual representation of this operation is: C = alpha * op (S) * op (P) + beta * C, where S and P are the two matrices of the input, C is the output matrix, alpha and beta are the scalars, and op Represents some kind of operation on the matrix S or P, in addition, there will be some auxiliary integers as parameters to explain the width and height of the matrix S and P;

GEMV calculation refers to the matrix-vector multiplication operation in the BLAS library. The general representation of this operation is: C = alpha * op (S) * P + beta * C, where S is the input matrix, P is the input vector, C is the output vector, alpha and beta are scalars, and op represents the pair Some operation of the matrix S.

This application does not limit the connection relationship between computing carriers in a computing device, and may be a homogeneous or heterogeneous computing carrier. It also does not limit the connection relationship between computing units in a computing carrier. The computing unit executes parallel tasks, which can improve computing efficiency.

In Figure 2-1, each computing carrier further includes at least one on-chip cache circuit and one off-chip cache circuit. For example, the first computing carrier 101 includes a first on-chip cache circuit 1011 and a first off-chip cache circuit 1012. The second computing carrier 102 includes a second on-chip cache circuit 1021 and a second off-chip cache circuit 1022. The N-th computing carrier 103 includes an N-th on-chip cache circuit 1031 and an N-th off-chip cache circuit 1032.

The on-chip cache circuit may include on-chip memory, including, but not limited to, double data rate memory (DDRM), dynamic random access memory (Dynamic Random Access Memory, DRAM), and three times dynamic random access memory. Take three memory (ThreeDataDRAM), three times static random access memory (ThreeDataStaticRandom-AccessMemory, 3DSRAM) and other forms; the off-chip cache circuit can be off-chip memory (Off-chipMemory), including but not limited to SharedMemory, Cache and so on. The cache may include a multilayer structure, such as an N-layer cache structure, including L1Cache, L2Cache, ..., LNCache.

As shown in FIG. 2-1, the computing device 100 further includes an on-chip storage data path control circuit 110 connected to each on-chip cache circuit, and an on-chip storage data path 121 connected to the on-chip storage data path control circuit 110, wherein: on-chip The storage data path control circuit 110 is configured to receive a data transmission instruction sent by the first on-chip cache circuit 1011 of the first computing carrier 101 among the plurality of computing carriers; and decode the data transmission instruction to obtain the transmitted data. An address and a received data address; the on-chip buffer circuit data path 121 is configured to obtain target data according to the sent data address, and transmit the target data to the received data address.

The first computing carrier 101 is any one of a plurality of computing carriers, and the data transmission instruction is a binary file. In this application, a data transmission instruction is decoded to obtain a sending data address and a receiving data address, and parameters such as a data capacity and a data identifier for determining target data can also be obtained. The sending data address is an address where the target data is stored in the first on-chip cache circuit, and the receiving data address is an address in the second on-chip cache circuit 1021 of the second computing carrier 102 of the plurality of computing carriers, that is, the The data transfer instruction instructs the on-chip storage data path control unit 110 to transfer the target data buffered in the first on-chip cache circuit 1011 to the second on-chip cache circuit 1021, that is, it is determined that the computing carrier that the first computing carrier 101 performs data transmission in advance is the second Computing carrier 1021.

It can be understood that when the on-chip storage data path control circuit 110 receives a data transmission instruction sent by the first on-chip cache circuit 1011, it decodes the data transmission instruction to obtain a sending data address and a receiving data address. The circuit data path 121 obtains the target data corresponding to the sending data address and transmits the target data to the receiving data address, and the second on-chip cache circuit 1021 caches the target data, thereby completing the on-chip cache circuit between the two computing carriers. data transmission.

For the on-chip storage data path control circuit 110, multiple data transmission instructions can be received at the same time. Therefore, it is necessary to determine the transmission order between the data transmission instructions. This application does not limit how to determine the execution order. The priorities corresponding to the data transmission instructions can be obtained to obtain multiple priorities. Each data in the multiple data transmission instructions is determined according to the multiple priorities. The execution order of the transfer instruction.

Among them, the priority can be obtained through multiple dimensions such as the quantity and capacity of the target data, the priority of the target data, or the priority of the first on-chip cache circuit and the remaining memory size.

It can be understood that the on-chip storage data path control circuit 110 determines the execution order between the data transmission instructions, and controls the on-chip cache circuit data path 121 to perform data transmission according to the execution order, which can improve the stability of the transmission.

In a possible embodiment, as shown in FIG. 2-1e, the on-chip storage data path control circuit 110 includes an instruction cache unit 1101, an instruction decoding unit 1102 connected to the instruction cache unit 1101, and A memory management unit 1103 connected to the instruction cache unit 1101 and the instruction decoding unit 1102, where:

The instruction buffer unit 1101 is configured to buffer the data transmission instruction;

The instruction decoding unit 1102 is configured to decode the data transmission instruction to obtain the sending data address and the receiving data address;

The memory management unit 1103 is configured to manage the data transmission instruction.

It can be understood that the on-chip storage data path control circuit 110 is further divided to obtain an instruction cache unit 1101, an instruction decoding unit 1102, and a memory management unit 1103, respectively, and execute corresponding steps, that is, the data management instruction is managed by the memory management unit 1103, that is, When the data transmission instruction is executed, it is directly called from the instruction buffer unit 1101, and the data decoding instruction is translated by the instruction decoding unit 1102 to complete the data transmission, so that the execution efficiency and the stability of execution are improved.

Further, as shown in FIG. 2-1f, the memory management unit 1103 includes an address mapping module 11031, a request arbitration module 11032, and a consistency control module 11033, where:

The address mapping module 11031 is configured to determine the second on-chip cache circuit corresponding to the received data address;

The request arbitration module 11032 is configured to allocate an execution order of each data transmission instruction in the plurality of data transmission instructions if the instruction cache unit includes a plurality of the data transmission instructions;

The consistency control module 11033 is configured to ensure consistency of data transmission.

It can be understood that the memory management unit 1103 is further divided to obtain the address mapping module 11031, the request arbitration module 11032, and the consistency control module 11033, respectively, and corresponding steps are performed, that is, the address mapping module 11031 is used to determine the target data to be cached. The request arbitration module 11032 determines the execution order of each data transmission instruction, and controls the on-chip cache circuit data path 121 for data transmission according to the transmission order, which can improve the stability of the transmission. And the consistency control module 11033 ensures the consistency of data transmission, which improves the stability of the transmission and the security of execution.

In one embodiment, as shown in FIG. 2-1, the computing device 100 further includes a fast peripheral device interconnect bus (PCIE) data path 122 connected to each off-chip cache circuit, for implementing the described Data transmission between off-chip cache circuits of any two computing carriers in multiple computing carriers.

It can be seen that the off-chip storage data between the various computing carriers can be directly used for data interaction through the PCIE data path 122, that is, the off-chip cached data is exchanged through the dedicated off-chip storage data path 122 to support larger-scale machines Learning operations. It can also be connected to various types of servers through the PCIE interface, which improves transmission efficiency.

Please refer to FIG. 2-2, which is a schematic flowchart of a data transmission method proposed by this application. The data transmission method is applied to a computing device shown in FIG. 2-1, that is, the computing device includes multiple computing carriers, and an on-chip storage data path control connected to an on-chip cache circuit of each computing carrier in the multiple computing carriers. A circuit, and an on-chip storage data path connected to the on-chip storage data path control circuit. Specifically, as shown in Figure 2-2:

S201: Receive a data transmission instruction sent by a first on-chip cache circuit of a first computing carrier in a plurality of computing carriers through an on-chip storage data path control circuit.

S202: Decode the data transmission instruction through the on-chip storage data path control circuit to obtain a sending data address and a receiving data address.

S203: Obtain target data according to the sending data address through an on-chip buffer circuit data path, and transmit the target data to the receiving data address.

The received data address is an address in a second on-chip cache circuit of a second computing carrier of the plurality of computing carriers.

It can be understood that the on-chip storage data path control circuit receives the data transmission instruction sent by the first on-chip cache circuit of the first computing carrier in the plurality of computing carriers, and then the on-chip storage data path control circuit sends the data transmission instruction. Decode to obtain a sending data address and a receiving data address, obtain target data according to the sending data address through an on-chip buffer circuit data path, and transmit the target data to the receiving data address. In this way, the data transmission efficiency between different computing carriers can be improved, and it is convenient to improve the operation efficiency of the neural network.

In a possible embodiment, the on-chip storage data path control circuit includes an instruction cache unit, an instruction decoding unit connected to the instruction cache unit, and an instruction cache unit connected to the instruction cache unit and the instruction decoding unit. The memory management unit, which decodes the data transmission instruction through the on-chip storage data path control circuit to obtain a sending data address and a receiving data address, includes:

Decoding the data transmission instruction by the instruction decoding unit to obtain the sending data address and the receiving data address;

The method further includes:

Buffering the data transmission instruction by the instruction buffer unit;

The data transmission instruction is managed by the memory management unit.

In a possible embodiment, the memory management unit includes an address mapping module, a request arbitration module, and a consistency control module, and the managing the data transmission instruction by the memory management unit includes:

Determining, by the address mapping module, the second on-chip cache circuit corresponding to the received data address;

If the instruction buffer unit includes a plurality of the data transmission instructions, determining an execution order of each data transmission instruction in the plurality of data transmission instructions through the request arbitration module;

The consistency control module ensures data transmission consistency.

In a possible embodiment, the computing device further includes a fast external device interconnect bus PCIE data path, and the method further includes:

Data transmission between the off-chip cache circuits of any two computing carriers in the plurality of computing carriers is implemented through the PCIE data path.

In a possible embodiment, the multiple computing carriers include a central processing unit CPU, an image processor GPU, an application specific integrated circuit ASIC, a field programmable logic gate array FPGA, a coarse-grained reconfigurable array CGRA, or digital signal processing. At least two of the processor DSPs.

In a possible embodiment, the calculation carrier includes at least one calculation unit. The calculation unit includes: a main processing circuit, a branch processing circuit, and a basic processing circuit. The main processing circuit is connected to the branch processing circuit. The basic processing circuit is connected to the branch processing circuit, and the method further includes:

Acquiring data other than the computing unit through the main processing circuit, and dividing the data into broadcast data and distribution data;

Sending the broadcast data to all branch processing circuits in a broadcast manner through the main processing circuit, and selectively distributing the distribution data to different branch processing circuits;

Forwarding data between the main processing circuit and the basic processing circuit through the branch processing circuit;

Receiving broadcast data and distribution data forwarded by the branch processing circuit through the basic processing circuit, and performing operations on the broadcast data and distribution data to obtain an operation result, and sending the operation result to the branch processing circuit;

Receiving the operation result of the basic processing circuit forwarded by the branch processing circuit through the main processing circuit, and processing the operation result to obtain an operation result.

In a possible embodiment, the sending the broadcast data to all the branch processing circuits in a broadcast manner by using the main processing circuit includes:

Sending the broadcast data to all branch processing circuits in one broadcast or multiple broadcasts through the main processing circuit;

The operation performed by the basic processing circuit on the broadcast data and the distribution data to obtain an operation result includes:

The basic processing circuit performs an inner product operation, a product operation, or a vector operation on the broadcast data and the distribution data to obtain an operation result.

In one embodiment of the present invention, a computing device is provided, including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and are configured by The processor executes the implementation manner described in the data transmission method.

In another embodiment of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program. The computer program includes program instructions. When the program instructions are executed by a processor, the processor executes the instructions. The implementation manner described in the data transmission method.

The present application also discloses a combined processing device, which includes the above-mentioned computing device, a universal interconnection interface, and other processing devices. The machine learning computing device interacts with other processing devices to jointly complete the operation specified by the user. Figure 2-3 is a schematic structural diagram of a combined processing device.

Other processing devices include one or more types of processors such as a central processing unit CPU, a graphics processor GPU, and a neural network processor. The number of processors included in other processing devices is not limited. Other processing devices serve as the interface between the machine learning computing device and external data and control, including data handling, to complete the basic control of the machine learning computing device such as start and stop; other processing devices can also cooperate with the machine learning computing device to complete the computing tasks.

A universal interconnection interface for transmitting data and control instructions between the machine learning computing device and other processing devices. The machine learning computing device obtains required input data from other processing devices and writes it to the storage device on the chip of the machine learning computing device; it can obtain control instructions from other processing devices and write it to the control cache on the machine learning computing device chip; The data in the storage module of the machine learning computing device can be read and transmitted to other processing devices.

Optionally, the combined processing device shown in FIG. 2-3 may further include a storage device, and the storage device is connected to the machine learning computing device and the other processing devices, respectively. The storage device is configured to store data stored in the machine learning computing device and the other processing devices, and is particularly suitable for data that cannot be completely stored in the internal storage of the machine learning computing device or other processing devices.

In some embodiments, a chip is also applied, which includes the above-mentioned machine learning computing device or combined processing device.

In some embodiments, a board card is applied, which includes the chip package structure described above. Referring to FIG. 2-4, FIG. 2-4 provides a board card. In addition to the above chip, the board card may also include other supporting components. The supporting components include, but are not limited to, a storage device, an interface device, and a control device. ;

The memory device is connected to a chip in the chip package structure through a bus, and is used to store data. The memory device may include a plurality of groups of memory cells. Each group of the storage units is connected to the chip through a bus. It can be understood that each group of the storage units may be a double-rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM).

In some embodiments, an electronic device is applied, which includes the board card described above.

Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, cameras, camcorders, projectors, watches, headphones , Mobile storage, wearables, transportation, home appliances, and / or medical devices.

The vehicles include airplanes, ships, and / or vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, cooker hoods; the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and / or electrocardiograph.

It should be noted that, for the foregoing method embodiments, for the sake of simple description, they are all described as a series of action combinations. However, those skilled in the art should know that this application is not limited by the described action order. Because according to the present application, certain steps may be performed in another order or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required for this application.

In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to related descriptions in other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are only schematic. For example, the division of the unit is only a logical function division. In actual implementation, there may be another division manner. For example, multiple units or components may be combined or may Integration into another system, or some features can be ignored or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of software program modules.

When the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on such an understanding, the technical solution of the present application essentially or part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, which is stored in a memory, Several instructions are included to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. The foregoing memories include: U disks, Read-Only Memory (ROM), Random Access Memory (RAM), mobile hard disks, magnetic disks, or optical disks and other media that can store program codes.

A person of ordinary skill in the art may understand that all or part of the steps in the various methods of the foregoing embodiments may be completed by a program instructing related hardware. The program may be stored in a computer-readable memory, and the memory may include a flash disk. , Read-only memory (English: Read-Only Memory, referred to as ROM), random access device (English: Random Access Memory, referred to as RAM), magnetic disks or optical disks, etc.

The embodiments of the present application have been described in detail above. Specific examples have been used in this document to explain the principles and implementation of the present application. The descriptions of the above embodiments are only used to help understand the methods and core ideas of the present application. Persons of ordinary skill in the art may change the specific implementation and application scope according to the idea of the present application. In summary, the content of this description should not be construed as a limitation on the present application.

Claims

A computing device, characterized in that the computing device comprises: a group X neural network chip, each group of the neural network chip in the group X includes a master chip and at least one slave chip, and the master The chip is connected to the slave chip, and the master chip in the X group of neural network chips is connected, and the value of X is an integer greater than or equal to 2;

Each neural network chip in the X group of neural network chips is configured to obtain input data and weights, and perform calculations on the weights and input data corresponding to each of the neural network chips to obtain an operation result. It is said that the input data obtained by each neural network chip is different, and the obtained weights are the same;

A first master chip in a first group of neural network chips in the X group of neural network chips, configured to receive an operation result of a slave chip connected to the first master chip;

The first master chip is configured to share the operation result of the first master chip and the received operation result of the slave chip to the master chips in other groups of neural network chips, and receive the master chip in other groups of neural network chips. The calculation result shared by the main chip.
The device according to claim 1, wherein the first main chip is further configured to:

Transmitting all operation results in the first master chip to a slave chip connected to the first master chip.
The device according to claim 1 or 2, wherein the master chip is connected to the slave chip through a tree structure, the tree structure is an n-ary tree structure, and the master chip is the n-ary The root node of the tree structure, and the slave chip is a child node of the n-ary tree structure, and the child node may be a first-level child node or a multi-level child node.
The device according to claim 1, wherein the neural network chip comprises: an arithmetic unit and a controller unit; the arithmetic unit comprises: a master processing circuit and a plurality of slave processing circuits;

The controller unit is configured to obtain input data and calculation instructions;

The controller unit is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;

The master processing circuit is configured to perform pre-processing on the input data and transmit data and operation instructions with the plurality of slave processing circuits;

The multiple slave processing circuits are configured to perform multiple intermediate operations in parallel according to data transmitted from the master processing circuit and operation instructions to obtain multiple intermediate results, and transmit the multiple intermediate results to the master processing circuit;

The main processing circuit is configured to perform subsequent processing on the multiple intermediate results to obtain an operation result of the calculation instruction.
The device according to claim 4, wherein the neural network chip further comprises: a storage unit and a direct memory access unit, and the storage unit comprises: any combination of a register and a cache;

The cache is used to store the input data;

The register is used to store scalar data in the input data;

The cache includes a high-speed temporary cache;

The controller unit includes: an instruction cache unit, an instruction processing unit, and a storage queue unit;

The instruction buffer unit is configured to store calculation instructions associated with the artificial neural network operation;

The instruction processing unit is configured to parse the calculation instruction to obtain a plurality of operation instructions;

The storage queue unit is configured to store an instruction queue, and the instruction queue includes: a plurality of operation instructions or calculation instructions to be executed in the order of the queue;

The controller unit includes: a dependency relationship processing unit;

The dependency relationship processing unit is configured to determine whether there is an association relationship between a first operation instruction and a zeroth operation instruction before the first operation instruction, such as the first operation instruction and the zeroth operation instruction have an association relationship, Storing the first operation instruction in the instruction storage unit, and after the zeroth operation instruction is executed, extracting the first operation instruction from the instruction storage unit and transmitting the first operation instruction to the operation unit;

The determining whether there is an association between the first operation instruction and a zeroth operation instruction before the first operation instruction includes:

Extracting a first storage address interval of data required in the first operation instruction according to the first operation instruction, and extracting a zeroth storage address interval of data required in the zeroth operation instruction according to the zeroth operation instruction, If the first storage address range and the zeroth storage address range have overlapping areas, it is determined that the first operation instruction and the zeroth operation instruction have an associated relationship, such as the first storage address range and the The zeroth storage address interval does not have an overlapping area, and it is determined that the first operation instruction and the zeroth operation instruction have no correlation.
The device according to claim 4 or 5, wherein when the neural network chip is a main chip, the controller unit further comprises a scheduling unit, and is specifically configured to:

Scheduling an operation result in the main chip.
The apparatus according to claim 6, wherein the scheduling the operation result in the main chip comprises:

The main chip in the X group of neural network chips connected to the same direction schedules the operation content of 1 / Y + 1, wherein the same direction includes a clockwise direction or a counterclockwise direction, and Y is the The number of slave chips connected to the master chip in the X-group neural network chip is described.
The device according to any one of claims 4 to 7, wherein:

The main processing circuit is specifically configured to combine and sort multiple intermediate results sent from the processing circuits to obtain the result of the calculation instruction;

Or, the main processing circuit is specifically configured to combine and sort the intermediate results sent by multiple slave processing circuits and obtain the result of the calculation instruction after the activation processing.
The device according to claim 8, wherein the main processing circuit comprises one or any combination of a conversion processing circuit, an activation processing circuit, and an addition processing circuit;

The conversion processing circuit is configured to perform pre-processing on the data. Specifically, the conversion processing circuit is configured to perform an interchange between a first data structure and a second data structure on data received by the main processing circuit or an intermediate result; or The data received by the circuit or the intermediate result performs an interchange between the first data type and the second data type;

The activation processing circuit is configured to perform the subsequent processing, and is specifically to perform an activation operation of data in the main processing circuit;

The addition processing circuit is configured to perform the subsequent processing, and is specifically to perform an addition operation or an accumulation operation.
The apparatus according to claim 9, wherein the slave processing circuit comprises: a multiplication processing circuit;

The multiplication processing circuit is configured to perform a multiplication operation on a received data block to obtain a multiplication result.
The device according to claim 10, wherein the slave processing circuit further comprises: an accumulation processing circuit configured to perform an accumulation operation on the product result to obtain the intermediate result.
A combined computing device, characterized in that the combined computing device comprises: M computing devices according to claim 1, the M computing devices according to claim 1 are connected, and the M is Value range is an integer greater than or equal to 2.
The combined computing device according to claim 12, wherein the M computing devices according to claim 1 are connected, comprising:

Each of the M computing devices according to claim 1, wherein a main chip of a group of neural network chips in the X group of neural network chips included is used to communicate with the X group of neural networks in other computing devices. A set of neural network chips in a master chip connection.
A computing method for executing a machine learning model, characterized in that the computing method is applied to a computing device, the computing device includes: X groups of neural network chips, each group of the X group of neural network chips It includes a master chip and at least one slave chip, the master chip is connected to the slave chip, and the master chip in the X group of neural network chips is connected, and the value of X is greater than or equal to 2 Integer

Each neural network chip in the X group of neural network chips is configured to obtain input data and weights, and perform calculations on the weights and input data corresponding to each of the neural network chips to obtain an operation result. It is said that the input data obtained by each neural network chip is different, and the obtained weights are the same;

The first master chip in the first group of neural network chips in the X group of neural network chips is configured to receive a calculation result of a slave chip connected to the first master chip, and combine the calculation result of the first chip, Obtain the first set of operation results;

The first master chip is configured to share the operation result of the first master chip and the received operation result of the slave chip to the master chips in other groups of neural network chips, and receive the master chip in other groups of neural network chips. The calculation result shared by the main chip.
The method according to claim 14, wherein the first main chip is further configured to:

Transmitting all operation results in the first master chip to a slave chip connected to the first master chip.
The method according to claim 14 or 15, wherein the master chip is connected to the slave chip through a tree structure, the tree structure is an n-ary tree structure, and the master chip is the n-ary The root node of the tree structure, and the slave chip is a child node of the n-ary tree structure, and the child node may be a first-level child node or a multi-level child node.
The method according to claim 16, wherein the neural network chip comprises: an arithmetic unit and a controller unit; the arithmetic unit comprises: a master processing circuit and a plurality of slave processing circuits;

The controller unit is configured to obtain input data and calculation instructions;

The controller unit is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;

The master processing circuit is configured to perform pre-processing on the input data and transmit data and operation instructions with the plurality of slave processing circuits;

The plurality of slave processing circuits are configured to perform intermediate operations in parallel according to data transmitted from the master processing circuit and operation instructions to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;

The main processing circuit is configured to perform subsequent processing on the multiple intermediate results to obtain an operation result of the calculation instruction.
The method according to claim 17, wherein the computing device further comprises: a storage unit and a direct memory access unit, the storage unit comprising: any combination of a register and a cache;

The cache is used to store the input data;

The register is used to store scalar data in the input data;

The cache includes a high-speed temporary cache;

The controller unit includes: an instruction cache unit, an instruction processing unit, and a storage queue unit;

The instruction buffer unit is configured to store calculation instructions associated with the artificial neural network operation;

The instruction processing unit is configured to parse the calculation instruction to obtain a plurality of operation instructions;

The storage queue unit is configured to store an instruction queue, and the instruction queue includes: a plurality of operation instructions or calculation instructions to be executed in the order of the queue;

The controller unit includes: a dependency relationship processing unit;

The dependency relationship processing unit is configured to determine whether there is an association relationship between a first operation instruction and a zeroth operation instruction before the first operation instruction, such as the first operation instruction and the zeroth operation instruction have an association relationship, Storing the first operation instruction in the instruction storage unit, and after the zeroth operation instruction is executed, extracting the first operation instruction from the instruction storage unit and transmitting the first operation instruction to the operation unit;

The determining whether there is an association between the first operation instruction and a zeroth operation instruction before the first operation instruction includes:

Extracting a first storage address interval of data required in the first operation instruction according to the first operation instruction, and extracting a zeroth storage address interval of data required in the zeroth operation instruction according to the zeroth operation instruction, If the first storage address range and the zeroth storage address range have overlapping areas, it is determined that the first operation instruction and the zeroth operation instruction have an associated relationship, such as the first storage address range and the The zeroth storage address interval does not have an overlapping area, and it is determined that the first operation instruction and the zeroth operation instruction have no correlation.
The method according to claim 18, wherein when the neural network chip is a main chip, the controller unit further comprises a scheduling unit, which is specifically configured to:

Scheduling an operation result in the main chip.
The apparatus according to claim 19, wherein the scheduling the operation result in the main chip comprises:

The main chip in the X group of neural network chips connected to the same direction schedules the operation content of 1 / Y + 1, wherein the same direction includes a clockwise direction or a counterclockwise direction, and Y is the same as the The number of slave chips connected to the master chip in the X-group neural network chip is described.
The method according to any one of claims 18-20, wherein the main processing circuit is specifically configured to combine and sort a plurality of intermediate results sent from the processing circuit to obtain a result of the calculation instruction;

Or, the main processing circuit is specifically configured to combine and sort the intermediate results sent by multiple slave processing circuits and obtain the result of the calculation instruction after the activation processing.
The method according to claim 21, wherein the main processing circuit comprises one or any combination of a conversion processing circuit, an activation processing circuit, and an addition processing circuit;

The conversion processing circuit is configured to perform pre-processing on the data. Specifically, the conversion processing circuit is configured to perform an interchange between a first data structure and a second data structure on data received by the main processing circuit or an intermediate result; or The data received by the circuit or the intermediate result performs an interchange between the first data type and the second data type;

The activation processing circuit is configured to perform the subsequent processing, and is specifically to perform an activation operation of data in the main processing circuit;

The addition processing circuit is configured to perform the subsequent processing, and is specifically to perform an addition operation or an accumulation operation.
The method according to claim 22, wherein the slave processing circuit comprises: a multiplication processing circuit;

The multiplication processing circuit is configured to perform a multiplication operation on a received data block to obtain a multiplication result.
The method according to claim 23, wherein the slave processing circuit further comprises: an accumulation processing circuit configured to perform an accumulation operation on the product result to obtain the intermediate result.
A calculation method for executing a machine learning model, characterized in that the calculation method is applied to a combination calculation device for performing machine learning calculations; the combination calculation device includes: In the computing device, the M computing devices according to claim 1 are connected, and the value range of M is an integer greater than or equal to two.
The method according to claim 23, wherein the connection between the M computing devices according to claim 1 comprises:

Each of the M computing devices according to claim 1, wherein a main chip of a group of neural network chips in the X group of neural network chips included is used to communicate with the X group of neural networks in other computing devices. A set of neural network chips in a master chip connection.