WO2020063940A1 - Computing apparatus and related product - Google Patents


Info

Publication number
WO2020063940A1
WO2020063940A1 · PCT/CN2019/108842 · CN2019108842W
Authority
WO
WIPO (PCT)
Prior art keywords
chip
processing circuit
neural network
data
instruction
Application number
PCT/CN2019/108842
Other languages
French (fr)
Chinese (zh)
Inventor
杜子东
周诗怡
刘少礼
王秉睿
张尧
周徐达
兰慧盈
Original Assignee
Shanghai Cambricon Information Technology Co., Ltd. (上海寒武纪信息科技有限公司)
Priority claimed from CN201811153022.6A external-priority patent/CN110968532B/en
Priority claimed from CN201811207452.1A external-priority patent/CN111062469B/en
Application filed by Shanghai Cambricon Information Technology Co., Ltd. (上海寒武纪信息科技有限公司)
Publication of WO2020063940A1 publication Critical patent/WO2020063940A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14: Handling requests for interconnection or transfer
    • G06F 13/16: Handling requests for interconnection or transfer for access to memory bus
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present application relates to the field of information processing technology, and in particular, to a computing device and related products.
  • a neural network is a computing model that consists of a large number of nodes (or neurons) connected to each other.
  • Existing neural network operations are implemented on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).
  • Existing training equipment trains slowly and takes a long time.
  • the embodiments of the present application provide a computing device and related products, which can improve the training speed and efficiency of the training device.
  • a computing device includes: X groups of neural network chips, where each group includes a master chip and at least one slave chip, the master chip is connected to the slave chips, and the master chips of the X groups of neural network chips are connected to one another; the value of X is an integer greater than or equal to 2;
  • Each neural network chip in the X groups of neural network chips is configured to obtain input data and weights, and to operate on the weights and the input data corresponding to that chip to obtain an operation result; the input data obtained by each neural network chip is different, while the obtained weights are the same;
  • the first master chip is configured to share its own operation result and the received operation results of its slave chips with the master chips in the other groups of neural network chips, and to receive the operation results shared by the master chips in the other groups of neural network chips.
  • a neural network chip includes: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits;
  • the controller unit is configured to obtain input data and calculation instructions
  • the controller unit is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;
  • the master processing circuit is configured to perform pre-processing on the input data and transmit data and operation instructions with the plurality of slave processing circuits;
  • the multiple slave processing circuits are configured to perform multiple intermediate operations in parallel according to data transmitted from the master processing circuit and operation instructions to obtain multiple intermediate results, and transmit the multiple intermediate results to the master processing circuit;
  • the main processing circuit is configured to perform subsequent processing on the multiple intermediate results to obtain an operation result of the calculation instruction.
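The master/slave pipeline above (controller parses, master preprocesses and distributes, slaves compute intermediate results in parallel, master combines) can be sketched roughly as follows. This is an illustrative model only: the function names and the dot-product workload are assumptions, not from the patent.

```python
# Illustrative sketch of the master/slave arithmetic-unit pipeline.
# All names and the dot-product workload are assumptions.

def master_preprocess(input_data, num_slaves):
    """Master processing circuit: split the input data into blocks,
    one per slave processing circuit."""
    step = (len(input_data) + num_slaves - 1) // num_slaves
    return [input_data[i:i + step] for i in range(0, len(input_data), step)]

def slave_execute(block, weight):
    """Slave processing circuit: multiply then accumulate one block,
    producing one intermediate result."""
    return sum(x * weight for x in block)

def master_postprocess(intermediates):
    """Master processing circuit: subsequent processing, combining the
    intermediate results into the operation result."""
    return sum(intermediates)

def run_chip(input_data, weight, num_slaves=4):
    """Controller unit's view: one calculation instruction becomes several
    operation instructions, executed by the slaves in parallel."""
    blocks = master_preprocess(input_data, num_slaves)
    intermediates = [slave_execute(b, weight) for b in blocks]
    return master_postprocess(intermediates)

print(run_chip([1, 2, 3, 4, 5, 6, 7, 8], 2))  # 72
```

The computation-heavy multiply-accumulate work lands on the slaves, mirroring the split the patent describes.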
  • a combined computing device includes: M computing devices according to claim 1, the M computing devices being connected to one another, where the value of M is an integer greater than or equal to 2.
  • a calculation method for executing a machine learning model is provided, and the calculation method is applied to the calculation device according to the first aspect.
  • a calculation method for executing a machine learning model is provided, and the calculation method is applied to the combination calculation device according to the third aspect.
  • an embodiment of the present application provides a computing device, where the computing device includes multiple computing carriers, an on-chip storage data path control circuit connected to the on-chip cache circuit of each of the multiple computing carriers, and an on-chip storage data path connected to the on-chip storage data path control circuit, wherein:
  • the on-chip storage data path control circuit is configured to receive a data transmission instruction sent by a first on-chip cache circuit of a first computing carrier of the multiple computing carriers, and to decode the data transmission instruction to obtain a sending data address and a receiving data address;
  • the on-chip storage data path is configured to obtain target data according to the sending data address and transmit the target data to the receiving data address, where the receiving data address is an address in a second on-chip cache circuit of a second computing carrier of the multiple computing carriers.
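A minimal software model of this transfer path might look like the following. The instruction layout, carrier names, and addresses are all hypothetical; the point is only the decode-then-copy flow between two on-chip cache circuits.

```python
# Minimal model of the on-chip storage data path: the control circuit decodes
# a transmission instruction into a sending address and a receiving address,
# and the data path copies target data from one carrier's on-chip cache to
# another's. All names and the instruction layout are assumptions.

caches = {
    "carrier1": {0x10: [1.0, 2.0, 3.0]},   # first on-chip cache circuit
    "carrier2": {},                         # second on-chip cache circuit
}

def decode(instruction):
    """On-chip storage data path control circuit: decode the instruction."""
    return instruction["send_addr"], instruction["recv_addr"]

def transmit(instruction):
    """On-chip storage data path: move target data between cache circuits."""
    (src_carrier, src_addr), (dst_carrier, dst_addr) = decode(instruction)
    caches[dst_carrier][dst_addr] = caches[src_carrier][src_addr]

transmit({"send_addr": ("carrier1", 0x10), "recv_addr": ("carrier2", 0x20)})
print(caches["carrier2"][0x20])  # [1.0, 2.0, 3.0]
```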
  • an embodiment of the present application provides a combined processing device, where the combined processing device includes the computing device described in the first aspect, a universal interconnection interface, and other processing devices;
  • the computing device interacts with the other processing devices to jointly complete a computing operation designated by the user.
  • an embodiment of the present application provides a system-on-chip including the computing device according to the first aspect or the combined processing device according to the second aspect.
  • an embodiment of the present application provides a data transmission method, which is applied to a computing device according to the first aspect, and the method includes:
  • an embodiment of the present application provides another computing device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for some or all of the steps described in the fourth aspect.
  • an embodiment of the present application provides a computer-readable storage medium.
  • the computer storage medium stores a computer program, where the computer program includes program instructions that, when executed by a processor, cause the processor to execute the method of the fourth aspect described above.
  • FIG. 1-1a is a schematic diagram of a neural network training device according to an embodiment of the present application.
  • FIG. 1-1b is a schematic diagram of a chip connection structure of a computing device according to an embodiment of the present application.
  • FIG. 1-1c is a schematic diagram of a chip connection structure of another computing device according to an embodiment of the present application.
  • FIG. 1-1d is a schematic diagram of a chip connection structure of another computing device according to an embodiment of the present application.
  • FIG. 1-1e is a schematic structural diagram of a neural network chip according to an embodiment of the present application.
  • FIG. 1-1f is a schematic diagram of a scheduling strategy for a computing result of a main chip according to an embodiment of the present application.
  • FIG. 1-1g is a schematic structural diagram of a combined computing device according to an embodiment of the present application.
  • FIG. 1-2 is a schematic diagram of a combination processing device provided by an embodiment of the present application.
  • FIG. 1-3 is a structural diagram of another combination processing device provided by an embodiment of the present application.
  • FIG. 1-3a is a schematic structural diagram of a board card according to an embodiment of the present application.
  • FIG. 2-1 is a schematic structural diagram of a computing device according to an embodiment of the present application.
  • FIG. 2-1a is a schematic structural diagram of a computing unit according to an embodiment of the present application.
  • FIG. 2-1b is a schematic structural diagram of a main processing circuit according to an embodiment of the present application.
  • FIG. 2-1c is a schematic diagram of data distribution of a computing unit according to an embodiment of the present application.
  • FIG. 2-1d is a schematic diagram of data return of a computing unit according to an embodiment of the present application.
  • FIG. 2-1e is a schematic structural diagram of an on-chip storage data path control circuit according to an embodiment of the present application.
  • FIG. 2-1f is a schematic structural diagram of a memory management unit according to an embodiment of the present application.
  • FIG. 2-3 is a schematic structural diagram of a combination processing device according to an embodiment of the present application.
  • FIG. 2-4 is a schematic structural diagram of a board card according to an embodiment of the present application.
  • an embodiment herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application.
  • the appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are they independent or alternative embodiments that are mutually exclusive with other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.
  • the neural network training device consists of multiple neural network chips; the multiple neural network chips perform multiple tasks, or divide a single task into segments based on the characteristics of the deep learning algorithm, and are scheduled to cooperate in completing the training task.
  • the arrangement and cooperation of multiple neural network chips in the neural network training device are specifically described in the following embodiments.
  • a training device includes: X groups of neural network chips, each group including a master chip and at least one slave chip, where the master chip is connected to the slave chips and the master chips of the X groups of neural network chips are connected to one another; the value of X is an integer greater than or equal to 2.
  • Each neural network chip in the X groups of neural network chips is used to obtain input data and weights, and to operate on the weights and the input data corresponding to that chip to obtain an operation result; the input data obtained by each chip is different, while the obtained weights are the same;
  • the first master chip in the first group of the X groups of neural network chips is used to receive the operation results of the slave chips connected to it; the first master chip is also used to share its own operation result and the received slave-chip operation results with the master chips in the other groups of neural network chips, and to receive the operation results shared by those master chips.
  • X can be any integer greater than or equal to 2, such as 2, 3, 5, or 8.
  • each group of neural network chips includes a master chip and at least one slave chip, where the number of slave chips in different groups can be the same or different. For example, the master chips of the first two groups of neural network chips can each be connected to 3 slave chips, while the master chip of the last group is connected to 4 slave chips.
  • Preferably, the slave chips are divided equally among the master chips, so that each master chip receives the operation results of its slave chips and the operation results can be scheduled quickly between the master chips.
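The arrangement above amounts to data parallelism: every chip runs the same weights on a different shard of the input, and each master collects the results of its own group. A rough sketch, with all names and the toy elementwise workload assumed:

```python
# Hedged sketch of the data-parallel arrangement: same weights everywhere,
# a different input shard per chip, master collects its group's results.
# Names and the workload are illustrative, not from the patent.

def shard(data, num_chips):
    """Split one data set into equal shards, one per chip."""
    step = (len(data) + num_chips - 1) // num_chips
    return [data[i:i + step] for i in range(0, len(data), step)]

def chip_compute(shard_data, weight):
    """Every chip runs the same model (weight) on its own shard."""
    return [x * weight for x in shard_data]

def master_collect(master_shard, slave_shards, weight):
    """A master chip's result set: its own result plus its slaves' results."""
    results = [chip_compute(master_shard, weight)]
    results += [chip_compute(s, weight) for s in slave_shards]
    return results

shards = shard(list(range(8)), 4)                 # 4 chips in one group
group = master_collect(shards[0], shards[1:], weight=3)
print(len(group))  # master result + 3 slave results -> 4
```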
  • FIG. 1-1b shows a chip connection structure of a computing device according to an embodiment of the present application. Here X is 4; chip 4, chip 8, chip 13, and chip 10 are the master chips, and 3 slave chips are connected to each master chip.
  • Chips 1 to 16 all obtain input data and weights; each chip obtains different input data while the weights are the same, so each chip trains different input data with the same training model.
  • the input data of each chip can be data corresponding to multiple tasks, or a data set segmented from the same task. The segmentation of the data set can be completed in an external device, in another module of the computing device, or in the master chip of one of the groups of neural network chips in the computing device.
  • the first master chip is used to receive the operation results of the slave chips connected to the first master chip.
  • the first master chip may be any one of master chip 4, master chip 8, master chip 10, and master chip 13. Each master chip obtains the operation results of the slave chips connected to itself, so that in the end the operation results it holds are its own operation result together with those of its connected slave chips.
  • the operation results held by the master chips are then shared among the X master chips.
  • the operation results are transmitted cyclically in the same direction, for example clockwise, that is: chip 4 → chip 8 → chip 13 → chip 10 → chip 4; or counterclockwise, that is: chip 4 → chip 10 → chip 13 → chip 8 → chip 4.
  • all the operation results held by a master chip can be transferred to the next adjacent master chip at one time, or transferred in multiple steps.
  • this connection structure can improve data training efficiency through multiple chips on the one hand, and on the other hand allows the master chip to schedule the calculation results of each slave chip, so that only the performance of the master chips needs to be improved, not that of the slave chips, which saves cost.
  • the first master chip is further configured to: transmit all operation results in the first master chip to a slave chip connected to the first master chip.
  • After master chip 4, master chip 8, master chip 10, and master chip 13 have completed the shared transfer, they each hold the calculation results of all the chips; each master chip then passes the calculation results it holds to its connected slave chips, so that every slave chip holds the operation results of all chips.
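The cyclic sharing plus final broadcast can be modeled roughly as a ring all-gather among the masters followed by a push down to the slaves. The chip numbering follows FIG. 1-1b; everything else (names, result labels) is assumed for illustration.

```python
# Sketch of the cyclic sharing among master chips: results travel clockwise
# around the ring 4 -> 8 -> 13 -> 10 -> 4 until every master holds all
# results, then a master passes the full set down to its slave chips.
# Names and labels are assumptions.

def ring_share(ring):
    """ring: {master_id: set of result labels it currently holds}."""
    ids = list(ring)
    for _ in range(len(ids) - 1):                 # X-1 rounds around the ring
        snapshot = {m: set(r) for m, r in ring.items()}
        for i, m in enumerate(ids):
            nxt = ids[(i + 1) % len(ids)]         # clockwise neighbour
            ring[nxt] |= snapshot[m]              # forward what m had
    return ring

masters = {4: {"g4"}, 8: {"g8"}, 13: {"g13"}, 10: {"g10"}}
ring_share(masters)

# Master 4 then broadcasts the full set to its slave chips 1, 2, 3.
slaves_of_4 = {s: set(masters[4]) for s in (1, 2, 3)}

print(all(r == {"g4", "g8", "g13", "g10"} for r in masters.values()))  # True
```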
  • the master chip is connected to the slave chip through a tree structure, the tree structure is an n-tree structure, the master chip is a root node of the n-tree structure, and the slave chip is a child node of the n-tree structure.
  • the child nodes may be one level of child nodes or multiple levels of child nodes. The master chip in the X groups of neural network chips can be connected to the slave chips through a tree structure, where the master chip is the root node of the tree structure and the slave chips are child nodes, which may be first-level or multi-level child nodes.
  • when the master chip obtains the operation results of the slave chips, it can obtain each slave chip's result directly, or a slave chip directly connected to the master chip can collect the results of other slave chips and pass them on to the master chip.
  • this connection structure can improve data training efficiency through multiple chips on the one hand, and on the other hand allows the master chip to schedule the calculation results of each slave chip, so that only the performance of the master chips needs to be improved, not that of the slave chips, which saves cost.
  • the slave chips are connected to the master chip through a tree structure, and the operation results of the slave chips can be integrated before being sent to the master chip, which reduces the operation pressure on the master chip and further reduces wear on it.
  • FIG. 1-1c is another chip connection structure of a computing device provided by an embodiment of the present application.
  • X is 4, and in the 4 groups of neural network chips the master chips are master chip 31, master chip 32, master chip 33, and master chip 34.
  • Each master chip is connected to the slave chip through a tree structure.
  • the master chip 31 is the root node; the slave chips connected to it, chip 311, chip 312, and chip 313, are first-level child nodes, and the slave chips connected to slave chip 311, namely chip 3111, chip 3112, and chip 3113, are second-level child nodes.
  • the other slave chips are also primary child nodes or secondary child nodes.
  • FIG. 1-1d is a schematic diagram of a chip connection structure of another computing device according to an embodiment of the present application. As shown in FIG. 1-1d, the master chip is connected to the slave chips through a tree structure that includes three levels of child nodes; the operation results of the leaf nodes at the lowest level can be transferred directly to the master chip, or integrated by the slave chip at the upper-level child node and then transferred to the master chip.
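The tree-structured collection can be sketched as a recursive gather toward the root. The chip numbering mirrors FIG. 1-1c; the node representation and function name are assumptions.

```python
# Illustrative sketch of the n-tree connection: the master chip is the root,
# slave chips are child nodes, and a slave at an upper level integrates the
# results of its own subtree before forwarding them toward the master.

def gather(node):
    """node: (result, [children]); returns all results under this node,
    integrated level by level on the way up to the root (master chip)."""
    result, children = node
    collected = [result]
    for child in children:
        collected += gather(child)       # child integrates its own subtree
    return collected

# Master chip 31 with first-level slaves 311/312/313; slave 311 has
# second-level slaves 3111/3112/3113 (mirrors FIG. 1-1c).
tree = ("r31", [
    ("r311", [("r3111", []), ("r3112", []), ("r3113", [])]),
    ("r312", []),
    ("r313", []),
])
print(len(gather(tree)))  # 7 operation results reach the root
```

Because chip 311 forwards its children's results along with its own, the master never talks to the second-level chips directly, matching the reduced load on the master described above.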
  • the neural network computing device involved in the embodiment of the present application includes a neural network chip.
  • FIG. 1-1e is a schematic structural diagram of a neural network chip provided by an embodiment of the present application, as shown in FIG. 1-1e.
  • the neural network chip includes: an arithmetic unit 12 and a controller unit 11; the arithmetic unit 12 includes: a master processing circuit 101 and a plurality of slave processing circuits 102;
  • the controller unit 11 is configured to obtain input data and calculation instructions.
  • the input data and calculation instructions may be obtained through a data input and output unit, which may be one or multiple data I/O interfaces or I/O pins.
  • the above calculation instructions include, but are not limited to, forward operation instructions or backward training instructions, or other neural network operation instructions, such as convolution operation instructions.
  • the specific implementation manner of this application does not limit the specific expressions of the above calculation instructions.
  • the controller unit 11 is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and input data to a main processing circuit;
  • the main processing circuit 101 is configured to perform preprocessing on the input data and to transmit data and operation instructions with the multiple slave processing circuits;
  • the multiple slave processing circuits 102 are used to perform intermediate operations in parallel according to the data and operation instructions transmitted from the main processing circuit to obtain multiple intermediate results, and to transmit the multiple intermediate results to the main processing circuit;
  • the main processing circuit 101 is configured to perform subsequent processing on the multiple intermediate results to obtain the operation result of the calculation instruction.
  • the technical solution provided in this application sets the operation unit into a master-slave structure. For the calculation instructions of the forward operation, the operation unit can split the data so that the multiple slave processing circuits perform the computation-intensive part in parallel, thereby increasing the operation speed, saving operation time, and in turn reducing power consumption.
  • the aforementioned neural network chip is specifically used for an artificial neural network operation
  • the aforementioned input data may specifically include input neuron data and weight data.
  • the above operation result may specifically be the output neuron data obtained from the artificial neural network operation.
  • the operation in the neural network can be a layer of the neural network.
  • the implementation process is that, in the forward operation, after the execution of the previous layer of the artificial neural network is completed, the operation instruction of the next layer is executed.
  • the output neurons calculated by the arithmetic unit are used as the input neurons of the next layer (or some operation is performed on the output neurons before they are used as the input neurons of the next layer), and the weights are likewise replaced with the weights of the next layer.
  • in the backward operation, the operation instruction of the next layer uses the input neuron gradient calculated in the operation unit as the output neuron gradient of the next layer (or performs some operation on the input neuron gradient before using it as the output neuron gradient of the next layer), and the weights are replaced with the weights of the next layer.
  • the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and the output layer of the entire neural network; rather, for any two adjacent layers in the network, the neurons in the lower layer of the forward operation are the input neurons, and the neurons in the upper layer of the forward operation are the output neurons.
  • the aforementioned neural network chip may further include a storage unit 10 and a direct memory access unit 50.
  • the storage unit 10 may include one of, or any combination of, a register 201 and a cache 202. Specifically, the cache is used to store the calculation instruction, the register is used to store the input data and a scalar, and the cache is a high-speed temporary cache.
  • the direct memory access unit 50 is used to read or store data from the storage unit 10.
  • the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;
  • An instruction storage unit 110 configured to store calculation instructions associated with the artificial neural network operation
  • the instruction processing unit 111 is configured to parse the calculation instruction to obtain multiple operation instructions
  • the storage queue unit 113 is configured to store an instruction queue, where the instruction queue includes a plurality of operation instructions or calculation instructions to be executed according to a sequence of the queue.
  • the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, which is specifically configured to decode instructions into micro instructions.
  • the slave operation processing circuit may also include another controller unit, and the other controller unit includes a slave instruction processing unit, which is specifically configured to receive and process micro instructions.
  • the above micro-instruction may be a next-level instruction of the instruction; the micro-instruction may be obtained by splitting or decoding the instruction, and may be further decoded into control signals for each component, each unit, or each processing circuit.
  • controller unit 11 may further include:
  • the dependency relationship processing unit 112 is configured to determine, when there are multiple operation instructions, whether a first operation instruction is associated with a zeroth operation instruction that precedes it; if the first operation instruction is associated with the zeroth operation instruction, the first operation instruction is cached in the instruction storage unit and, after the zeroth operation instruction has finished executing, the first operation instruction is extracted from the instruction storage unit and transmitted to the arithmetic unit;
  • the determining whether there is an association between the first operation instruction and a zeroth operation instruction before the first operation instruction includes:
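This excerpt breaks off before spelling out the determination. One plausible criterion, offered here purely as an assumption since the test is not given in this text, is that two operation instructions are associated when the storage address intervals they access overlap:

```python
# Assumed association test: the first instruction depends on the zeroth when
# the storage address intervals of the data they access overlap. The
# instruction layout is hypothetical.

def intervals_overlap(first, zeroth):
    """Each interval is (start, end) with start <= end; closed intervals."""
    return first[0] <= zeroth[1] and zeroth[0] <= first[1]

def has_dependency(first_instr, zeroth_instr):
    """first_instr / zeroth_instr: dicts carrying the address interval of the
    data each instruction reads or writes (a layout assumed for this sketch)."""
    return intervals_overlap(first_instr["interval"], zeroth_instr["interval"])

# Overlapping intervals: the first instruction must wait for the zeroth.
print(has_dependency({"interval": (0x100, 0x1FF)},
                     {"interval": (0x180, 0x2FF)}))   # True
# Disjoint intervals: the two instructions can proceed independently.
print(has_dependency({"interval": (0x100, 0x1FF)},
                     {"interval": (0x300, 0x3FF)}))   # False
```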
  • when the neural network chip is the master chip, the controller unit 11 further includes a scheduling unit 114 for scheduling the operation results in the master chip.
  • the main chip in each group of neural network chips needs to schedule operation results, so that all the main chips share all the operation results included in each main chip.
  • scheduling needs to follow a certain scheduling strategy.
  • the operation results of the master chips in the X groups of neural network chips can first be integrated: each master chip integrates its own operation result with the received slave-chip operation results, yielding X integrated operation results. The X integrated operation results are then scheduled in the same direction according to the connection order of the master chips, each integrated operation result being dispatched once per step, until after X-1 rounds of dispatch all the master chips have obtained the X integrated operation results. Alternatively, after the X integrated operation results are obtained and scheduled in the same direction according to the connection order of the master chips, when the next master chip receives the operation result transmitted by the previous master chip, it integrates the received result with its own to form a new operation result before passing it on to the next master chip; after 2*(X-1) schedulings, all the master chips have obtained the X integrated operation results. The X master chips can also integrate their operation results only partially, or not at all, and then perform multiple partial schedulings between the master chips.
  • scheduling the operation results in the master chip includes: the master chips in the X groups of neural network chips, connected in the same direction, each schedule 1/(Y+1) of the operation content per step, where the same direction is clockwise or counterclockwise, and Y is the number of slave chips connected to each master chip in the X groups of neural network chips.
  • FIG. 1-1f shows an operation-result scheduling strategy between master chips provided in an embodiment of the present application. FIG. 1-1f corresponds to FIG. 1-1b: there are 4 groups of neural network chips, whose master chips are chip 4, chip 8, chip 13, and chip 10. The operation results in master chip 4 comprise its own operation result and the received operation results of chip 1, chip 2, and chip 3; denote these four parts as a1, b1, c1, d1. The operation results of chip 8 correspond to the four parts a2, b2, c2, d2; those of chip 13 to a3, b3, c3, d3; and those of chip 10 to a4, b4, c4, d4.
  • Scheduling is clockwise.
  • chip 4 dispatches part a1 to chip 8
  • chip 8 dispatches part b2 to chip 13
  • chip 13 dispatches part c3 to chip 10
  • chip 10 dispatches part d4 to chip 4.
  • This scheduling process can be performed simultaneously or at different times.
  • This scheduling method can save the integration time of each chip and improve scheduling efficiency.
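One round of the FIG. 1-1f strategy can be modeled as follows: each master holds Y+1 = 4 result parts and dispatches exactly one part clockwise, so all four transfers of a round can run simultaneously. The data structures are assumptions; the part labels and ring order come from the figure description above.

```python
# A minimal model of the first clockwise round in FIG. 1-1f: each master
# dispatches one of its Y+1 = 4 result parts (1/(Y+1) of its content) to the
# next master, instead of sending its whole result set at once.

parts = {
    4:  {"a1", "b1", "c1", "d1"},
    8:  {"a2", "b2", "c2", "d2"},
    13: {"a3", "b3", "c3", "d3"},
    10: {"a4", "b4", "c4", "d4"},
}
ring = [4, 8, 13, 10]                             # clockwise connection order
sends = {4: "a1", 8: "b2", 13: "c3", 10: "d4"}    # one part per master

for i, chip in enumerate(ring):
    nxt = ring[(i + 1) % len(ring)]
    parts[nxt].add(sends[chip])                   # simultaneous dispatches

print(sorted(parts[4]))  # chip 4 has now received d4 from chip 10
```

Because every master is sending and receiving a different quarter of the content in the same round, no chip sits idle waiting for a full result set, which is the scheduling-efficiency gain the paragraph above describes.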
  • the main processing circuit 101 is specifically configured to combine and sort the multiple intermediate results sent by the slave processing circuits 102 to obtain the result of the calculation instruction;
  • the main processing circuit 101 is specifically configured to combine and sort the intermediate results sent by the multiple slave processing circuits 102 and obtain the result of the calculation instruction after the activation processing.
  • the main processing circuit includes one or any combination of a conversion processing circuit, an activation processing circuit, and an addition processing circuit;
  • the conversion processing circuit is configured to perform the pre-processing on the data: specifically, to perform an interchange between a first data structure and a second data structure on the data received by the main processing circuit or on an intermediate result, or to perform an interchange between a first data type and a second data type on the data received by the main processing circuit or on an intermediate result;
  • the activation processing circuit is configured to perform the subsequent processing, and is specifically to perform an activation operation of data in the main processing circuit;
  • the addition processing circuit is configured to perform the subsequent processing, and is specifically to perform an addition operation or an accumulation operation.
  • the slave processing circuit includes: a multiplication processing circuit
  • the multiplication processing circuit is configured to perform a multiplication operation on a received data block to obtain a multiplication result.
  • the slave processing circuit further includes: an accumulation processing circuit configured to perform an accumulation operation on the product result to obtain the intermediate result.
  • the embodiment of the present application also relates to another combined computing device, where the combined computing device includes M computing devices according to the first embodiment, connected to one another; the value of M is an integer greater than or equal to 2.
  • FIG. 1-1g is a schematic structural diagram of a combination computing device provided by an embodiment of the present application.
  • the combined computing device includes four computing devices as shown in FIG. 1-1b. The four computing devices are connected to one another: they can be bridged through circuits, connected via a dedicated connection module, or connected through the master chips in the four computing devices.
  • This connection structure can, on the one hand, improve data training efficiency through the cooperative operation of multiple chips, and on the other hand allows the master chip to schedule the operation results of each slave chip, so that only the performance of the master chip needs to be improved rather than that of the slave chips, which saves cost.
  • selecting one main chip from multiple groups of main chips to connect with an external main chip reduces the wear on the main chip and extends its service life.
  • the connections between the M computing devices as in the first embodiment include: each of the M computing devices as in the first embodiment includes X groups of neural network chips, and the main chip of one group of neural network chips is used to connect with the main chip of one of the X groups of neural network chips in another computing device.
  • each of the four computing devices as in the first embodiment includes four groups of neural network chips, and the main chip of one group of neural network chips is connected to the main chip of one of the four groups of neural network chips in another computing device; for example, the main chip 502, the main chip 507, the main chip 512, and the main chip 510 are connected.
  • when selecting the master chip in one of the X groups of neural network chips, it can be selected randomly or using a selection strategy, such as selecting the master chip with the most slave chips, or selecting the chip that is physically closest to the other computing devices.
  • multiple groups of neural network chips are each divided into a master chip and slave chips; the master chip obtains the operation results of its slave chips, and the calculation results are scheduled between the master chips of different groups, so that the master chip of each group holds all the calculation results; each master chip then distributes all the calculation results to its slave chips, which improves the training speed of the neural network chips and saves training time.
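The gather–exchange–distribute scheduling just described can be illustrated with a small sketch. All names and data structures below are hypothetical; the sketch only mirrors the three steps: each master gathers its slaves' results, the masters exchange results so every master holds the full set, and each master redistributes the full set to its slaves.

```python
# Illustrative sketch (hypothetical structure) of result scheduling across
# groups of neural network chips, each group having one master chip.
def schedule_results(groups):
    # groups: {master_id: {slave_id: partial_result}}
    # 1. each master gathers its slaves' partial results
    gathered = {m: list(slaves.values()) for m, slaves in groups.items()}
    # 2. masters exchange results so each one holds every group's results
    all_results = []
    for m in sorted(gathered):
        all_results.extend(gathered[m])
    # 3. each master distributes the complete result set to its slaves
    return {m: {s: list(all_results) for s in slaves}
            for m, slaves in groups.items()}

out = schedule_results({"m1": {"s1": 1, "s2": 2}, "m2": {"s3": 3}})
# every slave now holds all three partial results: [1, 2, 3]
```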
  • FIG. 1-2 is a schematic diagram of a combined processing device.
  • other processing devices include one or more types of processors, such as a central processing unit CPU, a graphics processor GPU, and a neural network processor.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as the interface between the computing device and external data and control, including data handling, to complete basic control of the computing device such as start and stop; other processing devices can also cooperate with the computing device to complete computing tasks.
  • a universal interconnection interface for transmitting data and control instructions between the computing device and other processing devices.
  • the computing device obtains required input data from other processing devices and writes it to a storage device on the computing device chip; it can obtain control instructions from other processing devices and write them to a control cache on the computing device chip; it can also read data from the computing device's on-chip storage module and transmit it to other processing devices.
  • the structure is shown in FIG. 1-3, and may further include a storage device, and the storage device is connected to the computing device and the other processing devices, respectively.
  • the storage device is used to store data in the computing device and the other processing devices, and is particularly suitable for data that cannot be completely stored in the internal storage of the computing device or other processing devices.
  • the combined processing device can be used as an SOC system-on-chip for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control section, increasing processing speed, and reducing overall power consumption.
  • the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, monitor, mouse, keyboard, network card, or WiFi interface.
  • a chip is also applied, which includes the above computing device or combined processing device.
  • a chip packaging structure is applied, which includes the above chip.
  • a board card is applied, which includes the chip package structure described above. Referring to FIG. 1-3a, FIG. 1-3a provides a board card. In addition to the above chip 389, the board card may also include other supporting components, which include, but are not limited to, a storage device 390, an interface device 391, and a control device 392;
  • the memory device 390 is connected to a chip in the chip package structure through a bus, and is used to store data.
  • the memory device may include a plurality of groups of storage units 393. Each group of storage units is connected to the chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (English: Double Data Rate SDRAM, double data rate synchronous dynamic random access memory).
  • the storage device may include 4 groups of storage units. Each group of storage units may include a plurality of DDR4 chips. In one embodiment, the chip may include four 72-bit DDR4 controllers; of those 72 bits, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
  • each group of the storage units includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transfer data twice in one clock cycle.
  • a controller for controlling DDR is provided in the chip, and is used for controlling data transmission and data storage of each of the storage units.
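The 25600 MB/s figure quoted above follows directly from the interface parameters: DDR4-3200 performs 3200 million transfers per second, and the 64 data bits of the 72-bit interface move 8 bytes per transfer.

```python
# Checking the quoted theoretical bandwidth of a DDR4-3200 interface.
transfers_per_second_millions = 3200  # DDR4-3200: 3200 MT/s
data_bits = 64                        # 64 of the 72 bits carry data (8 for ECC)
bytes_per_transfer = data_bits // 8   # 8 bytes per transfer
bandwidth_mb_s = transfers_per_second_millions * bytes_per_transfer
print(bandwidth_mb_s)  # 25600
```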
  • the interface device is electrically connected to a chip in the chip package structure.
  • the interface device is used to implement data transmission between the chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to implement data transfer.
  • the interface device may also be other interfaces.
  • the present application does not limit the specific expressions of the other interfaces described above, and the interface unit can implement the transfer function.
  • the operation result of the chip is still transmitted by the interface device to an external device (such as a server).
  • the control device is electrically connected to the chip.
  • the control device is configured to monitor a state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a microcontroller (Micro Controller Unit).
  • the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, which may drive multiple loads. Therefore, the chip can be in different working states such as heavy load and light load.
  • the control device can regulate the working states of multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
  • neural networks are the basis of many current artificial intelligence applications. With the further expansion of the application scope of neural networks, many neural network models and large batches of requests have appeared.
  • the calculation of the neural network can be performed in parallel using a heterogeneous computing carrier. Therefore, how to improve the data transmission efficiency between heterogeneous computing devices is a technical problem to be solved by those skilled in the art.
  • the computing device may include various handheld devices with wireless communication functions, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem, and various forms of user equipment (UE), mobile stations (MS), terminal devices, etc.; the computing device may also include a system-on-chip (SOC).
  • the computing carrier may be a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a coarse-grained reconfigurable array (CGRA), a digital signal processor (DSP), etc.
  • the embodiments of the present application provide a data transmission method and related products, which can improve the data transmission efficiency between different computing carriers and facilitate the improvement of the neural network operation efficiency.
  • the present application will be further described in detail below with reference to specific embodiments and with reference to the drawings.
  • FIG. 2-1 is a schematic structural diagram of a computing device according to an embodiment of the present application.
  • the computing device 100 includes a plurality of computing carriers such as a first computing carrier 101, a second computing carrier 102, and an N-th computing carrier 103.
  • N is a positive integer greater than 2
  • the multiple computing carriers may include at least two of the above-mentioned CPUs, GPUs, ASICs, FPGAs, CGRAs, or DSPs, and may also include two or more carriers of the same type, for example, 2 CPUs, 2 GPUs, 1 ASIC, or 1 FPGA.
  • each computing carrier may include at least one computing unit for a neural network operation, such as a processing chip and the like.
  • the specific structure of the computing unit is not limited.
  • FIG. 2-1a is a schematic structural diagram of a computing unit.
  • the calculation unit includes: a main processing circuit, a basic processing circuit, and a branch processing circuit. Specifically, the main processing circuit is connected to the branch processing circuit, and the branch processing circuit is connected to at least one basic processing circuit.
  • the branch processing circuit is used to send and receive data from the main processing circuit or the basic processing circuit.
  • FIG. 2-1b is a schematic structural diagram of a main processing circuit.
  • the main processing circuit may include a register and / or an on-chip buffer circuit.
  • the main processing circuit may further include a control circuit, a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, a DMA (Direct Memory Access) circuit, and other circuits; of course, in actual applications, the above main processing circuit may also include a conversion circuit (such as a matrix transposition circuit), a data rearrangement circuit, an activation circuit, and the like.
  • the main processing circuit also includes a data sending circuit, a data receiving circuit, or an interface.
  • the data sending circuit can integrate a data distribution circuit and a data broadcasting circuit.
  • the data distribution circuit and the data broadcasting circuit can also be set separately; in actual applications
  • the above-mentioned data transmitting circuit and data receiving circuit may also be integrated together to form a data transmitting and receiving circuit.
  • broadcast data that is, data that needs to be sent to each basic processing circuit.
  • the specific selection method can be specifically determined by the main processing circuit according to the load and the calculation method.
  • the broadcast transmission method is to broadcast data to each basic processing circuit in a broadcast form.
  • broadcast data can be sent to each basic processing circuit by one broadcast or by multiple broadcasts.
  • the specific implementation of this application does not limit the number of broadcasts mentioned above; the distribution and transmission method is to selectively send the distribution data to some basic processing circuits.
  • when distributing, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits (the data may be the same or different); specifically, if the data is sent in a distributed manner, the data received by each basic processing circuit can be different, and of course the data received by some basic processing circuits can also be the same;
  • when broadcasting, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits, and each basic processing circuit that receives the data receives the same data; that is, the broadcast data includes the data that all basic processing circuits need to receive.
  • The distribution data may include the part of the data that a given basic processing circuit needs to receive.
  • the main processing circuit may send the broadcast data to all the branch processing circuits through one or more broadcasts, and the branch processing circuits forward the broadcast data to all the basic processing circuits.
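The contrast between the two transmission modes can be sketched briefly: broadcast sends the same data to every basic processing circuit, while distribution sends each one its own portion. The functions and the interleaved slicing scheme below are hypothetical illustrations, not the application's actual layout.

```python
# Minimal sketch contrasting the two transmission modes described above.
def broadcast(data, num_circuits):
    # broadcast: every basic processing circuit receives the full data
    return [list(data) for _ in range(num_circuits)]

def distribute(data, num_circuits):
    # distribution: each basic processing circuit receives its own slice
    # (an interleaved split is used here purely for illustration)
    return [data[i::num_circuits] for i in range(num_circuits)]

b = broadcast([1, 2, 3, 4], 2)   # [[1, 2, 3, 4], [1, 2, 3, 4]]
d = distribute([1, 2, 3, 4], 2)  # [[1, 3], [2, 4]]
```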
  • the vector operator circuit of the main processing circuit described above can perform vector operations, including but not limited to: addition, subtraction, multiplication, and division of two vectors, addition and subtraction of vectors and constants, or operations on each element of a vector Perform arbitrary operations.
  • these operations may specifically be addition or subtraction between vectors and constants, multiplication, division operations, activation operations, accumulation operations, and the like.
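A few of the vector operations listed above can be written out in plain Python as a sketch (not the circuit's actual implementation; function names are assumptions):

```python
# Illustrative versions of vector operations the vector operator circuit
# is described as supporting.
def vec_add(a, b):
    # element-wise addition of two vectors
    return [x + y for x, y in zip(a, b)]

def vec_mul(a, b):
    # element-wise multiplication of two vectors
    return [x * y for x, y in zip(a, b)]

def vec_add_const(a, c):
    # addition of a vector and a constant
    return [x + c for x in a]

assert vec_add([1, 2], [3, 4]) == [4, 6]
assert vec_mul([1, 2], [3, 4]) == [3, 8]
assert vec_add_const([1, 2], 10) == [11, 12]
```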
  • Each basic processing circuit may include a basic register and / or a basic on-chip cache circuit; each basic processing circuit may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like.
  • the inner product operator circuit, the vector operator circuit, and the accumulator circuit may all be integrated circuits, and the inner product operator circuit, the vector operator circuit, and the accumulator circuit may also be separately provided circuits.
  • connection structure of the branch processing circuit and the basic circuit may be arbitrary, and is not limited to the H-shaped structure in FIG. 2-1b.
  • the main processing circuit to the basic circuit is a broadcast or distribution structure
  • the basic circuit to the main processing circuit is a gather structure. Broadcasting, distribution and collection are defined as follows:
  • the data transmission mode from the main processing circuit to the basic circuit may include:
  • the main processing circuit is respectively connected to a plurality of branch processing circuits, and each branch processing circuit is respectively connected to a plurality of basic circuits.
  • the main processing circuit is connected to a branch processing circuit, and the branch processing circuit is further connected to a branch processing circuit, and so on, a plurality of branch processing circuits are connected in series, and then each branch processing circuit is respectively connected to a plurality of basic circuits.
  • the main processing circuit is respectively connected to a plurality of branch processing circuits, and each branch processing circuit is further connected in series with a plurality of basic circuits.
  • the main processing circuit is connected to a branch processing circuit.
  • the branch processing circuit is further connected to a branch processing circuit, and so on, and a plurality of branch processing circuits are connected in series. Then, each branch processing circuit is connected in series to a plurality of basic circuits.
  • when distributing data, the main processing circuit transmits data to some or all of the basic circuits, and the data received by each basic circuit that receives data may be different;
  • when broadcasting data, the main processing circuit transmits data to some or all of the basic circuits, and each basic circuit that receives data receives the same data.
  • the computing unit shown in Figure 2-1a may be a separate physical chip; of course, in practical applications, the computing unit may also be integrated in other chips (such as a CPU or GPU). The specific implementation manner of this application does not limit the physical form of the chip device.
  • Figure 2-1c is a schematic diagram of data distribution of a computing unit, as shown by the arrow in Figure 2-1c, and this arrow is the data distribution direction.
  • after the main processing circuit receives external data, the external data is split and distributed to multiple branch processing circuits, and the branch processing circuits send the split data to the basic processing circuits.
  • Figure 2-1d is a schematic diagram of data return of a computing unit; as shown by the arrow in Figure 2-1d, the arrow is the direction of data return. As shown in Figure 2-1d, the basic processing circuit returns data (such as the result of an inner product operation) to the branch processing circuit, and the branch processing circuit returns it to the main processing circuit.
  • the specific data may be vector, matrix, multi-dimensional (three-dimensional or four-dimensional or more) data, and for a specific value of the input data, it may be called an element of the input data.
  • the embodiment of the present disclosure also provides a calculation method of a calculation unit shown in FIG. 2-1a.
  • the calculation method is applied to the calculation of a neural network.
  • the calculation unit may be used to perform operations on the input data and weight data of one layer or multiple layers of a neural network.
  • the calculation unit is configured to perform an operation on one or more input data and weight data of the trained multi-layer neural network
  • the calculation unit is configured to perform an operation on one or more layers of input data and weight data in a multi-layer neural network in a forward operation.
  • the above operations include, but are not limited to, one or any combination of convolution operations, matrix multiplication matrix operations, matrix multiplication vector operations, offset operations, fully connected operations, GEMM operations, GEMV operations, and activation operations.
  • GEMM calculation refers to the matrix-matrix multiplication operation in the BLAS library.
  • GEMV calculation refers to the matrix-vector multiplication operation in the BLAS library.
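The two BLAS operations named above can be stated as reference definitions in plain Python. Note that the full BLAS routines also scale by alpha/beta coefficients; this sketch shows only the core matrix products.

```python
# Reference definitions of the BLAS operations named above (core products only).
def gemv(A, x):
    # GEMV: matrix-vector multiply, y[i] = sum_j A[i][j] * x[j]
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def gemm(A, B):
    # GEMM: matrix-matrix multiply, C[i][j] = sum_k A[i][k] * B[k][j]
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in A]

A = [[1, 2], [3, 4]]
assert gemv(A, [1, 1]) == [3, 7]
assert gemm(A, [[1, 0], [0, 1]]) == A  # multiplying by identity returns A
```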
  • This application does not limit the connection relationship between computing carriers in a computing device, and may be a homogeneous or heterogeneous computing carrier. It also does not limit the connection relationship between computing units in a computing carrier.
  • the computing unit executes parallel tasks, which can improve computing efficiency.
  • each computing carrier further includes at least one on-chip cache circuit and one off-chip cache circuit.
  • the first computing carrier 101 includes a first on-chip cache circuit 1011 and a first off-chip cache circuit 1012.
  • the second computing carrier 102 includes a second on-chip cache circuit 1021 and a second off-chip cache circuit 1022.
  • the N-th computing carrier 103 includes an N-th on-chip cache circuit 1031 and an N-th off-chip cache circuit 1032.
  • the on-chip cache circuit may include on-chip memory, including but not limited to double data rate memory (DDR), dynamic random access memory (DRAM), 3D DRAM, 3D SRAM, and other forms; the off-chip cache circuit may be off-chip memory, including but not limited to shared memory, cache, and so on.
  • the cache may include a multilayer structure, such as an N-layer cache structure, including L1 Cache, L2 Cache, ..., LN Cache.
  • the computing device 100 further includes an on-chip storage data path control circuit 110 connected to each on-chip cache circuit, and an on-chip storage data path 121 connected to the on-chip storage data path control circuit 110, wherein: the on-chip storage data path control circuit 110 is configured to receive a data transmission instruction sent by the first on-chip cache circuit 1011 of the first computing carrier 101 among the plurality of computing carriers, and to decode the data transmission instruction to obtain a sending data address and a receiving data address; the on-chip cache circuit data path 121 is configured to obtain target data according to the sending data address and transmit the target data to the receiving data address.
  • the first computing carrier 101 is any one of a plurality of computing carriers, and the data transmission instruction is a binary file.
  • a data transmission instruction is decoded to obtain a sending data address and a receiving data address, and parameters such as a data capacity and a data identifier for determining target data can also be obtained.
  • the sending data address is an address where the target data is stored in the first on-chip cache circuit
  • the receiving data address is an address in the second on-chip cache circuit 1021 of the second computing carrier 102 of the plurality of computing carriers; that is, the data transmission instruction instructs the on-chip storage data path control circuit 110 to transfer the target data buffered in the first on-chip cache circuit 1011 to the second on-chip cache circuit 1021, meaning that it is determined in advance that the computing carrier to which the first computing carrier 101 transmits data is the second computing carrier 102.
  • when the on-chip storage data path control circuit 110 receives a data transmission instruction sent by the first on-chip cache circuit 1011, it decodes the data transmission instruction to obtain a sending data address and a receiving data address.
  • the on-chip cache circuit data path 121 obtains the target data corresponding to the sending data address and transmits the target data to the receiving data address, and the second on-chip cache circuit 1021 caches the target data, thereby completing data transmission between the on-chip cache circuits of the two computing carriers.
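The application describes the data transmission instruction as a binary file that decodes into a sending data address, a receiving data address, and optionally parameters such as a data capacity. The encoding below is entirely hypothetical (the application fixes no instruction format); it only illustrates what such a decode step might look like.

```python
# Hypothetical instruction format, for illustration only: a 64-bit word
# holding a 24-bit sending address, a 24-bit receiving address, and a
# 16-bit data capacity.
def encode(send_addr, recv_addr, capacity):
    return (send_addr << 40) | (recv_addr << 16) | capacity

def decode(instruction):
    # extract the three fields from the packed word
    send_addr = (instruction >> 40) & 0xFFFFFF
    recv_addr = (instruction >> 16) & 0xFFFFFF
    capacity = instruction & 0xFFFF
    return send_addr, recv_addr, capacity

word = encode(0x1000, 0x2000, 256)
assert decode(word) == (0x1000, 0x2000, 256)
```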
  • the on-chip storage data path control circuit 110 may receive multiple data transmission instructions at the same time; therefore, it is necessary to determine the execution order of the data transmission instructions. This application does not limit how the execution order is determined.
  • for example, the priority corresponding to each data transmission instruction can be obtained to yield multiple priorities, and the execution order of each instruction among the multiple data transmission instructions is determined according to the multiple priorities; the execution order is not limited to being determined by the priorities corresponding to the data transmission instructions.
  • the priority can be obtained from multiple dimensions such as the quantity and capacity of the target data, the priority of the target data, or the priority and remaining memory size of the first on-chip cache circuit.
  • the on-chip storage data path control circuit 110 determines the execution order between the data transmission instructions, and controls the on-chip cache circuit data path 121 to perform data transmission according to the execution order, which can improve the stability of the transmission.
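One simple way to realize the ordering just described is a priority queue over pending instructions, with ties broken by arrival order. This is a sketch under assumed data structures, not the application's arbitration scheme.

```python
import heapq

# Sketch of request arbitration: pending data transmission instructions are
# ordered by a priority (derived from, e.g., target-data capacity) and
# executed highest-priority first; ties go to the earlier arrival.
def execution_order(instructions):
    # instructions: list of (priority, name); higher priority runs first
    heap = [(-priority, arrival, name)
            for arrival, (priority, name) in enumerate(instructions)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order

order = execution_order([(1, "A"), (3, "B"), (2, "C")])  # ["B", "C", "A"]
```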
  • the on-chip storage data path control circuit 110 includes an instruction cache unit 1101, an instruction decoding unit 1102 connected to the instruction cache unit 1101, and A memory management unit 1103 connected to the instruction cache unit 1101 and the instruction decoding unit 1102, where:
  • the instruction buffer unit 1101 is configured to buffer the data transmission instruction
  • the instruction decoding unit 1102 is configured to decode the data transmission instruction to obtain the sending data address and the receiving data address;
  • the memory management unit 1103 is configured to manage the data transmission instruction.
  • the on-chip storage data path control circuit 110 is further divided into an instruction cache unit 1101, an instruction decoding unit 1102, and a memory management unit 1103, which execute the corresponding steps respectively; that is, the data transmission instruction is managed by the memory management unit 1103 and, when executed, is called directly from the instruction cache unit 1101, and is decoded by the instruction decoding unit 1102 to complete the data transmission, thereby improving execution efficiency and stability.
  • the memory management unit 1103 includes an address mapping module 11031, a request arbitration module 11032, and a consistency control module 11033, where:
  • the address mapping module 11031 is configured to determine the second on-chip cache circuit corresponding to the received data address
  • the request arbitration module 11032 is configured to allocate an execution order of each data transmission instruction in the plurality of data transmission instructions if the instruction cache unit includes a plurality of the data transmission instructions;
  • the consistency control module 11033 is configured to ensure consistency of data transmission.
  • the memory management unit 1103 is further divided into the address mapping module 11031, the request arbitration module 11032, and the consistency control module 11033, which perform the corresponding steps respectively; that is, the address mapping module 11031 is used to determine the second on-chip cache circuit in which the target data is to be cached.
  • the request arbitration module 11032 determines the execution order of each data transmission instruction, and controls the on-chip cache circuit data path 121 for data transmission according to the transmission order, which can improve the stability of the transmission.
  • the consistency control module 11033 ensures the consistency of data transmission, which improves the stability of the transmission and the security of execution.
  • the computing device 100 further includes a peripheral component interconnect express (PCIE) bus data path 122 connected to each off-chip cache circuit, for implementing data transmission between the off-chip cache circuits of any two computing carriers among the multiple computing carriers.
  • the off-chip storage data of the various computing carriers can be exchanged directly through the PCIE data path 122; that is, the off-chip cached data is exchanged through the dedicated off-chip storage data path 122 to support larger-scale machine learning operations. It can also be connected to various types of servers through the PCIE interface, which improves transmission efficiency.
  • FIG. 2-2 is a schematic flowchart of a data transmission method proposed by this application.
  • the data transmission method is applied to the computing device shown in FIG. 2-1; that is, the computing device includes multiple computing carriers, and an on-chip storage data path control circuit connected to the on-chip cache circuit of each computing carrier among the multiple computing carriers.
  • S201 Receive a data transmission instruction sent by a first on-chip cache circuit of a first computing carrier in a plurality of computing carriers through an on-chip storage data path control circuit.
  • S202 Decode the data transmission instruction through the on-chip storage data path control circuit to obtain a sending data address and a receiving data address.
  • S203 Obtain target data according to the sending data address through an on-chip buffer circuit data path, and transmit the target data to the receiving data address.
  • the received data address is an address in a second on-chip cache circuit of a second computing carrier of the plurality of computing carriers.
  • the on-chip storage data path control circuit receives the data transmission instruction sent by the first on-chip cache circuit of the first computing carrier among the plurality of computing carriers, and then decodes the data transmission instruction to obtain a sending data address and a receiving data address; the on-chip cache circuit data path obtains target data according to the sending data address and transmits the target data to the receiving data address. In this way, the data transmission efficiency between different computing carriers can be improved, which facilitates improving the operation efficiency of the neural network.
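Steps S201 through S203 can be put together in one end-to-end sketch: receive an instruction, decode it into a sending and a receiving address, then copy the target data from the first on-chip cache to the second. The dictionary-based caches and the instruction layout are hypothetical stand-ins for the hardware structures.

```python
# End-to-end sketch of steps S201-S203 under hypothetical data structures:
# caches are modeled as address -> data dictionaries.
def transfer(instruction, first_cache, second_cache):
    # S201/S202: receive the instruction and decode it into a sending
    # data address and a receiving data address (hypothetical format)
    send_addr, recv_addr = instruction["send"], instruction["recv"]
    # S203: the data path reads the target data at the sending address and
    # writes it to the receiving address in the second on-chip cache circuit
    second_cache[recv_addr] = first_cache[send_addr]
    return second_cache

first = {0x10: [1.0, 2.0]}
second = {}
transfer({"send": 0x10, "recv": 0x20}, first, second)
# second now holds the target data at the receiving address 0x20
```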
  • the on-chip storage data path control circuit includes an instruction cache unit, an instruction decoding unit connected to the instruction cache unit, and a memory management unit connected to the instruction cache unit and the instruction decoding unit; decoding the data transmission instruction through the on-chip storage data path control circuit to obtain a sending data address and a receiving data address includes:
  • Decoding the data transmission instruction by the instruction decoding unit to obtain the sending data address and the receiving data address;
  • the method further includes:
  • the data transmission instruction is managed by the memory management unit.
  • the memory management unit includes an address mapping module, a request arbitration module, and a consistency control module, and the managing the data transmission instruction by the memory management unit includes:
  • the instruction buffer unit includes a plurality of the data transmission instructions, determining an execution order of each data transmission instruction in the plurality of data transmission instructions through the request arbitration module;
  • the consistency control module ensures data transmission consistency.
  • the computing device further includes a peripheral component interconnect express (PCIE) bus data path
  • the method further includes:
  • the multiple computing carriers include at least two of a central processing unit CPU, a graphics processor GPU, an application-specific integrated circuit ASIC, a field-programmable gate array FPGA, a coarse-grained reconfigurable array CGRA, or a digital signal processor DSP.
  • the calculation carrier includes at least one calculation unit.
  • the calculation unit includes: a main processing circuit, a branch processing circuit, and a basic processing circuit.
  • the main processing circuit is connected to the branch processing circuit.
  • the basic processing circuit is connected to the branch processing circuit, and the method further includes:
  • the sending the broadcast data to all the branch processing circuits in a broadcast manner by using the main processing circuit includes:
  • the operation performed by the basic processing circuit on the broadcast data and the distribution data to obtain an operation result includes:
  • the basic processing circuit performs an inner product operation, a product operation, or a vector operation on the broadcast data and the distribution data to obtain an operation result.
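As a minimal sketch of the broadcast/distribution scheme described in the bullets above, the following simulates a matrix multiplication in which the main processing circuit broadcasts one operand to every branch, distributes column slices of the other operand to the basic processing circuits, and reassembles their inner-product results. The function names and the column-interleaved distribution are illustrative assumptions, not the patented circuit layout.

```python
def multiply(matrix_a, matrix_b, n_basic=2):
    """Main processing circuit: broadcast matrix_a, distribute columns of
    matrix_b across n_basic basic processing circuits, gather the results."""
    cols = list(zip(*matrix_b))                          # columns of B
    chunks = [cols[i::n_basic] for i in range(n_basic)]  # distribution data

    def basic_circuit(broadcast_rows, my_cols):
        # inner-product operation on broadcast data and distribution data
        return [[sum(a * b for a, b in zip(row, col)) for col in my_cols]
                for row in broadcast_rows]

    partials = [basic_circuit(matrix_a, chunk) for chunk in chunks]

    # main circuit reassembles: circuit i computed columns i, i + n_basic, ...
    n_rows, n_cols = len(matrix_a), len(cols)
    result = [[0] * n_cols for _ in range(n_rows)]
    for i, partial in enumerate(partials):
        for k, col_idx in enumerate(range(i, n_cols, n_basic)):
            for r in range(n_rows):
                result[r][col_idx] = partial[r][k]
    return result
```

For example, `multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`, the ordinary matrix product.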
  • a computing device including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the processor to implement the manner described in the data transmission method.
  • a computer-readable storage medium stores a computer program.
  • the computer program includes program instructions; when the program instructions are executed by a processor, they cause the processor to execute the implementation described in the data transmission method.
  • the present application also discloses a combined processing device, which includes the above-mentioned computing device, a universal interconnection interface, and other processing devices.
  • the machine learning computing device interacts with other processing devices to jointly complete the operation specified by the user.
  • Figure 2-3 is a schematic structural diagram of a combined processing device.
  • Other processing devices include one or more types of processors such as a central processing unit CPU, a graphics processor GPU, and a neural network processor.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as the interface between the machine learning computing device and external data and control, including data handling, and complete basic control of the machine learning computing device, such as starting and stopping; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
  • a universal interconnection interface for transmitting data and control instructions between the machine learning computing device and other processing devices.
  • the machine learning computing device obtains required input data from other processing devices and writes it to the on-chip storage device of the machine learning computing device; it can obtain control instructions from other processing devices and write them to the on-chip control cache of the machine learning computing device;
  • the data in the storage module of the machine learning computing device can be read and transmitted to other processing devices.
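The data and control flow across the universal interconnection interface described in the bullets above can be sketched as a pair of toy classes. All names here (`HostProcessor`, `MLComputingDevice`, the attribute names) are assumptions introduced only for illustration.

```python
class HostProcessor:
    """Stand-in for an 'other processing device' (e.g. a CPU)."""
    def __init__(self):
        self.memory = {"input0": [1, 2, 3]}
        self.instructions = ["START"]

class MLComputingDevice:
    """Stand-in for the machine learning computing device."""
    def __init__(self):
        self.on_chip_storage = {}  # on-chip storage device
        self.control_cache = []    # on-chip control cache

    def load_input(self, host, key):
        # obtain required input data and write it to on-chip storage
        self.on_chip_storage[key] = host.memory[key]

    def load_control(self, host):
        # obtain control instructions and write them to the control cache
        self.control_cache.extend(host.instructions)

    def read_back(self, key):
        # data in the storage module can be read out to other devices
        return self.on_chip_storage[key]
```

The three methods mirror the three flows named above: input data in, control instructions in, and computed data back out.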
  • the combined processing device shown in FIG. 2-3 may further include a storage device, and the storage device is connected to the machine learning computing device and the other processing devices, respectively.
  • the storage device is configured to store data of the machine learning computing device and the other processing devices, and is particularly suitable for data that cannot be stored completely in the internal storage of the machine learning computing device or the other processing devices.
  • the combined processing device can be used as an SOC (system-on-chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
  • the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a monitor, a mouse, a keyboard, a network card, or a WiFi interface.
  • a chip is also provided, which includes the above-mentioned machine learning computing device or combined processing device.
  • a chip packaging structure is provided, which includes the above chip.
  • a board card is provided, which includes the chip package structure described above.
  • FIG. 2-4 provides a board card.
  • the board card may also include other supporting components.
  • the supporting components include, but are not limited to, a storage device, an interface device, and a control device.
  • the memory device is connected to a chip in the chip package structure through a bus, and is used to store data.
  • the memory device may include a plurality of groups of storage units, each group connected to the chip through a bus. It can be understood that each group of storage units may be a double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM).
  • the storage device may include 4 groups of storage units, and each group may include a plurality of DDR4 granules (chips). In one embodiment, the chip may include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking. It can be understood that when DDR4-3200 granules are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
  • each group of the storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transfer data twice in one clock cycle.
  • a controller for controlling DDR is provided in the chip, and is used for controlling data transmission and data storage of each of the storage units.
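The theoretical-bandwidth figure quoted above follows from simple arithmetic: a DDR4-3200 part delivers 3200 mega-transfers per second, and with the 64 data bits of a 72-bit (64 data + 8 ECC) controller this yields 3200 × 64 / 8 = 25600 MB/s. A one-line check (the function name is illustrative):

```python
def ddr_theoretical_bandwidth_mb_s(transfer_rate_mt_s, data_bits):
    """Theoretical bandwidth in MB/s = mega-transfers per second x data bits / 8."""
    return transfer_rate_mt_s * data_bits // 8

# DDR4-3200 with 64 data bits (72-bit controller minus 8 ECC bits)
assert ddr_theoretical_bandwidth_mb_s(3200, 64) == 25600
```

Note that the ECC bits do not contribute to usable bandwidth, which is why 64 rather than 72 bits enters the calculation.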
  • the interface device is electrically connected to a chip in the chip package structure.
  • the interface device is used to implement data transmission between the chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to implement data transfer.
  • the interface device may also be other interfaces.
  • the present application does not limit the specific form of the other interfaces described above, as long as the interface unit can implement the transfer function.
  • the operation result of the chip is likewise transmitted by the interface device to an external device (such as a server).
  • the control device is electrically connected to the chip.
  • the control device is configured to monitor a state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a microcontroller (Micro Controller Unit, MCU).
  • the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and may drive multiple loads; therefore, the chip can be in different working states such as heavy load and light load.
  • the control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
  • an electronic device which includes the board card described above.
  • Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, camcorders, projectors, watches, headphones, mobile storage, wearable devices, vehicles, home appliances, and/or medical devices.
  • the vehicles include airplanes, ships, and/or automobiles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical functional division.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or in the form of software program modules.
  • when the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory.
  • the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product, which is stored in a memory.
  • a computer device which may be a personal computer, a server, or a network device, etc.
  • the foregoing memories include media that can store program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), removable hard disks, magnetic disks, or optical disks.
  • the program may be stored in a computer-readable memory, and the memory may include a flash disk, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

Abstract

A computing apparatus and a related product. The computing apparatus comprises X groups of neural network chips; each group comprises one master chip and at least one slave chip, the master chip is connected to the slave chips, the master chips of the X groups are connected to one another, and X is an integer greater than or equal to 2. The computing apparatus divides the multiple groups of neural network chips into master chips and slave chips and then performs data scheduling among the chips according to the connection relationship of the master chips, thereby improving the training speed of the neural network chips and reducing training time.

Description

Computing apparatus and related product

Technical Field

The present application relates to the field of information processing technology, and in particular to a computing apparatus and related product.

Background

An artificial neural network (ANN) has been a research hotspot in the field of artificial intelligence since the 1980s. It abstracts the neuron network of the human brain from the perspective of information processing, establishes a simple model, and forms different networks according to different connection schemes. In engineering and academia it is often referred to simply as a neural network. A neural network is a computing model composed of a large number of interconnected nodes (also called neurons).

Existing neural network operations are implemented on a CPU (Central Processing Unit) or GPU (Graphics Processing Unit); existing training devices are slow to train and time-consuming.
Summary of the Invention

The embodiments of the present application provide a computing apparatus and related products, which can increase the training speed of a training device and improve efficiency.

In a first aspect, a computing apparatus is provided. The computing apparatus includes:
X groups of neural network chips, where each group of the X groups of neural network chips includes one master chip and at least one slave chip, the master chip is connected to the slave chips, the master chips of the X groups of neural network chips are connected to one another, and X is an integer greater than or equal to 2;

each neural network chip in the X groups of neural network chips is configured to obtain input data and weights, and to operate on the weights and the input data corresponding to that chip to obtain an operation result, where the input data obtained by each neural network chip is different and the obtained weights are the same;

a first master chip in a first group of the X groups of neural network chips is configured to receive the operation results of the slave chips connected to the first master chip;

the first master chip is configured to share the operation result of the first master chip and the received operation results of the slave chips with the master chips of the other groups of neural network chips, and to receive the operation results shared by the master chips of the other groups.
In a second aspect, a neural network chip is provided. The neural network chip includes an arithmetic unit and a controller unit; the arithmetic unit includes one master processing circuit and a plurality of slave processing circuits;

the controller unit is configured to obtain input data and a calculation instruction;

the controller unit is further configured to parse the calculation instruction into a plurality of operation instructions and to send the plurality of operation instructions and the input data to the master processing circuit;

the master processing circuit is configured to perform pre-processing on the input data and to transmit data and operation instructions to and from the plurality of slave processing circuits;

the plurality of slave processing circuits are configured to perform intermediate operations in parallel according to the data and operation instructions transmitted from the master processing circuit to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the master processing circuit;

the master processing circuit is configured to perform subsequent processing on the plurality of intermediate results to obtain the operation result of the calculation instruction.
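The division of labour described in this aspect — the controller parses a calculation instruction, the master processing circuit pre-processes and distributes the data, the slave processing circuits compute intermediate results in parallel, and the master combines them — can be sketched for a dot product. The function name and the interleaved split across slaves are illustrative assumptions.

```python
def run_compute_instruction(input_data, weights, n_slaves=4):
    """Sketch: execute a dot-product calculation instruction on one chip.
    Master pre-processes and distributes; slaves produce intermediate results;
    master performs the subsequent (combining) processing."""
    # master: pre-processing (pairing inputs with weights) and splitting
    pairs = list(zip(input_data, weights))
    shards = [pairs[i::n_slaves] for i in range(n_slaves)]

    # slaves: each computes a partial sum (an intermediate result) in parallel
    intermediates = [sum(x * w for x, w in shard) for shard in shards]

    # master: subsequent processing combines the intermediate results
    return sum(intermediates)
```

For example, `run_compute_instruction([1, 2, 3, 4], [10, 20, 30, 40])` returns `300`, the dot product of the two vectors.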
In a third aspect, a combined computing apparatus is provided. The combined computing apparatus includes M computing apparatuses according to the first aspect, connected to one another, where M is an integer greater than or equal to 2.

In a fourth aspect, a calculation method for executing a machine learning model is provided, the calculation method being applied to the computing apparatus according to the first aspect.

In a fifth aspect, a calculation method for executing a machine learning model is provided, the calculation method being applied to the combined computing apparatus according to the third aspect.
In a sixth aspect, an embodiment of the present application provides a computing apparatus. The computing apparatus includes multiple computing carriers, an on-chip storage data path control circuit connected to the on-chip cache circuit of each of the multiple computing carriers, and an on-chip storage data path connected to the on-chip storage data path control circuit, wherein:

the on-chip storage data path control circuit is configured to receive a data transmission instruction sent by a first on-chip cache circuit of a first computing carrier among the multiple computing carriers, and to decode the data transmission instruction to obtain a sending data address and a receiving data address;

the on-chip cache circuit data path is configured to obtain target data according to the sending data address and to transmit the target data to the receiving data address, the receiving data address being an address in a second on-chip cache circuit of a second computing carrier among the multiple computing carriers.
In a seventh aspect, an embodiment of the present application provides a combined processing device. The combined processing device includes the computing apparatus described in the sixth aspect, a universal interconnection interface, and other processing devices;

the computing apparatus interacts with the other processing devices to jointly complete a computing operation specified by the user.

In an eighth aspect, an embodiment of the present application provides a system-on-chip, including the computing apparatus described in the sixth aspect or the combined processing device described in the seventh aspect.
In a ninth aspect, an embodiment of the present application provides a data transmission method applied to the computing apparatus described in the sixth aspect, the method including:

receiving, through the on-chip storage data path control circuit, a data transmission instruction sent by the first on-chip cache circuit of the first computing carrier among the multiple computing carriers;

decoding the data transmission instruction through the on-chip storage data path control circuit to obtain a sending data address and a receiving data address, the receiving data address being an address in the second on-chip cache circuit of the second computing carrier among the multiple computing carriers;

obtaining target data according to the sending data address through the on-chip cache circuit data path, and transmitting the target data to the receiving data address.
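The three method steps above can be sketched as follows, modelling each on-chip cache as a dictionary and a data transmission instruction as a pair of (carrier, offset) addresses. The instruction format and function names are assumptions made only for illustration.

```python
def decode(instruction):
    """Decoding step of the on-chip storage data path control circuit:
    extract the sending data address and the receiving data address."""
    return instruction["send_addr"], instruction["recv_addr"]

def transfer(caches, instruction):
    """Data path step: read the target data at the send address in one
    carrier's on-chip cache and write it at the receive address in
    another carrier's on-chip cache."""
    (src_carrier, src_off), (dst_carrier, dst_off) = decode(instruction)
    target_data = caches[src_carrier][src_off]   # obtain target data
    caches[dst_carrier][dst_off] = target_data   # transmit to receive address
```

A transfer between carrier 0 and carrier 1 then amounts to one decode followed by one read-write across the shared data path.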
In a tenth aspect, an embodiment of the present application provides another computing apparatus, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for some or all of the steps described in the ninth aspect.

In an eleventh aspect, an embodiment of the present application provides a computer-readable storage medium. The computer storage medium stores a computer program including program instructions that, when executed by a processor, cause the processor to execute the method of the ninth aspect.
Brief Description of the Drawings

To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application, and a person of ordinary skill in the art may obtain other drawings from these drawings without creative effort.
FIG. 1-1a is a schematic diagram of a neural network training device according to an embodiment of the present application.
FIG. 1-1b is a schematic diagram of a chip connection structure of a computing apparatus according to an embodiment of the present application.
FIG. 1-1c is a schematic diagram of a chip connection structure of another computing apparatus according to an embodiment of the present application.
FIG. 1-1d is a schematic diagram of a chip connection structure of another computing apparatus according to an embodiment of the present application.
FIG. 1-1e is a schematic structural diagram of a neural network chip according to an embodiment of the present application.
FIG. 1-1f is a schematic diagram of a scheduling strategy for master-chip operation results according to an embodiment of the present application.
FIG. 1-1g is a schematic structural diagram of a combined computing apparatus according to an embodiment of the present application.
FIG. 1-2 is a schematic diagram of a combined processing device according to an embodiment of the present application.
FIG. 1-3 is a structural diagram of another combined processing device according to an embodiment of the present application.
FIG. 1-3a is a schematic structural diagram of a board card according to an embodiment of the present application.
FIG. 2-1 is a schematic structural diagram of a computing apparatus according to an embodiment of the present application;
FIG. 2-1a is a schematic structural diagram of a computing unit according to an embodiment of the present application;
FIG. 2-1b is a schematic structural diagram of a main processing circuit according to an embodiment of the present application;
FIG. 2-1c is a schematic diagram of data distribution of a computing unit according to an embodiment of the present application;
FIG. 2-1d is a schematic diagram of data return of a computing unit according to an embodiment of the present application;
FIG. 2-1e is a schematic structural diagram of an on-chip storage data path control circuit according to an embodiment of the present application;
FIG. 2-1f is a schematic structural diagram of a memory management unit according to an embodiment of the present application;
FIG. 2-2 is a schematic flowchart of a data transmission method according to an embodiment of the present application;
FIG. 2-3 is a schematic structural diagram of a combined processing device according to an embodiment of the present application;
FIG. 2-4 is a schematic structural diagram of a board card according to an embodiment of the present application.
Detailed Description

To enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.

Reference to "an embodiment" herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are they independent or alternative embodiments mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
The neural network training device involved in this application is first introduced. As shown in FIG. 1-1a, the neural network training device is composed of multiple neural network chips. The multiple neural network chips execute multiple tasks, or a single task is split and scheduled according to the characteristics of the deep learning algorithm, so that the chips cooperate to complete the training task. The arrangement and cooperation of the multiple neural network chips in the neural network training device are described in detail in the following embodiments.

A training device according to an embodiment of the present application includes X groups of neural network chips. Each group of the X groups of neural network chips includes one master chip and at least one slave chip, the master chip is connected to the slave chips, the master chips of the X groups of neural network chips are connected to one another, and X is an integer greater than or equal to 2.

Each neural network chip in the X groups of neural network chips is configured to obtain input data and weights, and to operate on the weights and the input data corresponding to that chip to obtain an operation result, where the input data obtained by each neural network chip is different and the obtained weights are the same. A first master chip in a first group of the X groups of neural network chips is configured to receive the operation results of the slave chips connected to the first master chip. The first master chip is configured to share the operation result of the first master chip and the received operation results of the slave chips with the master chips of the other groups of neural network chips, and to receive the operation results shared by the master chips of the other groups.
Specifically, X may be any integer greater than or equal to 2, such as 2, 3, 5, or 8. In the X groups of neural network chips, each group includes one master chip and at least one slave chip, and the number of slave chips in different groups may be the same or different. For example, when X is 3 and there are 10 slave chips in total, the master chips of the first two groups may each be connected to 3 slave chips while the master chip of the last group is connected to 4 slave chips. Preferably, the slave chips are divided equally among the master chips, so that each master chip can receive the operation results of its slave chips and quickly schedule the operation results among the master chips.

Please refer to FIG. 1-1b, which shows a chip connection structure of a computing apparatus according to an embodiment of the present application. As shown in FIG. 1-1b, X is 4; chip 4, chip 8, chip 13, and chip 10 are master chips, and each master chip is connected to 3 slave chips. Chips 1 to 16 all obtain input data and weights, where the input data obtained by each chip is different while the weights are the same, so every chip trains different input data with the same training model. The input data of each chip may correspond to multiple tasks, or may be a shard of the data set of a single task; the data set may be split in an external device, in another module of the computing apparatus, or in the master chip of one group of neural network chips in the computing apparatus.

Since the input data of each chip in the computing apparatus is different while the weights are the same, the obtained operation results differ. After all chips complete training and obtain operation results, the first master chip receives the operation results of the slave chips connected to it. The first master chip may be any one of master chip 4, master chip 8, master chip 10, and master chip 13; each obtains the operation results of the slave chips connected to itself, so that a master chip finally holds its own operation result together with the operation results of its connected slave chips.

After the first master chip obtains the operation results of its slave chips, it shares all the operation results it holds among the X master chips. During sharing, the operation results are passed cyclically in the same direction, for example clockwise (chip 4 → chip 8 → chip 13 → chip 10 → chip 4) or counterclockwise (chip 4 → chip 10 → chip 13 → chip 8 → chip 4). A master chip may pass all of its operation results to the next adjacent master chip at once, or pass them gradually in several steps.

It can be seen that, on one hand, this connection structure improves data training efficiency through the cooperative operation of multiple chips; on the other hand, the operation results of the slave chips are scheduled through the master chips, so only the performance of the master chips needs to be improved rather than that of the slave chips, which saves cost.
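The cyclic sharing among master chips described above behaves like an all-gather over a ring: in X − 1 steps, each master forwards to its clockwise neighbour the block of results it received in the previous step, after which every master holds the results of all groups. A minimal sketch follows; the dictionary-based representation of operation results is an assumption for illustration.

```python
def ring_share(master_results):
    """All-gather over the ring of master chips: master_results[i] is the
    dict of results master i starts with (its own plus its slaves').
    Returns, for each master, the merged results of every group."""
    x = len(master_results)
    gathered = [dict(r) for r in master_results]  # accumulated per master
    sending = [dict(r) for r in master_results]   # block sent this step
    for _ in range(x - 1):
        # each master receives from its counter-clockwise neighbour,
        # i.e. results travel clockwise: i-1 -> i
        received = [sending[(i - 1) % x] for i in range(x)]
        for i in range(x):
            gathered[i].update(received[i])
        sending = received  # forward what was just received
    return gathered
```

Each step moves only one block per master, matching the "pass gradually in several steps" variant; passing everything at once corresponds to merging `gathered` into `sending` each round.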
Optionally, the first master chip is further configured to transmit all the operation results held by the first master chip to the slave chips connected to the first master chip.
After the shared transfer, master chip 4, master chip 8, master chip 10, and master chip 13 each hold the operation results of all chips; each master chip then passes the operation results it holds to its respectively connected slave chips, so that every slave chip contains the operation results of all chips.
Optionally, the master chip is connected to the slave chips through a tree structure, where the tree structure is an n-ary tree, the master chip is the root node of the n-ary tree, and the slave chips are child nodes of the n-ary tree; a child node may be a first-level child node or a multi-level child node.
Specifically, the master chip in each of the X groups of neural network chips may be connected to its slave chips through a tree structure, in which the master chip is the root node and the slave chips are child nodes; a child node may be a first-level child node or a multi-level child node. When the master chip obtains the operation results of the slave chips, it may obtain the operation result of each slave chip directly, or the slave chips directly connected to the master chip may obtain the operation results of the other slave chips and then pass them on to the master chip.
It can be seen that this connection structure can, on the one hand, improve data training efficiency through the cooperative operation of multiple chips; on the other hand, the master chip can schedule the operation results of the individual slave chips, so that only the performance of the master chip needs to be improved rather than that of the slave chips, which saves cost. Moreover, since the slave chips are connected to the master chip through a tree structure, the operation results of the slave chips can be aggregated before being sent to the master chip, which reduces the computational pressure on the master chip and in turn reduces wear on the master chip.
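The tree-based aggregation described above can be sketched as follows. This is an illustrative model only, not part of the claimed apparatus; the node numbers follow FIG. 1-1c and the result values are assumed placeholders.

```python
# Minimal sketch of tree-based result aggregation: each slave chip is a node
# in an n-ary tree rooted at the master chip. Intermediate slave chips merge
# the results of their children before forwarding, so the master receives
# pre-consolidated results rather than one message per leaf.

def aggregate(node, children, results):
    """Recursively collect the results of the subtree rooted at `node`."""
    merged = [results[node]]                 # the node's own operation result
    for child in children.get(node, []):
        merged.extend(aggregate(child, children, results))
    return merged

# Tree from FIG. 1-1c: master 31 has first-level children 311, 312, 313;
# slave 311 has second-level children 3111, 3112, 3113.
children = {31: [311, 312, 313], 311: [3111, 3112, 3113]}
results = {n: f"r{n}" for n in [31, 311, 312, 313, 3111, 3112, 3113]}

all_results = aggregate(31, children, results)
assert len(all_results) == 7   # the master ends up holding every chip's result
```

Here aggregation is modelled as list concatenation; in the apparatus it would be whatever consolidation the master or intermediate slave chips perform on the operation results.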
Referring to FIG. 1-1c, which shows another chip connection structure of a computing device provided by an embodiment of the present application: as shown in FIG. 1-1c, X is 4, and the four groups of neural network chips contain master chip 31, master chip 32, master chip 33, and master chip 34, each of which is connected to its slave chips through a tree structure. For example, master chip 31 is the root node; the slave chips connected to it, namely chip 311, chip 312, and chip 313, are first-level child nodes, and the slave chips connected to slave chip 311, namely chip 3111, chip 3112, and chip 3113, are second-level child nodes. The other slave chips are likewise first-level or second-level child nodes.
Alternatively, referring to FIG. 1-1d, which is a schematic diagram of another chip connection structure of a computing device provided by an embodiment of the present application: as shown in FIG. 1-1d, X is 1, master chip 35 is connected to its slave chips through a tree structure, and the tree structure includes three levels of child nodes. The operation results of the leaf nodes at the lowest level may be passed directly to the master chip, or may be aggregated by the slave chips at the level above and then passed to the master chip.
The neural network computing device involved in the embodiments of the present application includes a neural network chip. Referring to FIG. 1-1e, which is a schematic structural diagram of a neural network chip provided by an embodiment of the present application: as shown in FIG. 1-1e, the neural network chip includes an operation unit 12 and a controller unit 11, and the operation unit 12 includes one master processing circuit 101 and a plurality of slave processing circuits 102.
The controller unit 11 is configured to obtain input data and a calculation instruction. In an optional solution, the input data and the calculation instruction may specifically be obtained through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
The above calculation instruction includes, but is not limited to, a forward operation instruction, a backward training instruction, or another neural network operation instruction such as a convolution operation instruction; the specific embodiments of the present application do not limit the specific form of the above calculation instruction.
The controller unit 11 is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and to send the plurality of operation instructions and the input data to the master processing circuit. The master processing circuit 101 is configured to perform preamble processing on the input data and to transfer data and operation instructions with the plurality of slave processing circuits. The plurality of slave processing circuits 102 are configured to perform intermediate operations in parallel according to the data and operation instructions transferred from the master processing circuit to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the master processing circuit. The master processing circuit 101 is further configured to perform subsequent processing on the plurality of intermediate results to obtain the operation result of the calculation instruction.
The technical solution provided in this application configures the operation unit as a one-master multi-slave structure. For a calculation instruction of a forward operation, the operation unit can split the data according to the calculation instruction, so that the computationally intensive part can be operated on in parallel by the plurality of slave processing circuits, thereby increasing the operation speed, saving operation time, and in turn reducing power consumption.
Optionally, the above neural network chip is specifically used for artificial neural network operations, and the above input data may specifically include input neuron data and weight data. The above operation result may specifically be the result of the artificial neural network operation, i.e., output neuron data.
An operation in the neural network may be an operation of one layer of the neural network. For a multilayer neural network, the implementation process is as follows. In the forward operation, after the execution of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output neurons computed by the operation unit as the input neurons of the next layer (or performs certain operations on those output neurons before using them as the input neurons of the next layer) and, at the same time, replaces the weights with the weights of the next layer. In the backward operation, after the backward operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the input neuron gradients computed in the operation unit as the output neuron gradients of the next layer (or performs certain operations on those input neuron gradients before using them as the output neuron gradients of the next layer) and, at the same time, replaces the weights with the weights of the next layer.
For an artificial neural network operation with multiple layers of operations, the input neurons and output neurons of the multilayer operation do not refer to the neurons in the input layer and the output layer of the entire neural network. Rather, for any two adjacent layers in the network, the neurons in the lower layer of the network's forward operation are the input neurons, and the neurons in the upper layer of the network's forward operation are the output neurons. Taking a convolutional neural network as an example, suppose a convolutional neural network has L layers, with K = 1, 2, ..., L-1; for the K-th layer and the (K+1)-th layer, the K-th layer is called the input layer, in which the neurons are the said input neurons, and the (K+1)-th layer is called the output layer, in which the neurons are the said output neurons. That is, except for the top layer, each layer can serve as an input layer, and its next layer is the corresponding output layer.
Optionally, the above neural network chip may further include a storage unit 10 and a direct memory access unit 50. The storage unit 10 may include one or any combination of a register 201 and a cache 202. Specifically, the cache is configured to store the calculation instruction; the register is configured to store the input data and scalars; the cache is a scratchpad cache. The direct memory access unit 50 is configured to read data from, or store data to, the storage unit 10.
Optionally, the controller unit includes an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;
the instruction storage unit 110 is configured to store a calculation instruction associated with the artificial neural network operation;
the instruction processing unit 111 is configured to parse the calculation instruction to obtain a plurality of operation instructions;
the storage queue unit 113 is configured to store an instruction queue, the instruction queue including a plurality of operation instructions or calculation instructions to be executed in the front-to-back order of the queue.
By way of example, in an optional technical solution, the master operation processing circuit may also include a controller unit, and this controller unit may include a master instruction processing unit specifically configured to decode instructions into microinstructions. Of course, in another optional solution, the slave operation processing circuit may also include another controller unit, which includes a slave instruction processing unit specifically configured to receive and process microinstructions. The above microinstruction may be a next-level instruction of an instruction; the microinstruction may be obtained by splitting or decoding the instruction, and can be further decoded into control signals for the individual components, units, or processing circuits.
Optionally, the controller unit 11 may further include:
a dependency processing unit 112, configured to, when there are a plurality of operation instructions, determine whether a first operation instruction is associated with a zeroth operation instruction preceding the first operation instruction; if the first operation instruction is associated with the zeroth operation instruction, cache the first operation instruction in the instruction storage unit, and after the zeroth operation instruction has finished executing, fetch the first operation instruction from the instruction storage unit and transmit it to the operation unit;
where determining whether the first operation instruction is associated with the zeroth operation instruction preceding the first operation instruction includes:
extracting, according to the first operation instruction, a first storage address interval of the data (for example, a matrix) required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the matrix required by the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval have an overlapping region, determining that the first operation instruction is associated with the zeroth operation instruction; and if the first storage address interval and the zeroth storage address interval have no overlapping region, determining that the first operation instruction is not associated with the zeroth operation instruction.
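The address-interval overlap test described above can be illustrated as follows. This is a minimal sketch, not the claimed circuit; the half-open interval representation and the example addresses are assumptions for illustration.

```python
# Dependency check between operation instructions: two instructions are
# considered associated when the storage address intervals of the data they
# require overlap. Intervals are modelled as half-open [start, end) ranges.

def intervals_overlap(a_start, a_end, b_start, b_end):
    # [a_start, a_end) and [b_start, b_end) overlap iff each starts
    # before the other ends.
    return a_start < b_end and b_start < a_end

def has_dependency(first_instr, zeroth_instr):
    """Each argument is the (start, end) address interval of required data."""
    return intervals_overlap(*first_instr, *zeroth_instr)

# Overlapping intervals -> the first instruction must wait for the zeroth.
assert has_dependency((0x100, 0x200), (0x180, 0x280)) is True
# Disjoint intervals -> the instructions may be issued independently.
assert has_dependency((0x100, 0x200), (0x200, 0x300)) is False
```

When a dependency is found, the first instruction would be buffered in the instruction storage unit and issued only after the zeroth instruction completes, as the paragraph above describes.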
In another optional embodiment, when the neural network chip is a master chip, the controller unit 11 further includes a scheduling unit 114 configured to schedule the operation results in the master chip.
Specifically, the master chips of the groups of neural network chips need to schedule the operation results among themselves, so that all master chips share all the operation results held by each master chip. A certain scheduling policy needs to be followed during scheduling. One option is first to consolidate the operation results in the master neural network chip of each of the X groups, including the master chip's own operation result and the operation results received from its slave chips, to obtain X consolidated operation results; the X consolidated results are then scheduled in the same direction following the connection order of the master chips, one consolidated result per scheduling step, and after X² scheduling steps all master chips have obtained the X consolidated operation results. Alternatively, after the X consolidated operation results are obtained, they are scheduled in the same direction following the connection order of the master chips; after the next master chip receives the operation result passed by the previous master chip, it consolidates the received result with its own operation result to form a new operation result and passes that on to the master chip after it, and after 2*(X-1) scheduling steps all master chips have obtained the X consolidated operation results. It is also possible to partially consolidate, or not consolidate, the operation results in the X master chips and then perform multiple partial scheduling steps between the master chips.
In an optional embodiment, scheduling the operation results in the master chips includes: each master chip in the X groups of neural network chips dispatches 1/(Y+1) of its operation content to the master chip connected to it in the same direction, where the same direction is either clockwise or counterclockwise, and Y is the number of slave chips connected to a master chip in the X groups of neural network chips.
Referring to FIG. 1-1f, which shows an operation-result scheduling strategy between master chips provided by an embodiment of the present application: as shown in FIG. 1-1f, and corresponding to FIG. 1-1b, there are 4 groups of neural network chips whose master chips are chip 4, chip 8, chip 13, and chip 10. The operation results in master chip 4 include its own operation result together with the received operation results of chip 1, chip 2, and chip 3; these four operation results correspond to the four parts a1, b1, c1, and d1. Correspondingly, the operation results of chip 8 correspond to the four parts a2, b2, c2, and d2; those of chip 13 to a3, b3, c3, and d3; and those of chip 10 to a4, b4, c4, and d4. Scheduling proceeds clockwise: in the first scheduling step, chip 4 dispatches part a1 to chip 8, chip 8 dispatches part b2 to chip 13, chip 13 dispatches part c3 to chip 10, and chip 10 dispatches part d4 to chip 4. These transfers may take place at the same moment or at different moments. In each scheduling step, each master chip dispatches 1/(Y+1) of its operation content; after (X-1)² scheduling steps, all master chips have obtained all the operation results and the scheduling is complete. This scheduling manner can save the consolidation time of the individual chips and improve scheduling efficiency.
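The clockwise part-wise sharing of FIG. 1-1f can be simulated in a simplified form. This is an illustrative model under assumptions, not the claimed scheduling circuit: each chip is assumed to forward, in every step, one part that its clockwise neighbour still lacks, and the simulation simply runs until all masters hold all parts.

```python
# Simplified simulation of the ring sharing among the master chips of
# FIG. 1-1b/1-1f: 4 master chips in a clockwise ring (4 -> 8 -> 13 -> 10 -> 4),
# each starting with its own four result parts (chip 4 holds a1..d1, etc.).
# Per step, each chip forwards one part its clockwise neighbour lacks.

ring = [4, 8, 13, 10]
parts = {4: ["a1", "b1", "c1", "d1"], 8: ["a2", "b2", "c2", "d2"],
         13: ["a3", "b3", "c3", "d3"], 10: ["a4", "b4", "c4", "d4"]}
held = {c: set(p) for c, p in parts.items()}
all_parts = set().union(*held.values())     # 16 parts in total

steps = 0
while any(held[c] != all_parts for c in ring):
    transfers = []                          # decide all sends on a snapshot
    for i, chip in enumerate(ring):
        nxt = ring[(i + 1) % len(ring)]     # clockwise neighbour
        missing = sorted(held[chip] - held[nxt])
        if missing:
            transfers.append((nxt, missing[0]))
    for nxt, part in transfers:             # apply sends simultaneously
        held[nxt].add(part)
    steps += 1

assert all(held[c] == all_parts for c in ring)  # every master holds all parts
```

The greedy forwarding rule here is only one possible realization of the policy; the patent leaves open which part is dispatched in which step, so the step count of this simulation is not claimed to match the (X-1)² figure quoted above.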
In an optional embodiment, the master processing circuit 101 is specifically configured to combine and sort the intermediate results sent by the plurality of slave processing circuits 102 to obtain the result of the calculation instruction;
or the master processing circuit 101 is specifically configured to combine and sort the intermediate results sent by the plurality of slave processing circuits 102 and then perform activation processing to obtain the result of the calculation instruction.
In an optional embodiment, the master processing circuit includes one or any combination of a conversion processing circuit, an activation processing circuit, and an addition processing circuit;
the conversion processing circuit is configured to perform the preamble processing on the data, specifically: performing an interchange between a first data structure and a second data structure on the data or intermediate results received by the master processing circuit, or performing an interchange between a first data type and a second data type on the data or intermediate results received by the master processing circuit;
the activation processing circuit is configured to perform the subsequent processing, specifically to perform an activation operation on data in the master processing circuit;
the addition processing circuit is configured to perform the subsequent processing, specifically to perform an addition operation or an accumulation operation.
The slave processing circuit includes a multiplication processing circuit;
the multiplication processing circuit is configured to perform a product operation on a received data block to obtain a product result.
Optionally, the slave processing circuit further includes an accumulation processing circuit, the accumulation processing circuit being configured to perform an accumulation operation on the product result to obtain the intermediate result.
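The division of labour among these circuits can be sketched functionally as follows. This is an illustrative sketch under assumptions (the function names, the ReLU activation, and the even split of weight rows across two slave circuits are all chosen for the example), not the claimed hardware.

```python
# One-master multi-slave computation: each slave processing circuit performs
# product operations and accumulates them into intermediate results for its
# share of the rows; the master processing circuit combines the intermediate
# results and applies the activation as the subsequent processing.

def slave_circuit(weight_rows, inputs):
    # product operation followed by accumulation -> one intermediate result/row
    return [sum(w * x for w, x in zip(row, inputs)) for row in weight_rows]

def master_circuit(intermediate_results, activation):
    # subsequent processing: combine the slaves' results, then activate
    combined = [v for chunk in intermediate_results for v in chunk]
    return [activation(v) for v in combined]

relu = lambda v: max(0.0, v)
weights = [[1.0, -2.0], [0.5, 0.5], [-1.0, 1.0], [2.0, 0.0]]
inputs = [3.0, 1.0]

# the operation unit splits the weight rows across two slave circuits
chunks = [weights[:2], weights[2:]]
intermediates = [slave_circuit(c, inputs) for c in chunks]
output = master_circuit(intermediates, relu)
assert output == [1.0, 2.0, 0.0, 6.0]
```

In the apparatus the two `slave_circuit` calls would run in parallel on separate slave processing circuits; here they are sequential purely for illustration.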
An embodiment of the present application also relates to another combined computing device, the combined computing device including M computing devices as described in Embodiment 1, the M computing devices being connected to one another, where M is an integer greater than or equal to 2.
Referring to FIG. 1-1g, which is a schematic structural diagram of a combined computing device provided by an embodiment of the present application: as shown in FIG. 1-1g, the combined computing device is formed by combining four computing devices as shown in FIG. 1-1b. The four computing devices are connected to one another; they may be bridged through circuits, connected by providing a dedicated connection module, or connected through the master chips of the four computing devices. This connection structure can, on the one hand, improve data training efficiency through the cooperative operation of multiple chips; on the other hand, the master chip can schedule the operation results of the individual slave chips, so that only the performance of the master chip needs to be improved rather than that of the slave chips, which saves cost. Furthermore, selecting one master chip from the multiple groups of master chips to connect to an external master chip reduces wear on the master chips and extends their service life.
In an optional embodiment, connecting the M computing devices as in Embodiment 1 includes: in each of the M computing devices as in Embodiment 1, the master chip of one group of neural network chips among the X groups of neural network chips it contains is used to connect to the master chip of one group of neural network chips among the X groups of neural network chips in another computing device.
As shown in FIG. 1-1g, each of the four computing devices as in Embodiment 1 contains four groups of neural network chips, and the master chip of one of those groups is used to connect to the master chip of one of the four groups of neural network chips in another computing device; for example, master chip 502, master chip 507, master chip 512, and master chip 510 are connected. When selecting the master chip of one of the X groups of neural network chips, the selection may be random or may follow a selection strategy, for example selecting the master chip connected to the most slave chips, or selecting the master chip at the shortest physical distance from the four groups of neural network chips in the other computing device.
It can be seen that, in the embodiments of the present application, multiple groups of neural network chips are each divided into a master chip and slave chips; the master chip obtains the operation results of the slave chips and schedules the operation results among the master chips of the different groups, so that the master chip of every group contains all the operation results; each master chip then distributes all the operation results to its slave chips, which increases the training speed of the neural network chips and saves training time.
This application also discloses a combined processing device, which includes the above computing device, a universal interconnection interface, and other processing devices. The computing device interacts with the other processing devices to jointly complete an operation specified by the user. FIG. 1-2 is a schematic diagram of the combined processing device.
The other processing devices include one or more processor types among general-purpose and special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the computing device and external data and control, performing functions including data transfer and basic control of the computing device such as starting and stopping; the other processing devices may also cooperate with the computing device to jointly complete computing tasks.
The universal interconnection interface is used to transmit data and control instructions between the computing device and the other processing devices. The computing device obtains the required input data from the other processing devices and writes it to an on-chip storage device of the computing device; it may obtain control instructions from the other processing devices and write them to an on-chip control cache of the computing device; it may also read data from a storage module of the computing device and transmit it to the other processing devices.
Optionally, as shown in FIG. 1-3, the structure may further include a storage device, the storage device being connected to the computing device and the other processing devices, respectively. The storage device is used to store data of the computing device and the other processing devices, and is particularly suitable for data to be computed that cannot be stored in full in the internal storage of the computing device or the other processing devices.
The combined processing device can serve as a system-on-chip (SoC) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control portion, increasing processing speed, and lowering overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
In some embodiments, a chip is also claimed, which includes the above computing device or combined processing device.
In some embodiments, a chip package structure is claimed, which includes the above chip.
In some embodiments, a board card is claimed, which includes the above chip package structure. Referring to FIG. 1-3a, which provides a board card: in addition to the above chip 389, the board card may include other supporting components, including but not limited to a memory device 390, an interface device 391, and a control device 392;
the memory device 390 is connected to the chip in the chip package structure through a bus and is configured to store data. The memory device may include a plurality of groups of memory cells 393, each group of memory cells being connected to the chip through a bus. It can be understood that each group of memory cells may be DDR SDRAM (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory).
DDR can double the speed of SDRAM without increasing the clock frequency, since DDR allows data to be read on both the rising edge and the falling edge of the clock pulse; DDR is thus twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of the memory cells. Each group of memory cells may include a plurality of DDR4 granules (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, in which 64 bits of the 72-bit controller are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 granules are used in each group of memory cells, the theoretical bandwidth of data transmission can reach 25600 MB/s.
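The 25600 MB/s figure can be reproduced from the parameters stated above; the following is a simple arithmetic check, assuming the standard DDR4-3200 rate of 3200 mega-transfers per second.

```python
# Theoretical per-controller DDR4-3200 bandwidth from the figures above:
# 3200 mega-transfers per second over a 64-bit (8-byte) data path.
# The remaining 8 bits of the 72-bit controller carry ECC, not data.
transfers_per_second = 3200        # MT/s for DDR4-3200
data_bytes_per_transfer = 64 // 8  # 64 data bits = 8 bytes
bandwidth_mb_s = transfers_per_second * data_bytes_per_transfer
assert bandwidth_mb_s == 25600     # matches the 25600 MB/s quoted in the text
```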
In one embodiment, each group of the memory cells includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice within one clock cycle. A controller for controlling the DDR is provided in the chip and is used to control the data transmission and data storage of each memory cell.
The interface device is electrically connected to the chip in the chip package structure. The interface device is used to implement data transmission between the chip and an external device (for example, a server or a computer). For example, in one embodiment, the interface device may be a standard PCIe interface; for instance, the data to be processed is transferred from a server to the chip through the standard PCIe interface to implement data transfer. Preferably, when a PCIe 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present application does not limit the specific form of such other interfaces, as long as the interface unit can implement the transfer function. In addition, the operation result of the chip is still transmitted by the interface device back to the external device (for example, a server).
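The 16000 MB/s figure likewise follows from the raw signalling rate; the arithmetic sketch below assumes the standard PCIe 3.0 rate of 8 GT/s per lane and ignores the 128b/130b line encoding, which is why the quoted number is a theoretical upper bound.

```python
# Theoretical PCIe 3.0 x16 bandwidth as quoted above (raw signalling rate;
# the 128b/130b encoding reduces the usable figure slightly in practice).
gt_per_s_per_lane = 8              # PCIe 3.0 signalling rate per lane, GT/s
lanes = 16                         # x16 link
# 8 GT/s = 8000 mega-transfers/s of 1 bit per lane; divide by 8 bits/byte
bandwidth_mb_s = gt_per_s_per_lane * 1000 * lanes // 8
assert bandwidth_mb_s == 16000     # matches the 16000 MB/s quoted in the text
```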
The control device is electrically connected to the chip and is configured to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller (Micro Controller Unit, MCU). Since the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, it can drive multiple loads; therefore, the chip can be in different working states such as multi-load and light-load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
In the field of information processing technology, with respect to data transmission, neural networks are the basis of many current artificial intelligence applications. As the application scope of neural networks expands further, numerous neural network models and large volumes of requests have appeared. In the prior art, neural network computation can be performed in parallel on heterogeneous computing carriers; therefore, how to improve the data transmission efficiency between heterogeneous computing devices is a technical problem to be solved by those skilled in the art.
In order to solve the above problems, the following scheme is proposed.
In this application, the computing device may include various handheld devices with wireless communication functions, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices, and the like. The computing device may also include a system on chip (SoC).
In this application, the computing carrier may be a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a coarse-grained reconfigurable array (CGRA), a digital signal processor (DSP), or the like.
The embodiments of the present application provide a data transmission method and related products, which can improve the data transmission efficiency between different computing carriers and thereby help improve the efficiency of neural network operations. The present application is further described in detail below with reference to specific embodiments and the accompanying drawings.
Please refer to FIG. 2-1, which is a schematic structural diagram of a computing device provided by an embodiment of the present application. As shown in FIG. 2-1, the computing device 100 includes a plurality of computing carriers such as a first computing carrier 101, a second computing carrier 102, and an N-th computing carrier 103, where N is a positive integer greater than 2. The plurality of computing carriers may include at least two of the above-mentioned CPU, GPU, ASIC, FPGA, CGRA, or DSP, and may also include two carriers of the same type, for example, 2 CPUs, 2 GPUs, 1 ASIC, or 1 FPGA.
In a possible implementation, each computing carrier may include at least one computing unit for neural network operations, for example, a processing chip. The specific structure of the computing unit is not limited. Please refer to FIG. 2-1a, which is a schematic structural diagram of a computing unit. As shown in FIG. 2-1a, the computing unit includes a main processing circuit, basic processing circuits, and branch processing circuits. Specifically, the main processing circuit is connected to the branch processing circuits, and each branch processing circuit is connected to at least one basic processing circuit.
该分支处理电路,用于收发主处理电路或基本处理电路的数据。The branch processing circuit is used to send and receive data from the main processing circuit or the basic processing circuit.
Referring to FIG. 2-1b, a schematic structural diagram of the main processing circuit: as shown in FIG. 2-1b, the main processing circuit may include a register and/or an on-chip cache circuit, and may further include circuits such as a control circuit, a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, and a DMA (Direct Memory Access) circuit. Of course, in practical applications, the main processing circuit may also include other circuits such as a conversion circuit (for example, a matrix transposition circuit), a data rearrangement circuit, or an activation circuit.
The main processing circuit further includes a data sending circuit, a data receiving circuit, or an interface. The data sending circuit may integrate a data distribution circuit and a data broadcasting circuit; of course, in practical applications, the data distribution circuit and the data broadcasting circuit may also be provided separately. In practical applications, the data sending circuit and the data receiving circuit may also be integrated together to form a data transceiving circuit. Broadcast data is data that needs to be sent to every basic processing circuit. Distribution data is data that needs to be selectively sent to some of the basic processing circuits; the specific selection may be determined by the main processing circuit according to the load and the computation method. In the broadcast sending mode, the broadcast data is sent to every basic processing circuit in broadcast form. (In practical applications, the broadcast data may be sent to every basic processing circuit by a single broadcast or by multiple broadcasts; the specific embodiments of this application do not limit the number of broadcasts.) In the distribution sending mode, the distribution data is selectively sent to some of the basic processing circuits.
When distributing data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits (the data may be the same or different; specifically, if the data is sent by distribution, the data received by each receiving basic processing circuit may be different, and of course some basic processing circuits may also receive the same data).
Specifically, when broadcasting data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits, and each receiving basic processing circuit may receive the same data; that is, the broadcast data may include data that all basic processing circuits need to receive, while the distribution data may include data that only some of the basic processing circuits need to receive. The main processing circuit may send the broadcast data to all branch processing circuits through one or more broadcasts, and the branch processing circuits forward the broadcast data to all the basic processing circuits.
Optionally, the vector operator circuit of the main processing circuit can perform vector operations, including but not limited to: addition, subtraction, multiplication, and division of two vectors; addition, subtraction, multiplication, and division of a vector and a constant; or arbitrary operations on each element of a vector. Among them, the continuous operations may specifically be addition, subtraction, multiplication, and division of a vector and a constant, activation operations, accumulation operations, and the like.
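The vector-and-constant operations listed above can be sketched in plain Python for illustration; note that the patent describes a hardware vector operator circuit, so this is only a software model of its behavior:

```python
# Illustrative model of a vector operator applying an elementwise
# operation between a vector and a constant (add/sub/mul/div).
def vector_scalar_op(vec, c, op):
    ops = {
        "add": lambda x: x + c,
        "sub": lambda x: x - c,
        "mul": lambda x: x * c,
        "div": lambda x: x / c,
    }
    return [ops[op](x) for x in vec]

print(vector_scalar_op([1.0, 2.0, 3.0], 2.0, "mul"))  # [2.0, 4.0, 6.0]
```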
Each basic processing circuit may include a basic register and/or a basic on-chip cache circuit; each basic processing circuit may further include one or any combination of an inner-product operator circuit, a vector operator circuit, an accumulator circuit, and the like. The inner-product operator circuit, the vector operator circuit, and the accumulator circuit may all be integrated circuits, or they may be separately provided circuits.
The connection structure between the branch processing circuits and the basic circuits may be arbitrary and is not limited to the H-shaped structure in FIG. 2-1b. Optionally, the path from the main processing circuit to the basic circuits is a broadcast or distribution structure, and the path from the basic circuits to the main processing circuit is a gather structure. Broadcast, distribution, and gather are defined as follows:
所述主处理电路到基础电路的数据传递方式可以包括:The data transmission mode from the main processing circuit to the basic circuit may include:
主处理电路与多个分支处理电路分别相连,每个分支处理电路再与多个基础电路分别相连。The main processing circuit is respectively connected to a plurality of branch processing circuits, and each branch processing circuit is respectively connected to a plurality of basic circuits.
主处理电路与一个分支处理电路相连,该分支处理电路再连接一个分支处理电路,依次类推,串联多个分支处理电路,然后,每个分支处理电路再与多个基础电路分别相连。The main processing circuit is connected to a branch processing circuit, and the branch processing circuit is further connected to a branch processing circuit, and so on, a plurality of branch processing circuits are connected in series, and then each branch processing circuit is respectively connected to a plurality of basic circuits.
主处理电路与多个分支处理电路分别相连,每个分支处理电路再串联多个基础电路。The main processing circuit is respectively connected to a plurality of branch processing circuits, and each branch processing circuit is further connected in series with a plurality of basic circuits.
主处理电路与一个分支处理电路相连,该分支处理电路再连接一个分支处理电路,依次类推,串联多个分支处理电路,然后,每个分支处理电路再串联多个基础电路。The main processing circuit is connected to a branch processing circuit. The branch processing circuit is further connected to a branch processing circuit, and so on, and a plurality of branch processing circuits are connected in series. Then, each branch processing circuit is connected in series to a plurality of basic circuits.
分发数据时,主处理电路向部分或者全部基础电路传输数据,各个接收数据的基础电路收到的数据可以不同;When distributing data, the main processing circuit transmits data to some or all of the basic circuits, and the data received by each basic circuit that receives the data may be different;
广播数据时,主处理电路向部分或者全部基础电路传输数据,各个接收数据的基础电路收到相同的数据。When broadcasting data, the main processing circuit transmits data to some or all of the basic circuits, and each basic circuit that receives the data receives the same data.
When gathering data, some or all of the basic circuits transmit data to the main processing circuit. It should be noted that the computing unit shown in FIG. 2-1a may be a separate physical chip; of course, in practical applications, the computing unit may also be integrated into another chip (for example, a CPU or GPU). The specific embodiments of this application do not limit the physical form of the above chip device.
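The three transfer modes defined above (broadcast, distribution, gather) can be modeled in software as a rough sketch; the patent describes hardware data paths, so the functions below are purely illustrative and the reduction used in `gather` is an assumption:

```python
# Illustrative software model of the three transfer modes.
def broadcast(data, n_basic):
    # every basic circuit receives the same data
    return [data] * n_basic

def distribute(chunks, n_basic):
    # chunk i goes to basic circuit i; receivers may get different data
    assert len(chunks) == n_basic
    return list(chunks)

def gather(partial_results):
    # some or all basic circuits return results to the main circuit;
    # summation here stands in for whatever combining step follows
    return sum(partial_results)

print(broadcast([1, 2], 3))  # [[1, 2], [1, 2], [1, 2]]
print(gather([10, 20, 30]))  # 60
```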
Referring to FIG. 2-1c, a schematic diagram of data distribution of a computing unit: the arrows in FIG. 2-1c indicate the direction of data distribution. As shown in FIG. 2-1c, after receiving external data, the main processing circuit splits the external data and distributes it to a plurality of branch processing circuits, and each branch processing circuit sends the split data to the basic processing circuits.
Referring to FIG. 2-1d, a schematic diagram of data return of a computing unit: the arrows in FIG. 2-1d indicate the direction of data return. As shown in FIG. 2-1d, the basic processing circuits return data (for example, inner-product operation results) to the branch processing circuits, and the branch processing circuits then return the data to the main processing circuit.
The input data may specifically be a vector, a matrix, or multi-dimensional (three-dimensional, four-dimensional, or higher) data; a specific value in the input data may be called an element of the input data.
An embodiment of the present disclosure further provides a computation method for the computing unit shown in FIG. 2-1a. The computation method is applied to neural network computation; specifically, the computing unit may be used to perform operations on the input data and weight data of one or more layers of a multi-layer neural network.
Specifically, the computing unit is configured to perform operations on the input data and weight data of one or more layers of a multi-layer neural network being trained;
or the computing unit is configured to perform operations on the input data and weight data of one or more layers of a multi-layer neural network in a forward operation.
The above operations include, but are not limited to, one or any combination of a convolution operation, a matrix-multiply-matrix operation, a matrix-multiply-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
GEMM computation refers to the matrix-matrix multiplication operation in the BLAS library, usually expressed as C = alpha*op(S)*op(P) + beta*C, where S and P are the two input matrices, C is the output matrix, alpha and beta are scalars, and op represents some operation on matrix S or P; in addition, some auxiliary integers are used as parameters to specify the width and height of matrices S and P.
GEMV computation refers to the matrix-vector multiplication operation in the BLAS library, usually expressed as C = alpha*op(S)*P + beta*C, where S is the input matrix, P is the input vector, C is the output vector, alpha and beta are scalars, and op represents some operation on matrix S.
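A reference sketch of the GEMM and GEMV forms above, assuming (as in standard BLAS, though the patent does not pin this down) that `op` is either the identity or a transpose:

```python
# Naive reference implementations of C = alpha*op(S)*op(P) + beta*C
# (GEMM) and C = alpha*op(S)*P + beta*C (GEMV). op defaults to identity.
def transpose(m):
    return [list(row) for row in zip(*m)]

def gemm(alpha, s, p, beta, c, op_s=None, op_p=None):
    s = op_s(s) if op_s else s
    p = op_p(p) if op_p else p
    rows, inner, cols = len(s), len(p), len(p[0])
    return [[alpha * sum(s[i][k] * p[k][j] for k in range(inner)) + beta * c[i][j]
             for j in range(cols)] for i in range(rows)]

def gemv(alpha, s, p, beta, c, op_s=None):
    s = op_s(s) if op_s else s
    return [alpha * sum(s[i][k] * p[k] for k in range(len(p))) + beta * c[i]
            for i in range(len(s))]

S = [[1, 2], [3, 4]]
print(gemv(1.0, S, [1, 1], 0.0, [0, 0]))  # [3.0, 7.0]
```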
This application does not limit the connection relationship between the computing carriers in the computing device, which may be homogeneous or heterogeneous computing carriers, nor does it limit the connection relationship between the computing units in a computing carrier. Executing parallel tasks on the above heterogeneous computing carriers or computing units can improve operation efficiency.
在如图2-1中,每一计算载体还至少包括一个片上缓存电路和一个片外缓存电路,例如:第一计算载体101包括第一片上缓存电路1011和第一片外缓存电路1012,第二计算载体102包括第二片上缓存电路1021和第二片外缓存电路1022,第N计算载体103包括第N片上缓存电路1031和第N片外缓存电路1032。In Figure 2-1, each computing carrier further includes at least one on-chip cache circuit and one off-chip cache circuit. For example, the first computing carrier 101 includes a first on-chip cache circuit 1011 and a first off-chip cache circuit 1012. The second computing carrier 102 includes a second on-chip cache circuit 1021 and a second off-chip cache circuit 1022. The N-th computing carrier 103 includes an N-th on-chip cache circuit 1031 and an N-th off-chip cache circuit 1032.
The on-chip cache circuit may be on-chip memory, specifically including but not limited to double data rate memory (DDRM), dynamic random access memory (DRAM), three-dimensional dynamic random access memory (3D DRAM), three-dimensional static random access memory (3DSRAM), and the like; the off-chip cache circuit may be off-chip memory, specifically including but not limited to shared memory, cache, and the like. The cache may include a multi-layer structure, such as an N-level cache structure including L1 Cache, L2 Cache, ..., LN Cache.
As shown in FIG. 2-1, the computing device 100 further includes an on-chip storage data path control circuit 110 connected to each on-chip cache circuit, and an on-chip storage data path 121 connected to the on-chip storage data path control circuit 110, wherein: the on-chip storage data path control circuit 110 is configured to receive a data transmission instruction sent by the first on-chip cache circuit 1011 of the first computing carrier 101 among the plurality of computing carriers, and to decode the data transmission instruction to obtain a sending data address and a receiving data address; the on-chip cache circuit data path 121 is configured to obtain target data according to the sending data address, and to transmit the target data to the receiving data address.
Here, the first computing carrier 101 is any one of the plurality of computing carriers, and the data transmission instruction is a binary file. In this application, the data transmission instruction is decoded to obtain the sending data address and the receiving data address, and parameters for determining the target data, such as data capacity and data identifier, can also be obtained. The sending data address is the address where the target data is stored in the first on-chip cache circuit, and the receiving data address is an address in the second on-chip cache circuit 1021 of the second computing carrier 102 among the plurality of computing carriers. That is, the data transmission instruction instructs the on-chip storage data path control unit 110 to transfer the target data cached in the first on-chip cache circuit 1011 to the second on-chip cache circuit 1021, thereby determining that the computing carrier with which the first computing carrier 101 is to perform data transmission is the second computing carrier 102.
It can be understood that when the on-chip storage data path control circuit 110 receives a data transmission instruction sent by the first on-chip cache circuit 1011, it decodes the data transmission instruction to obtain the sending data address and the receiving data address; then, the on-chip cache circuit data path 121 obtains the target data corresponding to the sending data address and transmits it to the receiving data address, and the second on-chip cache circuit 1021 caches the target data, thereby completing the data transmission between the on-chip cache circuits of the two computing carriers.
The on-chip storage data path control circuit 110 may receive multiple data transmission instructions at the same time; therefore, it needs to determine the transmission order among the data transmission instructions. This application does not limit how the execution order is determined: the priority corresponding to each of the data transmission instructions may be obtained to get multiple priorities, and the execution order of each of the multiple data transmission instructions may be determined according to the multiple priorities.
The priority may be obtained from multiple dimensions, such as the data capacity of the target data, the priority of the target data, the priority of the first on-chip cache circuit, or the remaining memory size.
It can be understood that determining the execution order among the data transmission instructions through the on-chip storage data path control circuit 110, and controlling the on-chip cache circuit data path 121 to perform data transmission according to this execution order, can improve the stability of transmission.
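The priority-based ordering described above can be sketched as a simple arbitration routine. The fields used below (a per-instruction priority and a transfer size as a tie-breaker) are illustrative assumptions; the patent leaves the exact priority dimensions open:

```python
# Hedged sketch of priority-based arbitration of pending data
# transmission instructions: higher priority first, smaller transfer
# first on ties (an assumed tie-breaking rule).
import heapq

class TransferInstruction:
    def __init__(self, name, priority, size_bytes):
        self.name, self.priority, self.size_bytes = name, priority, size_bytes

def execution_order(instructions):
    heap = [(-ins.priority, ins.size_bytes, i, ins.name)
            for i, ins in enumerate(instructions)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[3] for _ in range(len(heap))]

pending = [TransferInstruction("A", 1, 4096),
           TransferInstruction("B", 3, 1024),
           TransferInstruction("C", 3, 512)]
print(execution_order(pending))  # ['C', 'B', 'A']
```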
在一种可能的实施例中,如图2-1e所示,所述片上存储数据通路控制电路110包括指令缓存单元1101、与所述指令缓存单元1101连接的指令译码单元1102以及与所述指令缓存单元1101和所述指令译码单元1102连接的内存管理单元1103,其中:In a possible embodiment, as shown in FIG. 2-1e, the on-chip storage data path control circuit 110 includes an instruction cache unit 1101, an instruction decoding unit 1102 connected to the instruction cache unit 1101, and A memory management unit 1103 connected to the instruction cache unit 1101 and the instruction decoding unit 1102, where:
所述指令缓存单元1101用于缓存所述数据传输指令;The instruction buffer unit 1101 is configured to buffer the data transmission instruction;
所述指令译码单元1102用于对所述数据传输指令进行译码,以得到所述发送数据地址和所述接收数据地址;The instruction decoding unit 1102 is configured to decode the data transmission instruction to obtain the sending data address and the receiving data address;
所述内存管理单元1103用于管理所述数据传输指令。The memory management unit 1103 is configured to manage the data transmission instruction.
It can be understood that the on-chip storage data path control circuit 110 is further divided into the instruction cache unit 1101, the instruction decoding unit 1102, and the memory management unit 1103, which respectively execute the corresponding steps: the data transmission instruction is managed by the memory management unit 1103, fetched directly from the instruction cache unit 1101 when executed, and translated by the instruction decoding unit 1102 to complete the data transmission. In this way, the execution efficiency and the stability of execution are improved.
进一步的,如图2-1f所示,所述内存管理单元1103包括地址映射模块11031、请求仲裁模块11032和一致性控制模块11033,其中:Further, as shown in FIG. 2-1f, the memory management unit 1103 includes an address mapping module 11031, a request arbitration module 11032, and a consistency control module 11033, where:
所述地址映射模块11031用于确定所述接收数据地址对应的所述第二片上缓存电路;The address mapping module 11031 is configured to determine the second on-chip cache circuit corresponding to the received data address;
所述请求仲裁模块11032用于若所述指令缓存单元包括多个所述数据传输指令,则分配所述多个数据传输指令中每一数据传输指令的执行顺序;The request arbitration module 11032 is configured to allocate an execution order of each data transmission instruction in the plurality of data transmission instructions if the instruction cache unit includes a plurality of the data transmission instructions;
所述一致性控制模块11033用于保证数据传输一致性。The consistency control module 11033 is configured to ensure consistency of data transmission.
It can be understood that the memory management unit 1103 is further divided into the address mapping module 11031, the request arbitration module 11032, and the consistency control module 11033, which respectively execute the corresponding steps: the location where the target data is to be cached is determined by the address mapping module 11031, and the execution order of the data transmission instructions is determined by the request arbitration module 11032, with the on-chip cache circuit data path 121 controlled to perform data transmission in that order, which can improve the stability of transmission. Moreover, the consistency control module 11033 guarantees data transmission consistency, which improves the stability of transmission and the security of execution.
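The address-mapping step (determining which on-chip cache circuit a receiving data address belongs to) can be sketched as a window lookup. The window layout below is an illustrative assumption; the patent does not specify how addresses are partitioned among carriers:

```python
# Hedged sketch of address mapping: each computing carrier's on-chip
# cache is assumed to own a contiguous address window.
WINDOWS = {
    "carrier_1_cache": (0x0000, 0x3FFF),
    "carrier_2_cache": (0x4000, 0x7FFF),
}

def map_receive_address(addr):
    for cache, (lo, hi) in WINDOWS.items():
        if lo <= addr <= hi:
            return cache
    raise ValueError(f"address {addr:#x} not mapped to any on-chip cache")

print(map_receive_address(0x4100))  # carrier_2_cache
```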
In one embodiment, as shown in FIG. 2-1, the computing device 100 further includes a peripheral component interconnect express (PCIE) data path 122 connected to each off-chip cache circuit, configured to implement data transmission between the off-chip cache circuits of any two computing carriers among the plurality of computing carriers.
It can be seen that the off-chip storage data of the computing carriers can be exchanged directly through the PCIE data path 122; that is, off-chip cached data is exchanged through the dedicated off-chip storage data path 122 to support larger-scale machine learning operations. Moreover, connection to various types of servers through the PCIE interface is also possible, which improves transmission efficiency.
Please refer to FIG. 2-2, which is a schematic flowchart of a data transmission method proposed by this application. The data transmission method is applied to the computing device shown in FIG. 2-1; that is, the computing device includes a plurality of computing carriers, an on-chip storage data path control circuit connected to the on-chip cache circuit of each of the plurality of computing carriers, and an on-chip storage data path connected to the on-chip storage data path control circuit. Specifically, as shown in FIG. 2-2:
S201:通过片上存储数据通路控制电路接收多个计算载体中的第一计算载体的第一片上缓存电路发送的数据传输指令。S201: Receive a data transmission instruction sent by a first on-chip cache circuit of a first computing carrier in a plurality of computing carriers through an on-chip storage data path control circuit.
S202:通过所述片上存储数据通路控制电路对所述数据传输指令进行译码,以得到发送数据地址和接收数据地址。S202: Decode the data transmission instruction through the on-chip storage data path control circuit to obtain a sending data address and a receiving data address.
S203:通过片上缓存电路数据通路根据所述发送数据地址获取目标数据,并将所述目标数据传输至所述接收数据地址。S203: Obtain target data according to the sending data address through an on-chip buffer circuit data path, and transmit the target data to the receiving data address.
其中,所述接收数据地址为所述多个计算载体中的第二计算载体的第二片上缓存电路中的一个地址。The received data address is an address in a second on-chip cache circuit of a second computing carrier of the plurality of computing carriers.
It can be understood that the on-chip storage data path control circuit receives the data transmission instruction sent by the first on-chip cache circuit of the first computing carrier among the plurality of computing carriers, and then decodes the data transmission instruction to obtain the sending data address and the receiving data address; the on-chip cache circuit data path obtains the target data according to the sending data address and transmits the target data to the receiving data address. In this way, the data transmission efficiency between different computing carriers can be improved, which helps improve the efficiency of neural network operations.
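Steps S201 to S203 above can be modeled end to end as a toy sketch. The instruction layout (a dict with `src`, `dst`, and `size`) is an illustrative assumption; the patent only specifies that decoding yields a sending data address and a receiving data address:

```python
# Hedged end-to-end model of S201-S203 over a toy flat address space.
def decode(instruction):
    # S202: decode into sending address, receiving address, and size
    return instruction["src"], instruction["dst"], instruction["size"]

def transfer(mem, instruction):
    # S201: the control circuit receives the instruction
    src, dst, size = decode(instruction)
    # S203: the data path fetches target data at src and writes it to dst
    mem[dst:dst + size] = mem[src:src + size]
    return mem

memory = list(range(8))
transfer(memory, {"src": 0, "dst": 4, "size": 2})
print(memory)  # [0, 1, 2, 3, 0, 1, 6, 7]
```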
In a possible embodiment, the on-chip storage data path control circuit includes an instruction cache unit, an instruction decoding unit connected to the instruction cache unit, and a memory management unit connected to the instruction cache unit and the instruction decoding unit; the decoding of the data transmission instruction through the on-chip storage data path control circuit to obtain the sending data address and the receiving data address includes:
通过所述指令译码单元对所述数据传输指令进行译码,以得到所述发送数据地址和所述接收数据地址;Decoding the data transmission instruction by the instruction decoding unit to obtain the sending data address and the receiving data address;
所述方法还包括:The method further includes:
通过所述指令缓存单元缓存所述数据传输指令;Buffering the data transmission instruction by the instruction buffer unit;
通过所述内存管理单元管理所述数据传输指令。The data transmission instruction is managed by the memory management unit.
In a possible embodiment, the memory management unit includes an address mapping module, a request arbitration module, and a consistency control module. Managing the data transmission instruction by the memory management unit includes:
determining, by the address mapping module, the second on-chip cache circuit corresponding to the receive data address;
if the instruction cache unit holds a plurality of the data transmission instructions, determining, by the request arbitration module, an execution order for each of the plurality of data transmission instructions;
ensuring data transmission consistency by the consistency control module.
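The two concrete roles above, address mapping and request arbitration, can be sketched as follows. The interval-based address map and the FIFO arbitration policy are assumptions for illustration; the application does not fix a specific mapping scheme or arbitration policy.

```python
# Sketch of the memory management unit's roles: address mapping picks the
# destination cache circuit for a receive address, and request arbitration
# fixes an execution order when several transmission instructions are
# pending. FIFO (oldest-first) arbitration is an assumed policy.

def map_address(recv_addr, cache_ranges):
    """Return the cache circuit whose address range contains recv_addr."""
    for cache_id, (lo, hi) in cache_ranges.items():
        if lo <= recv_addr < hi:
            return cache_id
    raise ValueError("address not mapped to any cache circuit")

def arbitrate(pending):
    """Order pending transmission instructions, oldest arrival first."""
    return sorted(pending, key=lambda ins: ins["arrival"])

ranges = {"cache1": (0x0000, 0x1000), "cache2": (0x1000, 0x2000)}
print(map_address(0x1800, ranges))          # -> cache2
queue = [{"id": "b", "arrival": 2}, {"id": "a", "arrival": 1}]
print([i["id"] for i in arbitrate(queue)])  # -> ['a', 'b']
```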
In a possible embodiment, the computing device further includes a Peripheral Component Interconnect Express (PCIE) bus data path, and the method further includes:
implementing data transmission between the off-chip cache circuits of any two of the plurality of computing carriers through the PCIE data path.
In a possible embodiment, the plurality of computing carriers include at least two of a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a coarse-grained reconfigurable array (CGRA), or a digital signal processor (DSP).
In a possible embodiment, the computing carrier includes at least one computing unit, and the computing unit includes a main processing circuit, branch processing circuits, and basic processing circuits, the main processing circuit being connected to the branch processing circuits and the basic processing circuits being connected to the branch processing circuits. The method further includes:
acquiring, by the main processing circuit, data from outside the computing unit, and dividing the data into broadcast data and distribution data;
sending, by the main processing circuit, the broadcast data to all branch processing circuits in a broadcast manner, and selectively distributing the distribution data to different branch processing circuits;
forwarding, by the branch processing circuits, data between the main processing circuit and the basic processing circuits;
receiving, by the basic processing circuits, the broadcast data and distribution data forwarded by the branch processing circuits, performing operations on the broadcast data and distribution data to obtain operation results, and sending the operation results to the branch processing circuits;
receiving, by the main processing circuit, the operation results of the basic processing circuits forwarded by the branch processing circuits, and processing these results to obtain a final operation result.
In a possible embodiment, sending, by the main processing circuit, the broadcast data to all branch processing circuits in a broadcast manner includes:
sending, by the main processing circuit, the broadcast data to all branch processing circuits in one broadcast or in multiple broadcasts.
Performing, by the basic processing circuits, operations on the broadcast data and the distribution data to obtain operation results includes:
performing, by the basic processing circuits, an inner product operation, a product operation, or a vector operation on the broadcast data and the distribution data to obtain the operation results.
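The broadcast/distribute/inner-product scheme above can be illustrated with a matrix-vector product: the vector is the broadcast data sent to every basic processing circuit, the matrix rows are the distribution data, and each basic circuit returns one inner product. The flat function below is a behavioral sketch only; the branch-circuit forwarding hierarchy is collapsed for brevity.

```python
# Behavioral sketch of the main/branch/basic processing scheme for a
# matrix-vector product. In hardware the rows would be processed by the
# basic circuits in parallel; here they are looped over sequentially.

def compute_unit(matrix, vector):
    broadcast = vector        # broadcast data: sent to all basic circuits
    distributed = matrix      # distribution data: one row per circuit
    results = []
    for row in distributed:   # each basic circuit computes an inner product
        inner = sum(a * b for a, b in zip(row, broadcast))
        results.append(inner) # result forwarded back via a branch circuit
    return results            # main circuit collects and processes results

out = compute_unit([[1, 2], [3, 4]], [10, 20])
print(out)  # -> [50, 110]
```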
In an embodiment of the present invention, a computing device is provided, including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor to perform the implementations described in the data transmission method.
In another embodiment of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to perform the implementations described in the data transmission method.
The present application also discloses a combined processing device, which includes the above computing device, a universal interconnection interface, and other processing devices. The machine learning operation device interacts with the other processing devices to jointly complete operations specified by the user. FIG. 2-3 is a schematic structural diagram of the combined processing device.
The other processing devices include one or more processor types among general-purpose or special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning operation device and external data and control, performing data transfer and basic control such as starting and stopping the machine learning operation device; the other processing devices may also cooperate with the machine learning operation device to jointly complete operation tasks.
The universal interconnection interface is used to transmit data and control instructions between the machine learning operation device and the other processing devices. The machine learning operation device obtains required input data from the other processing devices and writes it to an on-chip storage device of the machine learning operation device; it may obtain control instructions from the other processing devices and write them to an on-chip control cache of the machine learning operation device; it may also read data from a storage module of the machine learning operation device and transmit the data to the other processing devices.
Optionally, the combined processing device shown in FIG. 2-3 may further include a storage device connected to the machine learning operation device and the other processing devices, respectively. The storage device is used to store data of the machine learning operation device and the other processing devices, and is especially suitable for data to be operated on that cannot be fully stored in the internal storage of the machine learning operation device or the other processing devices.
The combined processing device can serve as a system-on-chip (SOC) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control portion, increasing processing speed, and reducing overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a WiFi interface.
In some embodiments, a chip is also claimed, which includes the above machine learning operation device or combined processing device.
In some embodiments, a chip package structure is claimed, which includes the above chip.
In some embodiments, a board card is claimed, which includes the above chip package structure. Referring to FIG. 2-4, a board card is provided; in addition to the above chip, the board card may further include other supporting components, including but not limited to a storage device, an interface device, and a control device.
The storage device is connected to the chip in the chip package structure through a bus and is used to store data. The storage device may include multiple groups of storage units, each group being connected to the chip through a bus. It can be understood that each group of storage units may be double data rate synchronous dynamic random access memory (DDR SDRAM).
DDR can double the speed of SDRAM without raising the clock frequency, because it allows data to be read on both the rising and falling edges of the clock pulse; DDR is thus twice as fast as standard SDRAM. In one embodiment, the storage device may include four groups of storage units, and each group may include a plurality of DDR4 chips (dies). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical data transmission bandwidth can reach 25600 MB/s.
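The 25600 MB/s figure follows directly from the DDR4-3200 transfer rate and the 64-bit data width (the 8 ECC bits carry no payload), which can be verified with a short calculation:

```python
# Verify the theoretical DDR4-3200 bandwidth stated in the text:
# transfers per second times payload bytes per transfer.
transfers_per_second = 3200 * 10**6  # DDR4-3200: 3200 MT/s
bytes_per_transfer = 64 // 8         # 64 data bits = 8 bytes (8 ECC bits excluded)
bandwidth_mb_s = transfers_per_second * bytes_per_transfer // 10**6
print(bandwidth_mb_s)  # -> 25600
```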
In one embodiment, each group of storage units includes a plurality of DDR SDRAM devices arranged in parallel. DDR can transfer data twice within one clock cycle. A controller for controlling the DDR is provided in the chip, used to control the data transmission and data storage of each storage unit.
The interface device is electrically connected to the chip in the chip package structure. The interface device is used to implement data transmission between the chip and an external device (for example, a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface; the data to be processed is transferred from the server to the chip through the standard PCIE interface. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may be another interface; the present application does not limit the specific form of such other interfaces, as long as the interface unit can implement the transfer function. In addition, the operation results of the chip are transmitted back to the external device (for example, a server) by the interface device.
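The PCIE 3.0 x16 bandwidth figure can likewise be checked: 8 GT/s per lane across 16 lanes with 128b/130b line encoding gives roughly 15754 MB/s, which the text rounds to 16000 MB/s:

```python
# Check the theoretical PCIe 3.0 x16 bandwidth cited in the text.
lanes = 16
gigatransfers = 8 * 10**9  # PCIe 3.0: 8 GT/s per lane
efficiency = 128 / 130     # 128b/130b line encoding overhead
bandwidth_mb_s = lanes * gigatransfers * efficiency / 8 / 10**6
print(round(bandwidth_mb_s))  # -> 15754, commonly rounded to 16000 MB/s
```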
The control device is electrically connected to the chip and is used to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads; therefore, the chip can be in different working states such as heavy load and light load. Through the control device, the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip can be regulated.
In some embodiments, an electronic device is claimed, which includes the above board card.
The electronic device includes a data processing device, robot, computer, printer, scanner, tablet, smart terminal, mobile phone, dashboard camera, navigator, sensor, webcam, server, cloud server, camera, video camera, projector, watch, headset, mobile storage, wearable device, vehicle, household appliance, and/or medical device.
The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance instruments, B-mode ultrasound instruments, and/or electrocardiographs.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combined actions. However, those skilled in the art should be aware that the present application is not limited by the described order of actions, because according to the present application, certain steps may be performed in another order or simultaneously. Moreover, those skilled in the art should also be aware that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division into units is only a logical functional division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connections shown or discussed may be indirect coupling or communication connections through interfaces, devices, or units, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
A person of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments may be completed by a program instructing the relevant hardware; the program may be stored in a computer-readable memory, which may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application; the descriptions of the above embodiments are only intended to help understand the method and core idea of the present application. Meanwhile, a person of ordinary skill in the art may, based on the idea of the present application, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (26)

  1. A computing device, characterized in that the computing device comprises: X groups of neural network chips, wherein each of the X groups of neural network chips comprises one master chip and at least one slave chip, the master chip is connected to the slave chip, the master chips of the X groups of neural network chips are connected to one another, and X is an integer greater than or equal to 2;
    each neural network chip in the X groups of neural network chips is configured to obtain input data and weights, and perform operations on the weights and the input data corresponding to that neural network chip to obtain an operation result, wherein the input data obtained by each neural network chip is different and the obtained weights are the same;
    a first master chip in a first group of the X groups of neural network chips is configured to receive the operation results of the slave chips connected to the first master chip;
    the first master chip is configured to share the operation result of the first master chip and the received operation results of the slave chips with the master chips of the other groups of neural network chips, and to receive the operation results shared by the master chips of the other groups of neural network chips.
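The gather-and-share pattern of claim 1 resembles a two-level all-gather: each master first collects its slaves' results, then the masters exchange those collections so every master holds all results. A minimal behavioral sketch, in which the groups and chips are modeled simply as nested lists (an assumption for illustration, not the hardware interconnect):

```python
# Sketch of claim 1's result sharing: within each group the master gathers
# the results of its slaves, then the masters share the gathered results
# so every master ends up holding all X groups' results.

def share_results(groups):
    """groups: list of lists; groups[g] holds group g's chip results,
    index 0 being the master chip's own result, the rest the slaves'."""
    gathered = [list(g) for g in groups]       # each master gathers its group
    shared = []
    for _ in groups:                           # each master receives from all
        all_results = [r for g in gathered for r in g]
        shared.append(all_results)             # every master holds everything
    return shared

out = share_results([[1, 2], [3, 4]])
print(out[0])  # -> [1, 2, 3, 4]; both masters hold every result
```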
  2. The device according to claim 1, characterized in that the first master chip is further configured to:
    transmit all operation results in the first master chip to the slave chips connected to the first master chip.
  3. The device according to claim 1 or 2, characterized in that the master chip is connected to the slave chips through a tree structure, the tree structure is an n-ary tree structure, the master chip is the root node of the n-ary tree structure, and the slave chips are child nodes of the n-ary tree structure, the child nodes being either one level or multiple levels of child nodes.
  4. The device according to claim 1, characterized in that the neural network chip comprises: an operation unit and a controller unit; the operation unit comprises: one main processing circuit and a plurality of slave processing circuits;
    the controller unit is configured to obtain input data and a computation instruction;
    the controller unit is further configured to parse the computation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;
    the main processing circuit is configured to perform preamble processing on the input data and to transmit data and operation instructions with the plurality of slave processing circuits;
    the plurality of slave processing circuits are configured to perform intermediate operations in parallel according to the data and operation instructions transmitted from the main processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the main processing circuit;
    the main processing circuit is configured to perform subsequent processing on the plurality of intermediate results to obtain the operation result of the computation instruction.
  5. The device according to claim 4, characterized in that the neural network chip further comprises: a storage unit and a direct memory access unit, the storage unit comprising any combination of a register and a cache;
    the cache is configured to store the input data;
    the register is configured to store scalar data in the input data;
    the cache comprises a scratchpad cache;
    the controller unit comprises: an instruction cache unit, an instruction processing unit, and a store queue unit;
    the instruction cache unit is configured to store computation instructions associated with the artificial neural network operation;
    the instruction processing unit is configured to parse the computation instruction to obtain a plurality of operation instructions;
    the store queue unit is configured to store an instruction queue, the instruction queue comprising: a plurality of operation instructions or computation instructions to be executed in the order of the queue;
    the controller unit comprises: a dependency relationship processing unit;
    the dependency relationship processing unit is configured to determine whether a first operation instruction is associated with a zeroth operation instruction preceding the first operation instruction; if the first operation instruction is associated with the zeroth operation instruction, the first operation instruction is cached in the instruction storage unit, and after the zeroth operation instruction has finished executing, the first operation instruction is extracted from the instruction storage unit and transmitted to the operation unit;
    determining whether the first operation instruction is associated with the zeroth operation instruction preceding the first operation instruction comprises:
    extracting a first storage address interval of the data required by the first operation instruction according to the first operation instruction, and extracting a zeroth storage address interval of the data required by the zeroth operation instruction according to the zeroth operation instruction; if the first storage address interval overlaps with the zeroth storage address interval, determining that the first operation instruction is associated with the zeroth operation instruction; if the first storage address interval does not overlap with the zeroth storage address interval, determining that the first operation instruction is not associated with the zeroth operation instruction.
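The address-interval dependency test of claim 5 can be sketched directly: two instructions are associated exactly when their required-data storage address intervals overlap. Modeling the intervals as half-open `[start, end)` pairs is an assumption for illustration:

```python
# Dependency check per claim 5: the first and zeroth operation
# instructions are associated iff their storage address intervals
# (modeled here as half-open [lo, hi) ranges) overlap.

def intervals_overlap(first, zeroth):
    (a_lo, a_hi), (b_lo, b_hi) = first, zeroth
    return a_lo < b_hi and b_lo < a_hi

# overlapping intervals -> associated, instruction must wait
print(intervals_overlap((0x100, 0x200), (0x180, 0x280)))  # -> True
# disjoint intervals -> not associated, may issue
print(intervals_overlap((0x100, 0x200), (0x200, 0x300)))  # -> False
```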
  6. The device according to claim 4 or 5, characterized in that when the neural network chip is a master chip, the controller unit further comprises a scheduling unit, specifically configured to:
    schedule the operation results in the master chip.
  7. The device according to claim 6, characterized in that the scheduling of the operation results in the master chip comprises:
    scheduling 1/(Y+1) of the operation content from each master chip in the X groups of neural network chips to the master chip connected to it in the same direction, wherein the same direction includes a clockwise direction or a counterclockwise direction, and Y is the number of slave chips connected to a master chip in the X groups of neural network chips.
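Under one reading of claim 7 (an assumption, since the claim wording is terse), each master chip forwards 1/(Y+1) of its operation content to its neighbor in a fixed ring direction, Y being the number of slaves per group. A minimal sketch under that reading, assuming a ring of masters and integer-divisible workloads:

```python
# Assumed interpretation of claim 7: each master passes 1/(Y+1) of its
# operation content to the next master in one fixed (e.g. clockwise)
# direction around the ring of masters.

def schedule_ring(workloads, y):
    """workloads[i]: work units at master i; each master forwards
    1/(y+1) of its work to master (i + 1) mod X."""
    x = len(workloads)
    forwarded = [w // (y + 1) for w in workloads]
    return [workloads[i] - forwarded[i] + forwarded[(i - 1) % x]
            for i in range(x)]

print(schedule_ring([12, 12, 12], y=3))  # -> [12, 12, 12] (balanced load stays put)
print(schedule_ring([16, 8, 12], y=3))   # each master forwards a quarter onward
```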
  8. The device according to any one of claims 4-7, characterized in that
    the main processing circuit is specifically configured to combine and sort the intermediate results sent by the plurality of slave processing circuits to obtain the result of the computation instruction;
    or the main processing circuit is specifically configured to combine, sort, and perform activation processing on the intermediate results sent by the plurality of slave processing circuits to obtain the result of the computation instruction.
  9. The device according to claim 8, characterized in that the main processing circuit comprises: one or any combination of a conversion processing circuit, an activation processing circuit, and an addition processing circuit;
    the conversion processing circuit is configured to perform the preamble processing on the data, specifically: performing an interchange between a first data structure and a second data structure on the data or intermediate results received by the main processing circuit; or performing an interchange between a first data type and a second data type on the data or intermediate results received by the main processing circuit;
    the activation processing circuit is configured to perform the subsequent processing, specifically an activation operation on the data in the main processing circuit;
    the addition processing circuit is configured to perform the subsequent processing, specifically an addition operation or an accumulation operation.
  10. The device according to claim 9, characterized in that the slave processing circuit comprises: a multiplication processing circuit;
    the multiplication processing circuit is configured to perform a product operation on the received data block to obtain a product result.
  11. The device according to claim 10, characterized in that the slave processing circuit further comprises: an accumulation processing circuit, the accumulation processing circuit being configured to perform an accumulation operation on the product result to obtain the intermediate result.
  12. A combined computing device, characterized in that the combined computing device comprises: M computing devices according to claim 1, the M computing devices according to claim 1 being connected to one another, and M is an integer greater than or equal to 2.
  13. The combined computing device according to claim 12, characterized in that the connection between the M computing devices according to claim 1 comprises:
    for each of the M computing devices according to claim 1, the master chip of one group of neural network chips among its X groups of neural network chips is configured to connect to the master chip of one group of neural network chips among the X groups of neural network chips of another computing device.
  14. A computing method for executing a machine learning model, characterized in that the computing method is applied to a computing device, the computing device comprising X groups of neural network chips, wherein each group of the X groups of neural network chips comprises one master chip and at least one slave chip, the master chip is connected to the slave chips, the master chips of the X groups of neural network chips are connected to one another, and X is an integer greater than or equal to 2;
    each neural network chip in the X groups of neural network chips is configured to obtain input data and weights, and to perform an operation on the weights and the input data corresponding to that neural network chip to obtain an operation result, wherein the input data obtained by each neural network chip is different while the weights obtained are the same;
    a first master chip in a first group of the X groups of neural network chips is configured to receive the operation results of the slave chips connected to the first master chip and, combining them with the operation result of the first master chip, obtain a first group of operation results;
    the first master chip is configured to share the operation result of the first master chip and the received operation results of the slave chips with the master chips of the other groups of neural network chips, and to receive the operation results shared by the master chips of the other groups of neural network chips.
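The data-parallel scheme of claim 14 — same weights on every chip, different input slices, group results gathered by each master and then exchanged among the masters — can be sketched as follows. This sketch is illustrative only and forms no part of the claims; `chip_compute` is a hypothetical stand-in for one chip's operation, here a dot product.

```python
def chip_compute(input_slice, weights):
    # Stand-in for one neural-network chip's operation (here: a dot product
    # of its input slice with the shared weights).
    return sum(x * w for x, w in zip(input_slice, weights))

def run_groups(groups, weights):
    # groups: one list of input slices per chip group; slice 0 belongs to
    # the group's master chip, the remaining slices to its slave chips.
    group_results = []
    for slices in groups:
        master_result = chip_compute(slices[0], weights)
        slave_results = [chip_compute(s, weights) for s in slices[1:]]
        # The master combines its own result with its slaves' results.
        group_results.append([master_result] + slave_results)
    # The masters share their group results with every other master, so each
    # master ends up holding every chip's operation result.
    shared = [r for group in group_results for r in group]
    return shared
```

With two groups (a master with one slave, and a lone master), every master ends up with all three partial results.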
  15. The method according to claim 14, wherein the first master chip is further configured to:
    transmit all operation results in the first master chip to the slave chips connected to the first master chip.
  16. The method according to claim 14 or 15, wherein the master chip is connected to the slave chips through a tree structure, the tree structure is an n-ary tree structure, the master chip is the root node of the n-ary tree structure, and the slave chips are child nodes of the n-ary tree structure, a child node being either a first-level child node or a multi-level child node.
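The claim-16 topology places the master chip at the root of an n-ary tree with slave chips as nodes one or more levels below it. A minimal, non-claim sketch of such a tree, with a recursive walk showing one way results could funnel from the whole subtree back to the root:

```python
class ChipNode:
    # One chip in the n-ary tree: the root models the master chip, every
    # other node a slave chip (first-level or deeper).
    def __init__(self, result, children=None):
        self.result = result
        self.children = children or []

def collect(node):
    # Gather this chip's result followed by the results of its entire
    # subtree, depth-first.
    results = [node.result]
    for child in node.children:
        results.extend(collect(child))
    return results
```

A master with two first-level slaves, one of which has a second-level slave, yields all four results at the root.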
  17. The method according to claim 16, wherein the neural network chip comprises an operation unit and a controller unit, and the operation unit comprises one master processing circuit and a plurality of slave processing circuits;
    the controller unit is configured to obtain input data and a calculation instruction;
    the controller unit is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and to send the plurality of operation instructions and the input data to the master processing circuit;
    the master processing circuit is configured to perform pre-processing on the input data and to transfer data and operation instructions with the plurality of slave processing circuits;
    the plurality of slave processing circuits are configured to perform intermediate operations in parallel according to the data and operation instructions transferred from the master processing circuit to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the master processing circuit;
    the master processing circuit is configured to perform subsequent processing on the plurality of intermediate results to obtain the operation result of the calculation instruction.
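The claim-17 pipeline inside one chip — master pre-processes, slaves compute intermediate results in parallel, master post-processes — can be sketched as below. This is an illustrative sketch, not the claimed circuitry; the doubling pre-process and the summations merely stand in for unspecified operations.

```python
def run_instruction(input_data, num_slaves):
    # Master processing circuit: pre-processing (illustrative: scale by 2).
    pre = [x * 2 for x in input_data]
    # Distribute the pre-processed data across the slave processing circuits.
    chunks = [pre[i::num_slaves] for i in range(num_slaves)]
    # Slave processing circuits: intermediate operations in parallel
    # (modelled sequentially here), one intermediate result per slave.
    intermediates = [sum(chunk) for chunk in chunks]
    # Master processing circuit: subsequent processing on the intermediates
    # yields the operation result of the calculation instruction.
    return sum(intermediates)
```

For input `[1, 2, 3]` and two slaves, the slaves produce intermediates 8 and 4, and the master's subsequent processing returns 12.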
  18. The method according to claim 17, wherein the computing device further comprises a storage unit and a direct memory access unit, and the storage unit comprises any combination of a register and a cache;
    the cache is configured to store the input data;
    the register is configured to store scalar data in the input data;
    the cache comprises a scratchpad cache;
    the controller unit comprises an instruction cache unit, an instruction processing unit, and a storage queue unit;
    the instruction cache unit is configured to store calculation instructions associated with the artificial neural network operation;
    the instruction processing unit is configured to parse the calculation instruction to obtain a plurality of operation instructions;
    the storage queue unit is configured to store an instruction queue, the instruction queue comprising a plurality of operation instructions or calculation instructions to be executed in the order of the queue;
    the controller unit comprises a dependency processing unit;
    the dependency processing unit is configured to determine whether a first operation instruction is associated with a zeroth operation instruction preceding the first operation instruction; if the first operation instruction is associated with the zeroth operation instruction, to cache the first operation instruction in the instruction storage unit, and after the zeroth operation instruction has finished executing, to fetch the first operation instruction from the instruction storage unit and transmit it to the operation unit;
    wherein determining whether the first operation instruction is associated with the zeroth operation instruction preceding the first operation instruction comprises:
    extracting, according to the first operation instruction, a first storage address interval of the data required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the data required by the zeroth operation instruction; if the first storage address interval overlaps the zeroth storage address interval, determining that the first operation instruction is associated with the zeroth operation instruction; and if the first storage address interval does not overlap the zeroth storage address interval, determining that the first operation instruction is not associated with the zeroth operation instruction.
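The dependency test in claim 18 reduces to an interval-overlap check on the storage address ranges of the data each instruction needs. A minimal sketch, outside the claims, with the instruction record format (a dict carrying an `addr` interval) assumed purely for illustration:

```python
def intervals_overlap(a, b):
    # Closed address intervals (start, end): they overlap unless one ends
    # strictly before the other begins.
    return not (a[1] < b[0] or b[1] < a[0])

def has_dependency(first_instr, zeroth_instr):
    # The first instruction is associated with (depends on) the zeroth
    # instruction exactly when their storage address intervals overlap;
    # a dependent instruction would be held back until the zeroth finishes.
    return intervals_overlap(first_instr["addr"], zeroth_instr["addr"])
```

Overlapping intervals such as (0, 9) and (5, 14) mark the instructions as associated; disjoint intervals such as (0, 4) and (5, 9) mark them as independent, so the first instruction may issue without waiting.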
  19. The method according to claim 18, wherein when the neural network chip is a master chip, the controller unit further comprises a scheduling unit specifically configured to:
    schedule the operation results in the master chip.
  20. The method according to claim 19, wherein scheduling the operation results in the master chip comprises:
    scheduling 1/(Y+1) of the operation content to the master chip connected in a same direction among the master chips of the X groups of neural network chips, wherein the same direction comprises a clockwise direction or a counterclockwise direction, and Y is the number of slave chips connected to a master chip in the X groups of neural network chips.
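One possible reading of claim 20, sketched below and not asserted as the claimed behaviour: with Y slaves per master, each master's work divides into Y+1 shares, and each master dispatches one 1/(Y+1) share to its neighbouring master in a fixed ring direction. All names and the ring model are illustrative assumptions.

```python
def schedule_ring(masters_work, Y):
    # masters_work[i] is master i's total operation content; each master
    # passes a 1/(Y+1) share to the next master in one fixed direction
    # (clockwise here), and receives the share of its other neighbour.
    n = len(masters_work)
    share = [w / (Y + 1) for w in masters_work]
    return [masters_work[i] - share[i] + share[(i - 1) % n]
            for i in range(n)]
```

For three masters with workloads 4, 2, 0 and one slave each (Y = 1), one scheduling round yields 2, 3, 1; the total work is conserved while the imbalance shrinks.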
  21. The method according to any one of claims 18 to 20, wherein the master processing circuit is specifically configured to combine and sort the intermediate results sent by the plurality of slave processing circuits to obtain the result of the calculation instruction;
    or the master processing circuit is specifically configured to combine and sort the intermediate results sent by the plurality of slave processing circuits and perform activation processing to obtain the result of the calculation instruction.
  22. The method according to claim 21, wherein the master processing circuit comprises one of, or any combination of, a conversion processing circuit, an activation processing circuit, and an addition processing circuit;
    the conversion processing circuit is configured to perform the pre-processing on the data, specifically to perform an interchange between a first data structure and a second data structure on the data or intermediate results received by the master processing circuit, or to perform an interchange between a first data type and a second data type on the data or intermediate results received by the master processing circuit;
    the activation processing circuit is configured to perform the subsequent processing, specifically to perform an activation operation on the data in the master processing circuit;
    the addition processing circuit is configured to perform the subsequent processing, specifically to perform an addition operation or an accumulation operation.
  23. The method according to claim 22, wherein the slave processing circuit comprises a multiplication processing circuit;
    the multiplication processing circuit is configured to perform a product operation on a received data block to obtain a product result.
  24. The method according to claim 23, wherein the slave processing circuit further comprises an accumulation processing circuit, the accumulation processing circuit being configured to perform an accumulation operation on the product result to obtain the intermediate result.
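Claims 23–24 give each slave processing circuit a multiplication circuit feeding an accumulation circuit, whose intermediate results the master then post-processes (claim 21). An illustrative, non-claim sketch, with element-wise products and plain sums standing in for the circuits:

```python
def slave_circuit(data_block, weight_block):
    # Multiplication processing circuit: element-wise products over the
    # received data block.
    products = [d * w for d, w in zip(data_block, weight_block)]
    # Accumulation processing circuit: accumulate the product results into
    # this slave's intermediate result.
    return sum(products)

def master_circuit(intermediate_results):
    # Master processing circuit's subsequent processing: combine the
    # slaves' intermediate results (illustrated here as a sum).
    return sum(intermediate_results)
```

A slave given blocks `[1, 2]` and `[3, 4]` produces the intermediate result 11; the master combining intermediates 11 and 5 yields 16.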
  25. A computing method for executing a machine learning model, characterized in that the computing method is applied to a combined computing device, the combined computing device being configured to perform machine learning computations; the combined computing device comprises M computing devices according to claim 1, the M computing devices according to claim 1 are connected to one another, and M is an integer greater than or equal to 2.
  26. The method according to claim 25, wherein the connection among the M computing devices according to claim 1 comprises:
    for each of the M computing devices according to claim 1, the master chip of one group of neural network chips among the X groups of neural network chips it comprises is configured to connect to the master chip of one group of neural network chips among the X groups of neural network chips in another computing device.
PCT/CN2019/108842 2018-09-29 2019-09-29 Computing apparatus and related product WO2020063940A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201811153022.6 2018-09-29
CN201811153022.6A CN110968532B (en) 2018-09-29 2018-09-29 Data transmission method and related product
CN201811207452.1 2018-10-17
CN201811207452.1A CN111062469B (en) 2018-10-17 2018-10-17 Computing device and related product

Publications (1)

Publication Number Publication Date
WO2020063940A1 true WO2020063940A1 (en) 2020-04-02

Family

ID=69950992

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/108842 WO2020063940A1 (en) 2018-09-29 2019-09-29 Computing apparatus and related product

Country Status (1)

Country Link
WO (1) WO2020063940A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786736A (en) * 2014-12-18 2016-07-20 深圳市中兴微电子技术有限公司 Method, chip and device for multi-chip cascading
US20180204118A1 (en) * 2017-01-18 2018-07-19 Hitachi, Ltd. Calculation System and Calculation Method of Neural Network
CN108510064A (en) * 2016-04-18 2018-09-07 中国科学院计算技术研究所 The processing system and method for artificial neural network including multiple cores processing module
CN108549934A (en) * 2018-04-25 2018-09-18 福州瑞芯微电子股份有限公司 A kind of operation method and device based on automated cluster neural network chip group

Similar Documents

Publication Publication Date Title
WO2020078470A1 (en) Network-on-chip data processing method and device
CN112799726B (en) Data processing device, method and related product
CN110750351B (en) Multi-core task scheduler, multi-core task scheduling method, multi-core task scheduling device and related products
CN110968532B (en) Data transmission method and related product
KR102539571B1 (en) Network-on-chip data processing method and device
CN110059797B (en) Computing device and related product
WO2020063940A1 (en) Computing apparatus and related product
CN111209230B (en) Data processing device, method and related product
KR102539573B1 (en) Network-on-chip data processing method and device
KR102539572B1 (en) Network-on-chip data processing method and device
CN111078625B (en) Network-on-chip processing system and network-on-chip data processing method
CN111078624B (en) Network-on-chip processing system and network-on-chip data processing method
CN111078623B (en) Network-on-chip processing system and network-on-chip data processing method
CN111260070B (en) Operation method, device and related product
KR20200139256A (en) Network-on-chip data processing method and device
CN111209245B (en) Data processing device, method and related product
CN111062469B (en) Computing device and related product
CN111210011B (en) Data processing device and related product
CN112394990A (en) Floating point to half precision floating point instruction processing device and method and related products
CN112394993A (en) Half-precision floating point to short shaping instruction processing device and method and related product
CN112394986A (en) Device and method for processing half-precision floating point to floating point instruction and related products
CN112394903A (en) Short shaping to half precision floating point instruction processing device, method and related product
CN117908959A (en) Method for performing atomic operations and related products
CN112394987A (en) Short shaping to half precision floating point instruction processing device, method and related product
CN111047027A (en) Operation method, device and related product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19865073

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 250621)

122 Ep: pct application non-entry in european phase

Ref document number: 19865073

Country of ref document: EP

Kind code of ref document: A1