WO2020063940A1 - Computing apparatus and related product - Google Patents

Computing apparatus and related product

Info

Publication number
WO2020063940A1
WO2020063940A1 (PCT/CN2019/108842)
Authority
WO
WIPO (PCT)
Prior art keywords
chip
processing circuit
neural network
data
instruction
Prior art date
Application number
PCT/CN2019/108842
Other languages
English (en)
Chinese (zh)
Inventor
杜子东
周诗怡
刘少礼
王秉睿
张尧
周徐达
兰慧盈
Original Assignee
上海寒武纪信息科技有限公司 (Shanghai Cambricon Information Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201811153022.6A (CN110968532B)
Priority claimed from CN201811207452.1A (CN111062469B)
Application filed by 上海寒武纪信息科技有限公司 (Shanghai Cambricon Information Technology Co., Ltd.)
Publication of WO2020063940A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/16Handling requests for interconnection or transfer for access to memory bus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present application relates to the field of information processing technology, and in particular, to a computing device and related products.
  • a neural network is a computing model that consists of a large number of nodes (or neurons) connected to each other.
  • Existing neural network operations are implemented on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).
  • Existing training equipment has a slow training speed, and training takes a long time.
  • the embodiments of the present application provide a computing device and related products, which can improve the training speed and efficiency of the training device.
  • a computing device includes: X groups of neural network chips, where each group includes a master chip and at least one slave chip; the master chip is connected to its slave chips, the master chips of the X groups are connected to one another, and the value of X is an integer greater than or equal to 2;
  • each neural network chip in the X groups of neural network chips is configured to obtain input data and weights, and to perform operations on the weights and the input data corresponding to that chip to obtain an operation result, where the input data obtained by each neural network chip is different and the obtained weights are the same;
  • a first master chip in a first group of the X groups is configured to share its own operation result and the received operation results of its slave chips with the master chips of the other groups of neural network chips, and to receive the operation results shared by those master chips.
  • a neural network chip includes: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits;
  • the controller unit is configured to obtain input data and calculation instructions
  • the controller unit is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;
  • the master processing circuit is configured to perform pre-processing on the input data and to exchange data and operation instructions with the plurality of slave processing circuits;
  • the multiple slave processing circuits are configured to perform multiple intermediate operations in parallel according to data transmitted from the master processing circuit and operation instructions to obtain multiple intermediate results, and transmit the multiple intermediate results to the master processing circuit;
  • the main processing circuit is configured to perform subsequent processing on the multiple intermediate results to obtain an operation result of the calculation instruction.
  • a combined computing device includes: M computing devices according to claim 1, connected to one another, where the value of M is an integer greater than or equal to 2.
  • a calculation method for executing a machine learning model is provided, and the calculation method is applied to the calculation device according to the first aspect.
  • a calculation method for executing a machine learning model is provided, and the calculation method is applied to the combination calculation device according to the third aspect.
  • an embodiment of the present application provides a computing device, where the computing device includes multiple computing carriers, an on-chip storage data path control circuit connected to the on-chip cache circuit of each of the multiple computing carriers, and an on-chip storage data path connected to the on-chip storage data path control circuit, wherein:
  • the on-chip storage data path control circuit is configured to receive a data transmission instruction sent by a first on-chip cache circuit of a first computing carrier of the multiple computing carriers, and to decode the data transmission instruction to obtain a sending data address and a receiving data address;
  • the on-chip storage data path is configured to obtain target data according to the sending data address and to transmit the target data to the receiving data address, where the receiving data address is an address in a second on-chip cache circuit of a second computing carrier of the multiple computing carriers.
  • an embodiment of the present application provides a combined processing device, where the combined processing device includes the computing device described in the first aspect, a universal interconnection interface, and other processing devices;
  • the computing device interacts with the other processing devices to jointly complete a computing operation designated by the user.
  • an embodiment of the present application provides a system-on-chip including the computing device according to the first aspect or the combined processing device according to the second aspect.
  • an embodiment of the present application provides a data transmission method, which is applied to a computing device according to the first aspect, and the method includes:
  • an embodiment of the present application provides another computing device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for some or all of the steps described in the fourth aspect.
  • an embodiment of the present application provides a computer-readable storage medium.
  • the computer storage medium stores a computer program, where the computer program includes program instructions that, when executed by a processor, cause the processor to execute the method of the fourth aspect described above.
  • FIG. 1-1a is a schematic diagram of a neural network training device according to an embodiment of the present application.
  • FIG. 1-1b is a schematic diagram of a chip connection structure of a computing device according to an embodiment of the present application.
  • FIG. 1-1c is a schematic diagram of a chip connection structure of another computing device according to an embodiment of the present application.
  • FIG. 1-1d is a schematic diagram of a chip connection structure of another computing device according to an embodiment of the present application.
  • FIG. 1-1e is a schematic structural diagram of a neural network chip according to an embodiment of the present application.
  • FIG. 1-1f is a schematic diagram of a scheduling strategy for a computing result of a main chip according to an embodiment of the present application.
  • FIG. 1-1g is a schematic structural diagram of a combined computing device according to an embodiment of the present application.
  • FIG. 1-2 is a schematic diagram of a combination processing device provided by an embodiment of the present application.
  • FIG. 1-3 is a structural diagram of another combination processing device provided by an embodiment of the present application.
  • FIG. 1-3a is a schematic structural diagram of a board card according to an embodiment of the present application.
  • FIG. 2-1 is a schematic structural diagram of a computing device according to an embodiment of the present application.
  • FIG. 2-1a is a schematic structural diagram of a computing unit according to an embodiment of the present application.
  • FIG. 2-1b is a schematic structural diagram of a main processing circuit according to an embodiment of the present application.
  • FIG. 2-1c is a schematic diagram of data distribution of a computing unit according to an embodiment of the present application.
  • FIG. 2-1d is a schematic diagram of data return of a computing unit according to an embodiment of the present application.
  • FIG. 2-1e is a schematic structural diagram of an on-chip storage data path control circuit according to an embodiment of the present application.
  • FIG. 2-1f is a schematic structural diagram of a memory management unit according to an embodiment of the present application.
  • FIG. 2-3 is a schematic structural diagram of a combination processing device according to an embodiment of the present application.
  • FIG. 2-4 is a schematic structural diagram of a board card according to an embodiment of the present application.
  • an embodiment herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application.
  • the appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are they independent or alternative embodiments that are mutually exclusive with other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.
  • the neural network training device consists of multiple neural network chips; the multiple neural network chips perform multiple tasks, or divide a single task into segments, scheduling and cooperating according to the characteristics of the deep learning algorithm to complete the training task.
  • the arrangement and cooperation of multiple neural network chips in the neural network training device are specifically described in the following embodiments.
  • a training device includes: X groups of neural network chips, each group including a master chip and at least one slave chip, where the master chip is connected to its slave chips and the master chips of the X groups of neural network chips are connected to one another.
  • the value of X is an integer greater than or equal to 2.
  • each neural network chip in the X groups of neural network chips is used to obtain input data and weights, and the weights are operated on with the input data corresponding to that chip to obtain an operation result, where the input data obtained by each chip is different and the obtained weights are the same;
  • the first master chip in the first group of the X groups of neural network chips is used to receive the operation results of the slave chips connected to the first master chip;
  • the first master chip is used to share its own operation result and the received operation results of its slave chips with the master chips of the other groups of neural network chips, and to receive the operation results shared by those master chips.
  • X can be any integer greater than or equal to 2, such as 2, 3, 5, or 8.
  • each group of neural network chips includes a master chip and at least one slave chip, and the number of slave chips in different groups can be the same or different.
  • for example, the master chips of the first two groups of neural network chips can each be connected to 3 slave chips, while the master chip of the last group is connected to 4 slave chips.
  • in an optional embodiment, the slave chips are divided evenly among the master chips, so that each master chip receives the operation results of its slave chips and the operation results can be scheduled quickly between the master chips.
  • FIG. 1-1b is a chip connection structure of a computing device according to an embodiment of the present application.
  • X is 4; chip 4, chip 8, chip 13, and chip 10 are the master chips, and 3 slave chips are connected to each master chip.
  • Chips 1 to 16 all obtain input data and weights, where each chip obtains different input data and the weights are the same, so each chip uses the same training model to train on different input data.
  • the input data of each chip can be data corresponding to multiple tasks, or data sets segmented from the same task.
  • the segmentation of the data set can be completed in an external device, in another module of the computing device, or in the master chip of one of the groups of neural network chips in the computing device.
  • the first master chip is used to receive the operation results of the slave chips connected to the first master chip.
  • the first master chip may be any one of master chip 4, master chip 8, master chip 10, and master chip 13. Each master chip obtains the operation results of the slave chips connected to it, so that finally the operation results held by each master chip are its own operation result and the operation results of its connected slave chips.
  • the operation results held by the master chips are then shared among the X master chips.
  • the operation results are transmitted cyclically in the same direction, for example clockwise, that is: chip 4 → chip 8 → chip 13 → chip 10 → chip 4; or counterclockwise, that is: chip 4 → chip 10 → chip 13 → chip 8 → chip 4.
  • all the operation results held by a master chip can be transferred to the next adjacent master chip at one time, or transferred in multiple steps.
  • this connection structure can, on the one hand, improve data training efficiency through multiple chips, and on the other hand allows the operation results of each slave chip to be scheduled through the master chip, so that only the performance of the master chips needs to be improved rather than the performance of the slave chips, saving cost.
  • in an optional embodiment, the first master chip is further configured to transmit all the operation results in the first master chip to the slave chips connected to the first master chip.
  • after master chip 4, master chip 8, master chip 10, and master chip 13 have completed the shared transfer, they hold the operation results of all the chips; each master chip then passes the operation results it holds to its connected slave chips, so that every slave chip contains the operation results of all chips, as sketched below.
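  • As a minimal sketch (hypothetical names and values, not the patent's API), the flow of FIG. 1-1b can be modeled as X masters in a ring, each first gathering from its Y slaves and then circulating results until every master holds everything:

```python
# Minimal sketch of the data-parallel flow of FIG. 1-1b (illustrative only):
# X master chips in a ring, each holding its own operation result plus the
# results of its Y connected slave chips.

X, Y = 4, 3

# results[m] is what master m holds after gathering from its slaves.
results = [{f"chip{m}.{s}": float(m * 10 + s) for s in range(Y + 1)}
           for m in range(X)]

# Ring sharing in one direction (clockwise): each master forwards everything
# it holds to the next master; after X - 1 rounds every master holds all
# X * (Y + 1) operation results.
for _ in range(X - 1):
    snapshot = [dict(r) for r in results]
    for m in range(X):
        results[(m + 1) % X].update(snapshot[m])

assert all(len(r) == X * (Y + 1) for r in results)

# Each master finally transmits its complete result set to its slave chips,
# so that every slave also holds the operation results of all chips.
```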
  • the master chip is connected to the slave chips through a tree structure; the tree is an n-ary tree, the master chip is the root node of the n-ary tree, and the slave chips are child nodes of the n-ary tree.
  • the child nodes may be one level deep or multiple levels deep.
  • the master chip in each of the X groups of neural network chips can be connected to its slave chips through a tree structure, where the master chip is the root node of the tree, the slave chips are child nodes, and the child nodes can be first-level child nodes or multi-level child nodes.
  • when the master chip obtains the operation results of the slave chips, it can obtain the operation result of each slave chip directly, or the operation results of lower slave chips can be collected by the slave chips directly connected to the master chip and then passed to the master chip.
  • this connection structure can, on the one hand, improve data training efficiency through multiple chips, and on the other hand allows the operation results of each slave chip to be scheduled through the master chip, so that only the performance of the master chips needs to be improved rather than that of the slave chips, saving cost.
  • in addition, because the slave chips are connected to the master chip through a tree structure, the operation results of the slave chips can be integrated before being sent to the master chip, which reduces the processing pressure on the master chip and further reduces wear on the master chip.
  • FIG. 1-1c is another chip connection structure of a computing device provided by an embodiment of the present application.
  • X is 4, and the master chips of the 4 groups of neural network chips are master chip 31, master chip 32, master chip 33, and master chip 34.
  • Each master chip is connected to its slave chips through a tree structure.
  • taking master chip 31 as an example, it is the root node; the slave chips connected to it are chip 311, chip 312, and chip 313, which are first-level child nodes; the slave chips connected to slave chip 311 are chip 3111, chip 3112, and chip 3113, which are second-level child nodes.
  • the other slave chips are likewise first-level or second-level child nodes.
  • FIG. 1-1d is a schematic diagram of a chip connection structure of another computing device according to an embodiment of the present application. As shown in FIG. 1-1d, the master chip is connected to the slave chips through a tree structure that includes three levels of child nodes; the operation results of the lowest-level leaf nodes can be transferred directly to the master chip, or can be integrated by the slave chips at the upper-level child nodes and then transferred to the master chip, as in the sketch below.
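  • A minimal sketch of gathering over such a tree, assuming a simple nested-tuple representation (illustrative only): each slave may integrate its subtree's results before passing them up, so the master handles fewer transfers.

```python
# Sketch of result gathering over the n-ary tree of FIG. 1-1c/1-1d
# (hypothetical structure): the master chip is the root; a slave chip
# integrates its children's results before passing them up.

def gather(node):
    # node = (own_result, [child nodes]); returns all results in this subtree.
    own, children = node
    collected = [own]
    for child in children:
        collected.extend(gather(child))  # child integrates its subtree first
    return collected

# Master 31 with first-level slaves 311-313; slave 311 has second-level
# children 3111-3113, mirroring FIG. 1-1c.
tree = ("r31", [("r311", [("r3111", []), ("r3112", []), ("r3113", [])]),
                ("r312", []), ("r313", [])])
assert gather(tree) == ["r31", "r311", "r3111", "r3112", "r3113", "r312", "r313"]
```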
  • the neural network computing device involved in the embodiment of the present application includes a neural network chip.
  • FIG. 1-1e is a schematic structural diagram of a neural network chip provided by an embodiment of the present application, as shown in FIG. 1-1e.
  • the neural network chip includes: an arithmetic unit 12 and a controller unit 11; the arithmetic unit 12 includes: a master processing circuit 101 and a plurality of slave processing circuits 102;
  • the controller unit 11 is configured to obtain input data and calculation instructions.
  • the input data and calculation instructions may be obtained through a data input and output unit, which may be one or more data I/O interfaces or I/O pins.
  • the above calculation instructions include, but are not limited to, forward operation instructions or backward training instructions, or other neural network operation instructions, such as convolution operation instructions.
  • the specific implementation manner of this application does not limit the specific expressions of the above calculation instructions.
  • the controller unit 11 is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and to send the plurality of operation instructions and the input data to the main processing circuit;
  • the main processing circuit 101 is configured to perform preprocessing on the input data and to exchange data and operation instructions with the plurality of slave processing circuits;
  • the multiple slave processing circuits 102 are used to perform intermediate operations in parallel according to the data and operation instructions transmitted from the main processing circuit to obtain multiple intermediate results, and to transmit the multiple intermediate results to the main processing circuit;
  • the main processing circuit 101 is configured to perform subsequent processing on the multiple intermediate results to obtain the operation result of the calculation instruction.
  • the technical solution provided in this application arranges the operation unit in a master-slave structure.
  • for a forward operation's calculation instruction, the operation unit can split the data so that multiple slave processing circuits perform the computation-heavy part in parallel, thereby increasing the operation speed, saving operation time, and in turn reducing power consumption; a sketch follows.
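  • The master-slave split can be sketched as follows (illustrative Python; `master_circuit` and `slave_circuit` are hypothetical names): the master splits the rows of a weight matrix across slave circuits, each slave computes intermediate inner products, and the master performs the subsequent processing.

```python
# Sketch of the master-slave split described above (illustrative only): the
# master processing circuit splits the heavy part of a forward operation
# (here, the rows of a weight matrix) across slave processing circuits.

def slave_circuit(weight_rows, inputs):
    # Each slave computes inner products for its slice of rows
    # (an "intermediate result").
    return [sum(w * x for w, x in zip(row, inputs)) for row in weight_rows]

def master_circuit(weights, inputs, num_slaves):
    chunk = (len(weights) + num_slaves - 1) // num_slaves
    slices = [weights[i:i + chunk] for i in range(0, len(weights), chunk)]
    intermediates = [slave_circuit(s, inputs) for s in slices]  # parallel on HW
    return [v for part in intermediates for v in part]  # subsequent processing

print(master_circuit([[1, 2], [3, 4], [5, 6], [7, 8]], [1, 1], num_slaves=2))
# [3, 7, 11, 15]
```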
  • the aforementioned neural network chip is specifically used for an artificial neural network operation
  • the aforementioned input data may specifically include input neuron data and weight data.
  • the above operation result may specifically be the result of the artificial neural network operation, that is, output neuron data.
  • the operation in the neural network can be one layer of the neural network.
  • the implementation process is that, in the forward operation, after the artificial neural network operation of the previous layer is completed, the operation instruction of the next layer takes the output neuron calculated in the operation unit as the input neuron of the next layer (or performs some operations on the output neuron before using it as the input neuron of the next layer), and the weight is likewise replaced with the weight of the next layer; in the reverse operation, after the reverse operation of the previous layer is completed, the operation instruction of the next layer takes the input neuron gradient calculated in the operation unit as the output neuron gradient of the next layer (or performs some operations on the input neuron gradient before using it as the output neuron gradient of the next layer), and the weight is replaced with the weight of the next layer.
  • the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and output layer of the entire neural network; rather, for any two adjacent layers in the network, the neurons in the lower layer of the network forward operation are the input neurons, and the neurons in the upper layer of the network forward operation are the output neurons.
  • the aforementioned neural network chip may further include a storage unit 10 and a direct memory access unit 50.
  • the storage unit 10 may include one or any combination of a register 201 and a cache 202; specifically, the cache is used to store the calculation instruction, the register is used to store the input data and scalars, and the cache is a high-speed temporary memory.
  • the direct memory access unit 50 is used to read or store data from the storage unit 10.
  • the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;
  • An instruction storage unit 110 configured to store calculation instructions associated with the artificial neural network operation
  • the instruction processing unit 111 is configured to parse the calculation instruction to obtain multiple operation instructions
  • the storage queue unit 113 is configured to store an instruction queue, where the instruction queue includes a plurality of operation instructions or calculation instructions to be executed in the order of the queue.
  • the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, which is specifically configured to decode instructions into micro instructions.
  • the slave operation processing circuit may also include another controller unit, and the other controller unit includes a slave instruction processing unit, which is specifically configured to receive and process micro instructions.
  • the above microinstruction may be a lower-level instruction of the instruction.
  • the microinstruction may be obtained by splitting or decoding the instruction, and may be further decoded into control signals for each component, each unit, or each processing circuit.
  • controller unit 11 may further include:
  • the dependency relationship processing unit 112 is configured to determine, when there are multiple operation instructions, whether a first operation instruction has an association relationship with a zeroth operation instruction that precedes the first operation instruction; if the first operation instruction is associated with the zeroth operation instruction, the first operation instruction is cached in the instruction storage unit, and after the zeroth operation instruction has finished executing, the first operation instruction is extracted from the instruction storage unit and transmitted to the arithmetic unit;
  • the determining whether there is an association between the first operation instruction and the zeroth operation instruction before the first operation instruction includes: extracting, according to the first operation instruction, a first storage address interval of the data required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the data required by the zeroth operation instruction; if the first storage address interval overlaps the zeroth storage address interval, the two instructions are determined to be associated, and if not, they are determined not to be associated.
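  • A minimal sketch of this overlap check (the interval encoding and names here are assumptions for illustration):

```python
# Hedged sketch: treat two operation instructions as associated when the
# storage address interval required by the first overlaps the interval
# required by the zeroth (assumed encoding; each instruction carries a
# (start, end) address pair).

def intervals_overlap(a_start, a_end, b_start, b_end):
    return a_start < b_end and b_start < a_end

def has_dependency(first_instr, zeroth_instr):
    return intervals_overlap(*first_instr, *zeroth_instr)

# If associated, the first instruction waits in the instruction storage unit
# until the zeroth instruction finishes, then is sent to the arithmetic unit.
assert has_dependency((0x100, 0x200), (0x180, 0x280)) is True
assert has_dependency((0x100, 0x200), (0x300, 0x380)) is False
```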
  • when the neural network chip is a master chip, the controller unit 11 further includes a scheduling unit 114 for scheduling the operation results in the master chip.
  • the master chip in each group of neural network chips needs to schedule operation results, so that all the master chips come to share all the operation results held by each master chip.
  • the scheduling needs to follow a certain scheduling strategy.
  • for example, the operation results of the master chips in the X groups of neural network chips can first be integrated, each master chip integrating its own operation result with the received operation results of its slave chips, to obtain X integrated operation results; the X integrated operation results are then scheduled in the same direction according to the connection order of the master chips, each integrated operation result being dispatched one hop per round, so that after X-1 rounds of dispatch all the master chips hold the X integrated operation results.
  • alternatively, after the X integrated operation results are obtained, they are scheduled in the same direction according to the connection order of the master chips: after the next master chip receives the operation results transmitted by the previous master chip, it integrates the received results with its own operation results to form a new operation result and then passes it on to the next master chip; after 2*(X-1) schedulings, all the master chips have obtained the X integrated operation results. The X master chips can also partially integrate, or not integrate, their operation results, and then perform multiple partial schedulings between the master chips.
  • in an optional embodiment, scheduling the operation results in the master chip includes: the master chips in the X groups of neural network chips, connected in the same direction, each schedule 1/(Y+1) of their operation content at a time, where the same direction is a clockwise direction or a counterclockwise direction, and Y is the number of slave chips connected to each master chip in the X groups of neural network chips.
  • FIG. 1-1f shows an operation result scheduling strategy between master chips provided in an embodiment of the present application.
  • in FIG. 1-1f, which corresponds to FIG. 1-1b, there are 4 groups of neural network chips, whose master chips are chip 4, chip 8, chip 13, and chip 10.
  • the operation results in master chip 4 comprise its own operation result and the received operation results of chip 1, chip 2, and chip 3; denote these four operation results a1, b1, c1, and d1.
  • the operation results of chip 8 correspond to the four parts a2, b2, c2, and d2.
  • the operation results of chip 13 correspond to the four parts a3, b3, c3, and d3.
  • the operation results of chip 10 correspond to the four parts a4, b4, c4, and d4.
  • Scheduling proceeds clockwise:
  • chip 4 dispatches part a1 to chip 8;
  • chip 8 dispatches part b2 to chip 13;
  • chip 13 dispatches part c3 to chip 10;
  • chip 10 dispatches part d4 to chip 4.
  • this scheduling process can be performed at the same time or at different times.
  • this scheduling method can save the integration time of each chip and improve scheduling efficiency.
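  • One such simultaneous round can be sketched as follows (illustrative code mirroring FIG. 1-1f); repeating rounds in the same direction completes the exchange:

```python
# Sketch of one dispatch round in FIG. 1-1f (hypothetical code): the four
# masters form a clockwise ring and each simultaneously forwards one
# 1/(Y+1) = 1/4 share of its operation content to the next master.

ring = ["chip4", "chip8", "chip13", "chip10"]  # clockwise connection order
held = {
    "chip4":  ["a1", "b1", "c1", "d1"],
    "chip8":  ["a2", "b2", "c2", "d2"],
    "chip13": ["a3", "b3", "c3", "d3"],
    "chip10": ["a4", "b4", "c4", "d4"],
}

# chip 4 sends a1, chip 8 sends b2, chip 13 sends c3, chip 10 sends d4.
# All four transfers can proceed simultaneously since each link is distinct.
sends = {m: held[m][i] for i, m in enumerate(ring)}
for i, m in enumerate(ring):
    held[ring[(i + 1) % len(ring)]].append(sends[m])

print(held["chip8"])   # ['a2', 'b2', 'c2', 'd2', 'a1']
print(held["chip4"])   # ['a1', 'b1', 'c1', 'd1', 'd4']
```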
  • in an optional embodiment, the main processing circuit 101 is specifically configured to combine and sort the multiple intermediate results sent by the slave processing circuits 102 to obtain the result of the calculation instruction;
  • alternatively, the main processing circuit 101 is specifically configured to combine and sort the intermediate results sent by the multiple slave processing circuits 102 and to obtain the result of the calculation instruction after activation processing.
  • the main processing circuit includes one or any combination of a conversion processing circuit, an activation processing circuit, and an addition processing circuit;
  • the conversion processing circuit is configured to perform the pre-processing on the data; specifically, to perform an interchange between a first data structure and a second data structure on the data received by the main processing circuit or on an intermediate result, or to perform an interchange between a first data type and a second data type on the data received by the main processing circuit or on an intermediate result;
  • the activation processing circuit is configured to perform the subsequent processing, specifically an activation operation on data in the main processing circuit;
  • the addition processing circuit is configured to perform the subsequent processing, specifically an addition operation or an accumulation operation.
  • the slave processing circuit includes: a multiplication processing circuit
  • the multiplication processing circuit is configured to perform a multiplication operation on a received data block to obtain a multiplication result.
  • the slave processing circuit further includes: an accumulation processing circuit configured to perform an accumulation operation on the product result to obtain the intermediate result.
  • the embodiment of the present application also relates to another combined computing device, where the combined computing device includes M computing devices according to the first embodiment, connected to one another.
  • the value of M is an integer greater than or equal to 2.
  • FIG. 1-1g is a schematic structural diagram of a combination computing device provided by an embodiment of the present application.
  • the combined computing device includes four computing devices as shown in FIG. 1-1b.
  • the four computing devices are connected to one another; they can be bridged through circuits, connected via a dedicated connection module, or connected through the master chips of the four computing devices.
  • this connection structure can, on the one hand, improve data training efficiency through the cooperative operation of multiple chips, and on the other hand allows the operation results of each slave chip to be scheduled through the master chip, so that only the performance of the master chips needs to be improved rather than that of the slave chips, saving cost.
  • in addition, selecting one master chip from the multiple groups of master chips to connect with an external master chip reduces wear on the master chips and extends their service life.
  • the connections between the M computing devices of the first embodiment include: in each of the M computing devices, each of which includes X groups of neural network chips, the master chip of one group of neural network chips is used to connect with the master chip of one group of neural network chips among the X groups of another computing device.
  • as shown in FIG. 1-1g, each of the four computing devices of the first embodiment includes four groups of neural network chips, and the master chip of one group is connected with the master chip of one group among the four groups of another computing device; for example, master chip 502, master chip 507, master chip 512, and master chip 510 are connected.
  • when selecting the master chip in one of the X groups of neural network chips, the selection can be random or follow a selection strategy, such as selecting the master chip with the most slave chips, or selecting the master chip physically closest to the other computing device.
  • in summary, the multiple groups of neural network chips are divided into master chips and slave chips; the master chips obtain the operation results of the slave chips and schedule the operation results between the master chips of different groups, so that each group's master chip holds all the operation results; the master chips then distribute all the operation results to the slave chips, which improves the training speed of the neural network chips and saves training time.
  • FIG. 1-2 is a schematic diagram of a combined processing device.
  • the other processing devices include one or more types of processors, such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processor.
  • the number of processors included in the other processing devices is not limited.
  • the other processing devices serve as the interface between the computing device and external data and control, performing, for example, data handling and basic control of the computing device such as starting and stopping; the other processing devices can also cooperate with the computing device to complete computing tasks.
  • a universal interconnection interface for transmitting data and control instructions between the computing device and other processing devices.
  • the computing device obtains required input data from the other processing devices and writes it to a storage device on the computing device chip; it can obtain control instructions from the other processing devices and write them to a control cache on the computing device chip; it can also read the data in a storage module of the computing device and transmit it to the other processing devices.
  • the structure is shown in FIG. 1-3, and may further include a storage device, and the storage device is connected to the computing device and the other processing devices, respectively.
  • the storage device is used to store data in the computing device and the other processing devices, and is particularly suitable for data that cannot be completely stored in the internal storage of the computing device or other processing devices.
  • the combined processing device can be used as an SOC system-on-chip for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control section, increasing processing speed, and reducing overall power consumption.
  • the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a monitor, a mouse, a keyboard, a network card, or a WiFi interface.
  • in some embodiments, a chip is also disclosed, which includes the above computing device or combined processing device.
  • in some embodiments, a chip packaging structure is disclosed, which includes the above chip.
  • in some embodiments, a board card is disclosed, which includes the above chip packaging structure. Referring to FIG. 1-3a, which provides a board card: in addition to the above chip 389, the board card may also include other supporting components, including but not limited to a storage device 390, an interface device 391, and a control device 392;
  • the storage device 390 is connected to the chip in the chip packaging structure through a bus, and is used to store data.
  • the storage device may include a plurality of groups of storage units 393; each group of storage units is connected to the chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
  • the storage device may include 4 groups of storage units, and each group may include a plurality of DDR4 chips (granules). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking. It can be understood that when DDR4-3200 granules are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s; the quick check below shows the arithmetic.
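  • The quoted figure follows directly from the transfer rate and the data width, as this small check (illustrative, not from the patent) confirms:

```python
# DDR4-3200 performs 3200 mega-transfers per second; each transfer carries
# 64 data bits (the remaining 8 of the 72 controller bits are ECC) = 8 bytes.
transfers_per_second_mega = 3200
data_bytes_per_transfer = 64 // 8
print(transfers_per_second_mega * data_bytes_per_transfer)  # 25600 (MB/s)
```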
  • each group of the storage units includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transfer data twice in one clock cycle.
  • a controller for controlling DDR is provided in the chip, and is used for controlling data transmission and data storage of each of the storage units.
  • the interface device is electrically connected to a chip in the chip package structure.
  • the interface device is used to implement data transmission between the chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to implement data transfer.
  • the interface device may also be other interfaces.
  • the present application does not limit the specific form of the other interfaces described above, as long as the interface unit can implement the transfer function.
  • the operation result of the chip is likewise transmitted back by the interface device to an external device (such as a server).
  • the control device is electrically connected to the chip.
  • the control device is configured to monitor a state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a microcontroller unit (MCU).
  • the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and may drive multiple loads; therefore, the chip can be in different working states such as multi-load and light-load.
  • the control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
  • neural networks are the basis of many current artificial intelligence applications. With the further expansion of the application scope of neural networks, many neural network models and large batches of requests have appeared.
  • the calculation of the neural network can be performed in parallel using heterogeneous computing carriers; therefore, how to improve the data transmission efficiency between heterogeneous computing carriers is a technical problem to be solved by those skilled in the art.
  • the computing device may include various handheld devices with wireless communication functions, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem, and various forms of user equipment (UE), mobile stations (MS), terminal devices, and so on; the computing device may also include a system-on-chip (SOC).
  • the computing carrier may be a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a coarse-grained reconfigurable array (CGRA), a digital signal processor (DSP), or the like.
  • the embodiments of the present application provide a data transmission method and related products, which can improve the data transmission efficiency between different computing carriers and facilitate the improvement of the neural network operation efficiency.
  • the present application will be further described in detail below with reference to specific embodiments and with reference to the drawings.
  • FIG. 2-1 is a schematic structural diagram of a computing device according to an embodiment of the present application.
  • the computing device 100 includes a plurality of computing carriers such as a first computing carrier 101, a second computing carrier 102, and an N-th computing carrier 103.
  • N is a positive integer greater than 2
  • the multiple computing carriers may include at least two of the above-mentioned CPUs, GPUs, ASICs, FPGAs, CGRAs, or DSPs, and may also include two or more carriers of the same type, for example 2 CPUs, 2 GPUs, 1 ASIC, or 1 FPGA.
  • each computing carrier may include at least one computing unit for a neural network operation, such as a processing chip and the like.
  • the specific structure of the computing unit is not limited.
  • FIG. 2-1a is a schematic structural diagram of a computing unit.
  • the computing unit includes: a main processing circuit, basic processing circuits, and branch processing circuits. Specifically, the main processing circuit is connected to the branch processing circuits, and each branch processing circuit is connected to at least one basic processing circuit.
  • the branch processing circuit is used to send and receive data to and from the main processing circuit or the basic processing circuits.
  • FIG. 2-1b is a schematic structural diagram of a main processing circuit.
  • the main processing circuit may include a register and/or an on-chip buffer circuit, and may further include a control circuit, a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, a DMA (Direct Memory Access) circuit, and other circuits; of course, in actual applications, the main processing circuit may also include a conversion circuit (such as a matrix transposition circuit), a data rearrangement circuit, an activation circuit, and the like.
  • the main processing circuit also includes a data sending circuit, a data receiving circuit, or an interface.
  • the data sending circuit can integrate a data distribution circuit and a data broadcasting circuit; the data distribution circuit and the data broadcasting circuit can also be set separately. In actual applications, the data sending circuit and the data receiving circuit may also be integrated together to form a data transceiving circuit.
  • broadcast data: data that needs to be sent to each basic processing circuit.
  • the specific choice of transmission method can be determined by the main processing circuit according to the load and the calculation method.
  • the broadcast transmission method sends data to each basic processing circuit in broadcast form; the broadcast data can be sent to each basic processing circuit in one broadcast or in multiple broadcasts, and the specific implementation of this application does not limit the number of broadcasts. The distribution transmission method, by contrast, selectively sends distribution data to some of the basic processing circuits.
  • when distributing data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits; the data may be the same or different. Specifically, if data is sent in a distributed manner, the data received by each receiving basic processing circuit can be different, although some basic processing circuits may receive the same data;
  • when broadcasting data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits, and each receiving basic processing circuit receives the same data; that is, the broadcast data includes all the data that the basic processing circuits need to receive, while distribution data may include only part of the data that a basic processing circuit needs to receive.
  • the main processing circuit may send the broadcast data to all the branch processing circuits through one or more broadcasts, and the branch processing circuits forward the broadcast data to all the basic processing circuits.
  • the vector operator circuit of the main processing circuit can perform vector operations, including but not limited to: addition, subtraction, multiplication, and division of two vectors; addition, subtraction, multiplication, and division of a vector and a constant; or arbitrary operations on each element of a vector.
  • continuous operations may specifically be addition or subtraction of a vector and a constant, multiplication, division, activation operations, accumulation operations, and the like.
  • Each basic processing circuit may include a basic register and / or a basic on-chip cache circuit; each basic processing circuit may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like.
  • the inner product operator circuit, the vector operator circuit, and the accumulator circuit may all be integrated into one circuit, or they may be separately provided circuits.
  • the connection structure of the branch processing circuits and the basic processing circuits may be arbitrary, and is not limited to the H-shaped structure of FIG. 2-1a.
  • the data path from the main processing circuit to the basic circuits is a broadcast or distribution structure, and from the basic circuits to the main processing circuit is a gather structure. Broadcast, distribution, and gather are defined as follows:
  • the data transmission mode from the main processing circuit to the basic circuit may include:
  • the main processing circuit is respectively connected to a plurality of branch processing circuits, and each branch processing circuit is respectively connected to a plurality of basic circuits.
  • the main processing circuit is connected to one branch processing circuit, that branch processing circuit is in turn connected to another branch processing circuit, and so on; a plurality of branch processing circuits are connected in series, and each branch processing circuit is then connected to a plurality of basic circuits.
  • the main processing circuit is connected to a plurality of branch processing circuits, and each branch processing circuit is connected in series with a plurality of basic circuits.
  • the main processing circuit is connected to one branch processing circuit, that branch processing circuit is in turn connected to another branch processing circuit, and so on; a plurality of branch processing circuits are connected in series, and each branch processing circuit is then connected in series with a plurality of basic circuits.
  • when distributing data, the main processing circuit transmits data to some or all of the basic circuits, and the data received by each receiving basic circuit may be different;
  • when broadcasting data, the main processing circuit transmits data to some or all of the basic circuits, and each receiving basic circuit receives the same data; both modes are sketched below.
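  • A minimal sketch of the two modes (hypothetical function names, illustrative only):

```python
# "Broadcast" sends the same data to every receiving basic circuit, while
# "distribute" selectively sends a possibly different piece to some circuits.

def broadcast(data, basic_circuits):
    return {c: data for c in basic_circuits}          # everyone gets the same

def distribute(pieces, basic_circuits):
    # Selectively send distribution data to some basic circuits; received
    # pieces may differ, and some circuits may receive nothing.
    return {c: p for c, p in zip(basic_circuits, pieces)}

circuits = ["basic0", "basic1", "basic2", "basic3"]
print(broadcast([1, 2, 3], circuits))       # all four receive [1, 2, 3]
print(distribute([[1], [2, 3]], circuits))  # only basic0 and basic1 receive
```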
  • the computing unit shown in FIG. 2-1a may be a separate physical chip; of course, in practical applications the computing unit may also be integrated in another chip (such as a CPU or GPU). The specific implementation of this application does not limit the physical form of the chip device.
  • FIG. 2-1c is a schematic diagram of the data distribution of a computing unit; the arrows in FIG. 2-1c show the data distribution direction.
  • after the main processing circuit receives external data, it splits the external data and distributes it to the multiple branch processing circuits, and the branch processing circuits send the split data to the basic processing circuits.
  • FIG. 2-1d is a schematic diagram of the data return of a computing unit; the arrows in FIG. 2-1d show the data return direction. As shown in FIG. 2-1d, the basic processing circuits return data (such as inner product results) to the branch processing circuits, and the branch processing circuits return the data to the main processing circuit.
  • the input data may specifically be vectors, matrices, or multi-dimensional (three-dimensional, four-dimensional, or higher) data; a specific value in the input data may be called an element of the input data.
  • the embodiment of the present disclosure also provides a calculation method of a calculation unit shown in FIG. 2-1a.
  • the calculation method is applied to the calculation of a neural network.
  • the computing unit may be used to perform operations on the input data and weight data of one layer or multiple layers of a neural network.
  • specifically, the computing unit is configured to perform operations on the input data and weight data of one or more layers of a trained multi-layer neural network;
  • alternatively, the computing unit is configured to perform operations on the input data and weight data of one or more layers of a multi-layer neural network in the forward operation.
  • the above operations include, but are not limited to, one or any combination of convolution operations, matrix-multiply-matrix operations, matrix-multiply-vector operations, bias operations, fully connected operations, GEMM operations, GEMV operations, and activation operations.
  • GEMM refers to the general matrix-matrix multiplication operation in the BLAS library.
  • GEMV refers to the general matrix-vector multiplication operation in the BLAS library; GEMV is illustrated below.
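  • As a reference point, the BLAS GEMV operation computes y = alpha*A*x + beta*y (GEMM is the analogous matrix-matrix form C = alpha*A*B + beta*C); a small NumPy illustration (NumPy itself is an assumption here, not something the patent references):

```python
# BLAS-style GEMV: y = alpha * (A @ x) + beta * y
import numpy as np

def gemv(alpha, a, x, beta, y):
    return alpha * (a @ x) + beta * y

a = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([1.0, 1.0])
y = np.array([0.5, 0.5])
print(gemv(2.0, a, x, 1.0, y))  # [ 6.5 14.5]
```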
  • This application does not limit the connection relationship between the computing carriers in a computing device (the carriers may be homogeneous or heterogeneous), nor does it limit the connection relationship between the computing units in a computing carrier.
  • the computing units can execute tasks in parallel, which can improve computing efficiency.
  • each computing carrier further includes at least one on-chip cache circuit and one off-chip cache circuit.
  • the first computing carrier 101 includes a first on-chip cache circuit 1011 and a first off-chip cache circuit 1012.
  • the second computing carrier 102 includes a second on-chip cache circuit 1021 and a second off-chip cache circuit 1022.
  • the N-th computing carrier 103 includes an N-th on-chip cache circuit 1031 and an N-th off-chip cache circuit 1032.
  • the on-chip cache circuit may include on-chip memory, including but not limited to double data rate memory (DDR), dynamic random access memory (DRAM), 3D DRAM, 3D SRAM, and other forms; the off-chip cache circuit may be off-chip memory, including but not limited to shared memory, cache, and so on.
  • the cache may have a multi-level structure, such as an N-level cache structure comprising L1 Cache, L2 Cache, ..., LN Cache.
  • the computing device 100 further includes an on-chip storage data path control circuit 110 connected to each on-chip cache circuit, and an on-chip storage data path 121 connected to the on-chip storage data path control circuit 110, wherein: the on-chip storage data path control circuit 110 is configured to receive a data transmission instruction sent by the first on-chip cache circuit 1011 of the first computing carrier 101 among the multiple computing carriers, and to decode the data transmission instruction to obtain a sending data address and a receiving data address; the on-chip storage data path 121 is configured to obtain target data according to the sending data address and to transmit the target data to the receiving data address.
  • the first computing carrier 101 is any one of the multiple computing carriers, and the data transmission instruction is a binary file.
  • decoding a data transmission instruction yields the sending data address and the receiving data address, and can also yield parameters such as a data capacity and a data identifier for determining the target data.
  • the sending data address is the address at which the target data is stored in the first on-chip cache circuit, and the receiving data address is an address in the second on-chip cache circuit 1021 of the second computing carrier 102 of the multiple computing carriers; that is, the data transmission instruction instructs the on-chip storage data path control circuit 110 to transfer the target data buffered in the first on-chip cache circuit 1011 to the second on-chip cache circuit 1021, it having been determined in advance that the computing carrier with which the first computing carrier 101 performs data transmission is the second computing carrier 102.
  • when the on-chip storage data path control circuit 110 receives a data transmission instruction sent by the first on-chip cache circuit 1011, it decodes the data transmission instruction to obtain the sending data address and the receiving data address; the on-chip storage data path 121 then obtains the target data corresponding to the sending data address and transmits it to the receiving data address, and the second on-chip cache circuit 1021 caches the target data, completing the data transmission between the on-chip cache circuits of the two computing carriers, as sketched below.
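  • A minimal sketch of the decode-and-transfer path (the instruction encoding and names here are assumptions for illustration, not the patent's binary format):

```python
# Sketch of the transfer path: the control circuit decodes a data
# transmission instruction into a sending address and a receiving address,
# then the on-chip data path copies the target data from the first carrier's
# on-chip cache into the second carrier's on-chip cache.

on_chip_caches = {  # (carrier, address) -> data
    ("carrier1", 0x10): b"target-data",
}

def decode(instruction):
    # A dict stands in for the binary instruction format (assumed encoding).
    return instruction["send_addr"], instruction["recv_addr"], instruction["size"]

def execute_transfer(instruction):
    send_addr, recv_addr, size = decode(instruction)
    target = on_chip_caches[send_addr][:size]   # fetch by sending address
    on_chip_caches[recv_addr] = target          # cache at receiving address

execute_transfer({"send_addr": ("carrier1", 0x10),
                  "recv_addr": ("carrier2", 0x20), "size": 11})
print(on_chip_caches[("carrier2", 0x20)])  # b'target-data'
```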
  • the on-chip storage data path control circuit 110 may receive multiple data transmission instructions at the same time; it is therefore necessary to determine the execution order of the data transmission instructions. This application does not limit how the execution order is determined.
  • for example, the priority corresponding to each data transmission instruction can be obtained, giving multiple priorities, and the execution order of the data transmission instructions is determined according to the multiple priorities.
  • the priority can be derived from multiple dimensions, such as the quantity and capacity of the target data, the priority of the target data, or the priority and remaining memory size of the first on-chip cache circuit.
  • that is, the on-chip storage data path control circuit 110 determines the execution order between the data transmission instructions and controls the on-chip storage data path 121 to perform data transmission according to that order, which can improve the stability of transmission; a sketch of one arbitration policy follows.
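  • A sketch of one possible arbitration policy (assumed, since the application does not fix one): order pending instructions by a precomputed priority using a priority queue.

```python
# Hedged sketch of request arbitration: pending data transmission
# instructions carry a priority (derived, e.g., from target-data size) and
# are executed highest-priority first.
import heapq

def arbitrate(instructions):
    # Lower number = higher priority; the index breaks ties deterministically.
    heap = [(instr["priority"], i, instr) for i, instr in enumerate(instructions)]
    heapq.heapify(heap)
    while heap:
        _, _, instr = heapq.heappop(heap)
        yield instr["name"]

pending = [{"name": "xfer-A", "priority": 2},
           {"name": "xfer-B", "priority": 0},
           {"name": "xfer-C", "priority": 1}]
print(list(arbitrate(pending)))  # ['xfer-B', 'xfer-C', 'xfer-A']
```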
  • the on-chip storage data path control circuit 110 includes an instruction cache unit 1101, an instruction decoding unit 1102 connected to the instruction cache unit 1101, and a memory management unit 1103 connected to the instruction cache unit 1101 and the instruction decoding unit 1102, where:
• the instruction cache unit 1101 is configured to cache the data transmission instruction;
  • the instruction decoding unit 1102 is configured to decode the data transmission instruction to obtain the sending data address and the receiving data address;
  • the memory management unit 1103 is configured to manage the data transmission instruction.
• the on-chip storage data path control circuit 110 is thus further divided into the instruction cache unit 1101, the instruction decoding unit 1102, and the memory management unit 1103, each executing its corresponding step: the data transmission instruction is managed by the memory management unit 1103, fetched directly from the instruction cache unit 1101 when it is to be executed, and decoded by the instruction decoding unit 1102 to complete the data transmission, which improves both execution efficiency and execution stability.
  • the memory management unit 1103 includes an address mapping module 11031, a request arbitration module 11032, and a consistency control module 11033, where:
• the address mapping module 11031 is configured to determine the second on-chip cache circuit corresponding to the receiving data address;
  • the request arbitration module 11032 is configured to allocate an execution order of each data transmission instruction in the plurality of data transmission instructions if the instruction cache unit includes a plurality of the data transmission instructions;
  • the consistency control module 11033 is configured to ensure consistency of data transmission.
• the memory management unit 1103 is thus further divided into the address mapping module 11031, the request arbitration module 11032, and the consistency control module 11033, each executing its corresponding step: the address mapping module 11031 determines the second on-chip cache circuit in which the target data is to be cached;
• the request arbitration module 11032 determines the execution order of each data transmission instruction and controls the on-chip storage data path 121 to perform data transmission according to that order, which can improve the stability of the transmission;
• the consistency control module 11033 ensures the consistency of data transmission, which improves both the stability of the transmission and the security of execution.
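• the cooperation of the three modules can be illustrated with the following sketch; the interfaces (address ranges as triples, one lock per target cache for consistency, first-in-first-out arbitration) are simplifying assumptions rather than the claimed implementation:

```python
import threading

class MemoryManagementUnit:
    """Sketch of the memory management unit's three modules."""

    def __init__(self, address_ranges):
        # address_ranges: list of (base, limit, carrier_id) triples mapping
        # receiving data addresses onto second on-chip cache circuits.
        self.address_ranges = address_ranges
        self.locks = {cid: threading.Lock() for _, _, cid in address_ranges}

    def map_address(self, recv_addr: int) -> int:
        # Address mapping module: resolve the receiving data address to the
        # carrier whose on-chip cache circuit will receive the target data.
        for base, limit, cid in self.address_ranges:
            if base <= recv_addr < limit:
                return cid
        raise ValueError("unmapped receiving data address")

    def arbitrate(self, pending: list) -> list:
        # Request arbitration module: fix an execution order for concurrent
        # instructions (first-in-first-out here; any policy would do).
        return list(pending)

    def execute(self, pending: list, do_transfer) -> None:
        for instr in self.arbitrate(pending):
            cid = self.map_address(instr["recv_addr"])
            # Consistency control module, modeled as one lock per target
            # cache so that overlapping transfers cannot interleave writes.
            with self.locks[cid]:
                do_transfer(instr, cid)
```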
• the computing device 100 further includes a peripheral component interconnect express (PCIE) bus data path 122 connected to each off-chip cache circuit, for implementing data transmission between the off-chip cache circuits of any two computing carriers among the plurality of computing carriers.
• the off-chip storage data of the various computing carriers can thus interact directly through the PCIE data path 122; that is, off-chip cached data is exchanged through the dedicated PCIE data path 122, which supports larger-scale machine learning operations. The PCIE interface can also connect to various types of servers, which improves transmission efficiency.
  • FIG. 2-2 is a schematic flowchart of a data transmission method proposed by this application.
• the data transmission method is applied to the computing device shown in FIG. 2-1; that is, the computing device includes multiple computing carriers and an on-chip storage data path control circuit connected to the on-chip cache circuit of each computing carrier among the multiple computing carriers.
  • S201 Receive a data transmission instruction sent by a first on-chip cache circuit of a first computing carrier in a plurality of computing carriers through an on-chip storage data path control circuit.
  • S202 Decode the data transmission instruction through the on-chip storage data path control circuit to obtain a sending data address and a receiving data address.
• S203 Obtain target data according to the sending data address through the on-chip storage data path, and transmit the target data to the receiving data address.
• the receiving data address is an address in a second on-chip cache circuit of a second computing carrier among the plurality of computing carriers.
• in this method, the on-chip storage data path control circuit receives the data transmission instruction sent by the first on-chip cache circuit of the first computing carrier among the plurality of computing carriers and decodes it to obtain a sending data address and a receiving data address; the on-chip storage data path then obtains the target data according to the sending data address and transmits it to the receiving data address. In this way, the data transmission efficiency between different computing carriers can be improved, which in turn helps improve the operation efficiency of the neural network.
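• a minimal end-to-end sketch of steps S201 to S203 follows, under the same assumed instruction fields as in the earlier decode example; the explicit 'src_carrier'/'dst_carrier' keys are hypothetical:

```python
def run_data_transmission(instr: dict, caches: dict) -> None:
    # S201: the control circuit has received `instr` from the first
    # carrier's on-chip cache circuit (modeled as an already-parsed dict).
    # S202: decode into a sending data address and a receiving data address.
    send_addr, recv_addr = instr["send_addr"], instr["recv_addr"]
    n = instr["capacity"]
    # S203: the on-chip storage data path reads the target data at the
    # sending address and writes it at the receiving address in the second
    # carrier's on-chip cache circuit.
    src, dst = caches[instr["src_carrier"]], caches[instr["dst_carrier"]]
    dst[recv_addr:recv_addr + n] = src[send_addr:send_addr + n]

# Example: move 4 bytes from carrier 0 to carrier 1.
caches = {0: bytearray(b"\x01\x02\x03\x04" + bytes(12)), 1: bytearray(16)}
run_data_transmission({"send_addr": 0, "recv_addr": 8, "capacity": 4,
                       "src_carrier": 0, "dst_carrier": 1}, caches)
```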
• the on-chip storage data path control circuit includes an instruction cache unit, an instruction decoding unit connected to the instruction cache unit, and a memory management unit connected to the instruction cache unit and the instruction decoding unit;
• here, decoding the data transmission instruction through the on-chip storage data path control circuit to obtain a sending data address and a receiving data address includes: caching the data transmission instruction through the instruction cache unit; and
  • Decoding the data transmission instruction by the instruction decoding unit to obtain the sending data address and the receiving data address;
  • the method further includes:
  • the data transmission instruction is managed by the memory management unit.
  • the memory management unit includes an address mapping module, a request arbitration module, and a consistency control module, and the managing the data transmission instruction by the memory management unit includes:
• determining, through the address mapping module, the second on-chip cache circuit corresponding to the receiving data address;
• if the instruction cache unit includes a plurality of the data transmission instructions, determining an execution order of each data transmission instruction among the plurality of data transmission instructions through the request arbitration module; and
• ensuring the consistency of data transmission through the consistency control module.
• the computing device further includes a peripheral component interconnect express (PCIE) bus data path, and the method further includes: implementing data transmission between the off-chip cache circuits of any two computing carriers among the multiple computing carriers through the PCIE data path.
• the multiple computing carriers include at least two of a central processing unit CPU, a graphics processor GPU, an application-specific integrated circuit ASIC, a field-programmable gate array FPGA, a coarse-grained reconfigurable array CGRA, or a digital signal processor DSP.
  • the calculation carrier includes at least one calculation unit.
  • the calculation unit includes: a main processing circuit, a branch processing circuit, and a basic processing circuit.
  • the main processing circuit is connected to the branch processing circuit.
• the basic processing circuit is connected to the branch processing circuit, and the method further includes:
• sending the broadcast data to all the branch processing circuits in a broadcast manner through the main processing circuit; and
• performing an operation on the broadcast data and the distribution data through the basic processing circuit to obtain an operation result, where the basic processing circuit performs an inner product operation, a product operation, or a vector operation on the broadcast data and the distribution data; a sketch of this pattern follows.
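• below is a minimal sketch of this broadcast-and-distribute pattern, with NumPy standing in for the hardware circuits; the row-wise slicing of the distribution data is an illustrative assumption, as the application does not fix a partitioning scheme:

```python
import numpy as np

def broadcast_inner_products(broadcast_vec, matrix, num_basic):
    """The main processing circuit broadcasts one operand to every branch;
    slices of the other operand are distributed, and each basic processing
    circuit returns inner products <broadcast_vec, row_i> for its rows."""
    rows = np.array_split(matrix, num_basic)   # distribution data per circuit
    results = []
    for chunk in rows:                         # one basic circuit per chunk
        results.append(chunk @ broadcast_vec)  # inner product operation
    return np.concatenate(results)

# Example: a 4x8 weight matrix against one broadcast input vector,
# split across two basic processing circuits.
out = broadcast_inner_products(np.ones(8),
                               np.arange(32.0).reshape(4, 8),
                               num_basic=2)
```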
• a computing device is also provided, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by the processor to perform the implementation described in the data transmission method.
• a computer-readable storage medium is also provided, which stores a computer program; the computer program includes program instructions that, when executed by a processor, cause the processor to execute the implementation described in the data transmission method.
  • the present application also discloses a combined processing device, which includes the above-mentioned computing device, a universal interconnection interface, and other processing devices.
  • the machine learning computing device interacts with other processing devices to jointly complete the operation specified by the user.
  • Figure 2-3 is a schematic structural diagram of a combined processing device.
  • Other processing devices include one or more types of processors such as a central processing unit CPU, a graphics processor GPU, and a neural network processor.
  • the number of processors included in other processing devices is not limited.
• other processing devices serve as the interface between the machine learning computing device and external data and control, performing operations including data handling and completing basic control of the machine learning computing device such as starting and stopping; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
  • a universal interconnection interface for transmitting data and control instructions between the machine learning computing device and other processing devices.
• the machine learning computing device obtains the required input data from other processing devices and writes it to the storage device on the chip of the machine learning computing device; it can obtain control instructions from other processing devices and write them to the control cache on the machine learning computing device chip;
• it can also read the data in the storage module of the machine learning computing device and transmit it to other processing devices.
  • the combined processing device shown in FIG. 2-3 may further include a storage device, and the storage device is connected to the machine learning computing device and the other processing devices, respectively.
• the storage device is configured to store the data of the machine learning computing device and the other processing devices, and is particularly suitable for data to be computed that cannot be completely stored in the internal storage of the machine learning computing device or the other processing devices.
• the combined processing device can be used as an SOC (system on chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control portion, increasing processing speed, and reducing overall power consumption.
• in this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, monitor, mouse, keyboard, network card, or WiFi interface.
• a chip is also disclosed, which includes the above-mentioned machine learning computing device or combined processing device.
• a chip packaging structure is also disclosed, which includes the above chip.
• a board card is also disclosed, which includes the above chip packaging structure.
  • FIG. 2-4 provides a board card.
  • the board card may also include other supporting components.
• the supporting components include, but are not limited to, a storage device, an interface device, and a control device;
  • the memory device is connected to a chip in the chip package structure through a bus, and is used to store data.
• the memory device may include multiple groups of storage units, where each group of storage units is connected to the chip through a bus. It can be understood that each group of storage units may be a double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM).
• in one embodiment, the storage device may include 4 groups of storage units, and each group of storage units may include a plurality of DDR4 chips. In one embodiment, the chip may include four 72-bit DDR4 controllers, where 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
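• as a check on the quoted figure: a DDR4-3200 chip performs 3200 mega-transfers per second, so one controller's 64 data bits provide 3200 MT/s × 64 bit ÷ 8 bit/byte = 25600 MB/s of theoretical bandwidth, matching the value above.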
• each group of the storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transfer data twice in one clock cycle.
  • a controller for controlling DDR is provided in the chip, and is used for controlling data transmission and data storage of each of the storage units.
  • the interface device is electrically connected to a chip in the chip package structure.
  • the interface device is used to implement data transmission between the chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to implement data transfer.
  • the interface device may also be other interfaces.
• the present application does not limit the specific expression of the other interfaces described above, as long as the interface unit can implement the transfer function.
  • the operation result of the chip is still transmitted by the interface device to an external device (such as a server).
  • the control device is electrically connected to the chip.
  • the control device is configured to monitor a state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
• the control device may include a microcontroller (Micro Controller Unit, MCU).
• the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and may drive multiple loads; therefore, the chip can be in different working states such as multi-load and light-load.
• the control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip; an illustrative sketch follows.
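• as a hedged illustration of this regulation, the loop below polls an assumed SPI load register and selects a working state from the observed load; the register name, the thresholds, and the state names are invented for the example:

```python
def regulate(read_spi_register, set_working_state) -> None:
    """One regulation step: read a load indicator over SPI (assumed
    'LOAD_STATUS' register returning a 0.0-1.0 utilization) and choose
    a working state for the chip's processing circuits."""
    load = read_spi_register("LOAD_STATUS")  # hypothetical register name
    if load > 0.75:
        set_working_state("multi_load")      # heavy utilization
    elif load > 0.25:
        set_working_state("normal")
    else:
        set_working_state("light_load")      # low utilization

# Example with stand-in callables.
regulate(lambda reg: 0.9, lambda state: print("state ->", state))
```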
  • an electronic device which includes the board card described above.
• electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, video cameras, camcorders, projectors, watches, headphones, mobile storage, wearable devices, transportation means, household appliances, and/or medical equipment.
• the transportation means include airplanes, ships, and/or vehicles;
• the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
• the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound machines, and/or electrocardiographs.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are only schematic.
• the division of the units is only a logical function division; in actual implementation there may be another division manner, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
• the displayed or discussed mutual coupling, direct coupling, or communication connection may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or in the form of software program modules.
• when the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory.
• the technical solution of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product stored in a memory, which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps described for the methods of this application.
• the foregoing memory includes media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), mobile hard disk, magnetic disk, or optical disk.
• the program may be stored in a computer-readable memory, and the memory may include a flash disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk, etc.

Abstract

The present application relates to a computing apparatus and a related product. The computing apparatus includes X groups of neural network chips, each group of neural network chips among the X groups including a master chip and at least one slave chip, the master chip being connected to the slave chip, the master chips of the X groups of neural network chips being connected to one another, and the value of X being an integer greater than or equal to 2. The computing apparatus divides the multiple groups of neural network chips into master chips and slave chips, and then schedules data among the chips according to the connection relationship between the master chips, so as to improve the training speed of the neural network chips and shorten the training time.
PCT/CN2019/108842 2018-09-29 2019-09-29 Appareil informatique et produit associé WO2020063940A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201811153022.6A CN110968532B (zh) 2018-09-29 2018-09-29 数据传输方法及相关产品
CN201811153022.6 2018-09-29
CN201811207452.1 2018-10-17
CN201811207452.1A CN111062469B (zh) 2018-10-17 2018-10-17 计算装置及相关产品

Publications (1)

Publication Number Publication Date
WO2020063940A1 true WO2020063940A1 (fr) 2020-04-02

Family

ID=69950992

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/108842 WO2020063940A1 (fr) 2018-09-29 2019-09-29 Appareil informatique et produit associé

Country Status (1)

Country Link
WO (1) WO2020063940A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786736A (zh) * 2014-12-18 2016-07-20 Shenzhen ZTE Microelectronics Technology Co., Ltd. Multi-chip cascading method, chip, and device
US20180204118A1 (en) * 2017-01-18 2018-07-19 Hitachi, Ltd. Calculation System and Calculation Method of Neural Network
CN108510064A (zh) * 2016-04-18 2018-09-07 Institute of Computing Technology, Chinese Academy of Sciences Processing system and method for an artificial neural network including multiple core processing modules
CN108549934A (zh) * 2018-04-25 2018-09-18 Fuzhou Rockchip Electronics Co., Ltd. Operation method and device based on an automatically clustered neural network chipset


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19865073

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 250621)

122 Ep: pct application non-entry in european phase

Ref document number: 19865073

Country of ref document: EP

Kind code of ref document: A1