WO2021036404A1 - Data transmission method and related devices

Data transmission method and related devices

Info

Publication number: WO2021036404A1
Authority: WIPO (PCT)
Prior art keywords: data, computing node, computing, node, result
Application number: PCT/CN2020/095205
Other languages: English (en), French (fr)
Inventors: 张尧, 刘少礼, 韩栋
Original assignee: 安徽寒武纪信息科技有限公司
Priority claimed from CN201910819939.3A (external priority, CN112446463B)
Priority claimed from CN201910819947.8A (external priority, CN112446485B)
Priority claimed from CN201910819946.3A (external priority, CN112446474B)
Priority claimed from CN201910819940.6A (external priority, CN112446464B)
Application filed by 安徽寒武纪信息科技有限公司
Publication of WO2021036404A1

Classifications

    • H: Electricity
    • H04: Electric communication technique
    • H04W: Wireless communication networks
    • H04W 28/00: Network traffic management; network resource management
    • H04W 28/02: Traffic management, e.g. flow control or congestion control
    • H04W 28/06: Optimizing the usage of the radio link, e.g. header compression, information sizing, discarding information

Definitions

  • This application relates to the field of chip technology, and specifically, to a chip, a multi-chip system, an electronic device, and a data transmission method.
  • This application aims to provide a chip, a multi-chip system, an electronic device, and a data transmission method that can improve computing efficiency.
  • a chip, including a data bus and, connected to the data bus, a memory, a data receiver, an arithmetic processing unit, and a data transmitter, wherein the data receiver is configured to receive first data and header information from the outside, write the first data to a corresponding area of the memory through the data bus, and configure the corresponding arithmetic processing unit and/or data transmitter according to the header information; the arithmetic processing unit is configured to receive first task information, perform arithmetic processing according to the first task information, and perform configuration operations on the data transmitter; and the data transmitter is configured to obtain second task information and second data, and output third data to the outside based on at least part of the second data.
  • a multi-chip system including the chip according to the present application.
  • an electronic device including the chip or multi-chip system according to the present application.
  • a method for a computing node to transmit data, including: starting to receive first data; after receiving a part of the first data, forwarding that part of the first data while continuing to receive the first data; and/or after receiving a part of the first data, processing that part of the first data and forwarding the processing result while continuing to receive the first data.
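  • As a rough software illustration of this receive-and-forward behaviour (a sketch only, not the claimed hardware), the Python fragment below assumes the first data arrives as an iterable of chunks and that send() and process() are caller-supplied placeholders:

```python
from typing import Callable, Iterable, Optional


def receive_and_forward(chunks: Iterable[bytes],
                        send: Callable[[bytes], None],
                        process: Optional[Callable[[bytes], bytes]] = None) -> bytes:
    """Forward (and optionally process) each received part of the first data
    while the remainder of the first data is still being received."""
    received = bytearray()
    for chunk in chunks:                 # the first data arrives chunk by chunk
        received.extend(chunk)           # keep the full copy locally
        if process is not None:
            send(process(chunk))         # process the received part, forward the result
        else:
            send(chunk)                  # or forward the raw part immediately
    return bytes(received)               # the complete first data
```

  • Because each part is forwarded as soon as it arrives rather than after the whole transfer completes, a downstream node can overlap its own reception with the upstream node's transmission, which is the effect illustrated by Figure 1-7B.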
  • a data transmission method including using the chip according to the present application to execute the aforementioned method for computing node data transmission.
  • a data transmission method including using the multi-chip system according to the present application to execute the foregoing method.
  • a chip structure which overcomes the defect that as the number of cooperatively working chips increases, the amount of communication between multiple chips increases rapidly.
  • the calculation and transmission of data can be pipelined, which can cover the transmission overhead and improve computing efficiency and hardware resource utilization.
  • the embodiments of this application provide a neural network convolution operation method, device, and related products, which can reduce data communication time, enable the communication process to be covered by the calculation process, and improve the efficiency of the convolution operation.
  • a convolution operation method is provided, which is applied to an artificial intelligence processor including multiple computing nodes.
  • the method includes: performing a convolution operation on target data to obtain an operation result, the target data being any one of a plurality of groups of data to be calculated; and, in the process of performing the convolution operation on the target data and obtaining the operation result, when it is determined that the operation result is used by other computing nodes, sending the operation result to the corresponding other computing nodes.
  • a convolution operation device is provided.
  • the device is applied to an artificial intelligence processor including multiple computing nodes.
  • the device includes: a first execution unit configured to perform a convolution operation on target data to obtain an operation result, the target data being any one of the multiple groups of data to be operated on; and a sending unit configured to, in the process of performing the convolution operation on the target data and obtaining the operation result, send the operation result to the corresponding other computing node when it is determined that the operation result is used by that other computing node.
  • an electronic device including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and the processor implements the method of the first aspect when the computer program is executed.
  • a computer-readable storage medium which stores a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method provided in the first aspect.
  • in a fifth aspect, a computer program product including a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the method provided in the first aspect.
  • in the technical solution provided by this application, during the process of performing the convolution operation and obtaining the operation result, the operation result is sent to the corresponding other computing nodes that need to use it, that is, the result is sent while it is being calculated, so
  • the communication time is reduced; moreover, the data calculated by each computing node is divided into multiple groups of data to be calculated, and the groups whose calculation results are used by other computing nodes are calculated first.
  • the embodiments of this application provide a neural network fully connected layer computing method, device, and related products, which can reduce data communication time, enable the communication process to be covered by the calculation process, and improve the efficiency of fully connected layer computing.
  • a fully connected layer calculation method is provided.
  • the method is applied to an artificial intelligence processor including multiple computing nodes.
  • the method includes: performing an operation on the input calculation data for a first output to obtain a first result; if it is determined that there is a second result for the first output sent from a second computing node, receiving the second result sent by the second computing node; and, in the process of receiving the second result, adding the first result and the second result to obtain a third result.
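  • A minimal NumPy sketch of this add-while-receiving step is shown below; the function names, the chunked delivery of the second result, and the matrix-product form of the first result are assumptions made only for illustration:

```python
import numpy as np


def first_result_for_output(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """This node's partial result for one output (its input data times its weight slice)."""
    return x @ w


def add_while_receiving(first_result: np.ndarray, incoming_chunks) -> np.ndarray:
    """Add the second result to the first result chunk by chunk while it is
    still arriving, producing the third result."""
    third = first_result.copy()
    offset = 0
    for chunk in incoming_chunks:        # chunks of the second result, in arrival order
        third[offset:offset + len(chunk)] += chunk
        offset += len(chunk)
    return third
```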
  • a fully connected layer computing device is provided.
  • the device is applied to an artificial intelligence processor including multiple computing nodes.
  • the device includes: a first computing unit configured to perform an operation on the input calculation data for a first output to obtain a first result; a first receiving unit configured to receive the second result sent by a second computing node when it is determined that there is a second result for the first output sent from the second computing node; and an addition unit configured to add the first result and the second result in the process of receiving the second result to obtain a third result.
  • an electronic device including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and the processor implements the method of the first aspect when the computer program is executed.
  • a computer-readable storage medium which stores a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method provided in the first aspect.
  • in a fifth aspect, a computer program product including a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the method provided in the first aspect.
  • in this solution, the computing nodes perform coordinated operations for one output. Each computing node can perform the summation in the process of receiving the operation results of other computing nodes and send the summed result while it is being obtained; that is, a part of the data is processed as soon as a part of the data is received, and a part of the calculation result is sent as soon as a part of the calculation result is calculated.
  • after each computing node has completed its own fully connected layer calculation for the current output, it can perform subsequent fully connected layer calculations for other outputs or calculations of other neural network layers without waiting for the slowest computing node to finish, thereby improving calculation efficiency.
  • the embodiments of this application provide a neural network collaborative training method, device, and related products, which can reduce data communication time, enable the communication process to be covered by the calculation process, and improve the efficiency of collaborative training.
  • a method of collaborative training, applied to an artificial intelligence processor including a plurality of nodes, the plurality of nodes including a control node and a plurality of computing nodes; for each of the plurality of computing nodes,
  • the method includes the following steps: acquiring first weight gradient data; and, in the case where there is second weight gradient data from a second computing node among the plurality of computing nodes, sending the updated weight gradient data in the process of adding the second weight gradient data from the second computing node and the first weight gradient data to obtain the updated weight gradient data.
  • a device for collaborative training is provided.
  • the device is applied to an artificial intelligence processor including multiple nodes.
  • the multiple nodes include control nodes and multiple computing nodes.
  • the device includes: an acquiring unit configured to acquire first weight gradient data; and a first sending unit configured to, in the case where second weight gradient data is received from a second computing node among the plurality of computing nodes, send the updated weight gradient data in the process of adding the second weight gradient data from the second computing node and the first weight gradient data to obtain the updated weight gradient data.
  • an electronic device including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and the processor implements the method of the first aspect when the computer program is executed.
  • a computer-readable storage medium which stores a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method provided in the first aspect.
  • in a fifth aspect, a computer program product including a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the method provided in the first aspect.
  • in the technical solution provided by this application, when the requirement of the signal for obtaining gradient update data is met, the computing node adds the local weight gradient data to the weight gradient data from another computing node and sends the result of the addition while it is being computed, that is, it sends the calculation result while calculating instead of after the calculation is completed; when the requirement of the signal for obtaining gradient update data is not met, the computing node sends the weight gradient data received from other computing nodes while it is still receiving them, that is, it sends the data while receiving instead of after the reception is completed.
  • in this way, sending while calculating and sending while receiving can effectively reduce the communication time; moreover, in the training process, multiple computing nodes are grouped, so that when the computing power of the computing nodes does not match, only part of the computing nodes need to be synchronized, thereby reducing the waiting overhead between different computing nodes and improving computing efficiency.
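  • The sketch below is a loose software model of this grouped, slice-by-slice gradient merging over a ring; the group membership, chunk size, and function names are illustrative assumptions rather than the embodiment itself:

```python
import numpy as np


def ring_merge_gradients(local_grads, merge_group, chunk_size=4):
    """local_grads: one weight-gradient vector per computing node (indexed by node id).
    merge_group: ids of the nodes whose gradients are merged in this pass;
    nodes outside the group simply forward each slice unchanged."""
    accumulated = np.zeros_like(local_grads[0])
    for start in range(0, len(accumulated), chunk_size):   # slice-by-slice transmission
        sl = slice(start, start + chunk_size)
        piece = np.zeros_like(accumulated[sl])
        for node_id in range(len(local_grads)):            # the slice travels around the ring
            if node_id in merge_group:
                piece = piece + local_grads[node_id][sl]   # add while receiving, then forward
            # nodes not in merge_group forward the piece without processing
        accumulated[sl] = piece
    return accumulated
```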
  • Figure 1-1 shows a chip structure according to an exemplary embodiment of the present application.
  • Figure 1-2A shows a data receiver according to an exemplary embodiment of the present application.
  • Figure 1-2B shows a data receiver according to another exemplary embodiment of the present application.
  • Figure 1-3A shows a data transmitter according to an exemplary embodiment of the present application.
  • Figure 1-3B shows a data transmitter according to another exemplary embodiment of the present application.
  • Figure 1-3C shows a data transmitter according to another exemplary embodiment of the present application.
  • Figure 1-4 shows a merging module according to an exemplary embodiment of the present application.
  • Figure 1-5A shows a ring connection structure based on a ring topology according to an exemplary embodiment of the present application.
  • Figure 1-5B shows a ring connection structure constructed in a 2D-MESH topology according to an exemplary embodiment of the present application.
  • Figure 1-6 illustrates a method for computing nodes to transmit data according to an embodiment of the present application.
  • Figure 1-7A shows an example of a data transmission process in the prior art.
  • Figure 1-7B shows an example of the data transmission process of the method shown in Figure 1-6.
  • Figure 1-8 shows a schematic diagram of multi-node cooperative execution of convolution operations according to an exemplary embodiment of the present application.
  • Figure 1-9 shows a schematic diagram of multi-node collaborative execution of classification layer operations according to an exemplary embodiment of the present application.
  • Figure 1-10 shows a schematic diagram of multi-chip asynchronous and parallel collaborative training according to an exemplary embodiment of the present application.
  • Figure 1-11 shows a schematic diagram of an electronic device according to an exemplary embodiment of the present application.
  • Figure 2-1 is a schematic diagram of the structure of a neural network architecture.
  • Figure 2-2 provides a schematic diagram of a multi-core system according to an embodiment of the present application.
  • Figure 2-3 provides a schematic diagram of a convolution algorithm according to an embodiment of the present application.
  • Figures 2-4 provide a schematic diagram of a convolution algorithm according to another embodiment of the present application.
  • Figures 2-5 provide a schematic diagram of a topological structure between computing nodes according to an embodiment of the present application.
  • Figures 2-6A to 2-6G are flowcharts of a convolution operation method according to an embodiment of the present application.
  • Figures 2-7A to 2-7G are schematic diagrams of a convolution operation device according to an embodiment of the present application.
  • Figures 2-8 are structural diagrams of an electronic device provided by an embodiment of the present application.
  • Figure 3-1 is a schematic diagram of a neural network architecture.
  • Figure 3-2 provides a schematic diagram of a multi-core system according to an embodiment of the present application.
  • Figure 3-3 provides a schematic diagram of a fully connected layer algorithm according to an embodiment of the present application.
  • Figures 3-4 provide a schematic diagram of a topological structure between computing nodes according to an embodiment of the present application.
  • Figures 3-5A to 3-5H are flowcharts of a fully connected layer operation method according to an embodiment of the present application.
  • Figures 3-6A to 3-6H are schematic diagrams of a fully connected layer computing device according to an embodiment of the present application.
  • Figures 3-7 are structural diagrams of an electronic device provided by an embodiment of the present application.
  • Figure 4-1 is a schematic diagram of a neural network architecture.
  • Figure 4-2 provides a schematic diagram of a multi-core system according to an embodiment of the present application.
  • Figure 4-3 provides a schematic diagram of the topology structure of the collaborative training system according to an embodiment of the present application.
  • Figure 4-4 provides a schematic diagram of collaborative training according to an embodiment of the present application.
  • Figure 4-5 provides a schematic diagram of dynamically adjusting the grouping of computing nodes according to an embodiment of the present application.
  • Figures 4-6A to 4-6I are flowcharts of a collaborative training method according to an embodiment of the present application.
  • Figures 4-7A to 4-7I are schematic diagrams of a collaborative training device according to an embodiment of the present application.
  • Figures 4-8 are structural diagrams of an electronic device provided by an embodiment of the present application.
  • This application proposes a chip design structure that can be used for collaborative computing in a multi-chip system, which can at least partially overcome the problem of excessive communication overhead that makes communication unable to be completely covered by calculations, and improve computing efficiency and hardware resource utilization.
  • Figure 1-1 shows a chip structure according to an exemplary embodiment of the present application.
  • the chip shown in Figure 1-1 can be used to build a multi-chip system to perform calculation tasks such as deep learning collaborative computing.
  • the chip can be an artificial intelligence chip.
  • the chip 100 includes a data bus 110 and a memory 120 connected to the data bus 110, a data receiver RX, an arithmetic processing unit 130, and a data transmitter TX.
  • the data bus 110 may include a NOC (network-on-chip), but the application is not limited thereto.
  • the data receiver RX is configured to receive first data and header information from the outside, write the first data to the corresponding area of the memory 120 through the data bus 110, and configure the corresponding arithmetic processing unit 130 and/or data transmitter TX according to the header information.
  • the memory 120 may be, for example, a DRAM memory, but the application is not limited thereto.
  • the data receiver RX may disassemble the first data according to the header information.
  • the data receiver RX may include a SERDES interface, a receive data buffer, a decoder, and a DMA unit, etc., as described later with reference to FIGS. 1-2A or 1-2B, but the application is not limited thereto.
  • the data receiver RX may include a decompression unit.
  • the arithmetic processing unit 130 is configured to receive first task information, perform arithmetic processing according to the first task information, and perform configuration operations on the data transmitter TX.
  • the arithmetic processing unit 130 may be an artificial intelligence processing unit or a machine learning processing unit.
  • the operation processing unit 130 is configured to store the operation processing result in the memory 120.
  • the data transmitter TX is configured to obtain the second task information and the second data, and output third data based on at least part of the second data.
  • the data transmitter TX may include a transmission decoder, a data reordering buffer, a serial interface, and a transmission buffer. According to some embodiments, the data transmitter TX may further include an arithmetic logic unit and/or compressor.
  • the chip 100 may further include a configuration bus 140, so that the arithmetic processing unit 130, the data receiver RX, and the data transmitter TX are connected to the configuration bus 140 and transfer configuration information to each other through the configuration bus 140.
  • the data receiver RX, the data transmitter TX, and the operation processing unit 130 can transmit data to each other and/or access the memory through the data bus 110.
  • the arithmetic processing unit 130, the data receiver RX, and the data transmitter TX can transmit configuration information to each other through the configuration bus 140, so that the chip 100 according to the embodiment of the present application can be advantageously used for multi-chip collaborative computing.
  • Fig. 1-2A shows a data receiver according to an exemplary embodiment, which can be used in the chip 100 shown in Fig. 1-1.
  • the data receiver RX may include a first serial interface 210, a data buffer 220, a decoder 230, and a DMA unit 240.
  • the data receiver RX can receive the first data and header information transmitted from an external, such as an upstream computing node, through the first serial interface 210.
  • the first serial interface 210 may adopt a SERDES interface, and SERDES is the abbreviation of SERializer (serializer)/DESerializer (deserializer).
  • SERDES includes time division multiplexing (TDM) and point-to-point (P2P) serial communication technologies.
  • the multiple low-speed parallel signals are converted into high-speed serial signals at the transmitting end, and the high-speed serial signals are re-converted into low-speed parallel signals at the receiving end.
  • This point-to-point serial communication technology makes full use of the channel capacity of the transmission medium to increase the transmission speed of the signal, thereby greatly reducing the communication cost.
  • the data buffer 220 is used to buffer the first data from the first serial interface 210.
  • the data buffer 220 can accommodate the overshoot data on the entire link. In this way, the problem that overshoot data cannot be received and is therefore lost can be avoided.
  • the data buffer 220 may also provide data to subsequent modules after the back pressure disappears until the new data transmitted from the upstream is received.
  • the decoder 230 is configured to parse, from the header information, the format and storage address of the first data to be received subsequently, so as to segment the subsequently received first data according to the parsed format. In addition, the decoder 230 may configure corresponding bits of the arithmetic processing unit 130 and the data transmitter TX according to the header information. According to an example embodiment, the decoder 230 also transmits address information to the DMA unit 240.
  • the header information also contains information about the arithmetic processing units and data transmitters that need to be started after the data transmission ends, so that after the decoder 230 writes the received first data to the memory 120 via the data bus 110, the bits corresponding to the arithmetic processing units and/or data transmitters specified in the header information are set to 1.
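  • A schematic software analogue of this header handling is given below; the header fields (size, address, notify_units) are invented for the example and only mirror the control flow, not the actual hardware packet format:

```python
def handle_packet(header: dict, payload: bytes, memory: bytearray, status_bits: dict) -> None:
    """Parse the header, write the payload into the indicated memory area,
    then set the bit of every unit that the header says should be started."""
    addr = header["address"]                        # storage address parsed from the header
    size = header["size"]                           # expected size of the first data
    memory[addr:addr + size] = payload[:size]       # write via the (modelled) data bus
    for unit in header.get("notify_units", []):     # units to start after the transfer ends
        status_bits[unit] = 1                       # corresponding configuration bit set to 1
```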
  • the DMA unit 240 is configured to receive the first data and the storage address from the decoder 230, so as to write the first data into the corresponding area of the memory 120 through the data bus 110.
  • the DMA unit 240 converts the address information into AXI protocol requests or the like, and then writes the data into the memory 120 through the data bus 110. After all the data of a packet has been successfully written into the memory 120, the decoder 230 is notified to perform subsequent actions.
  • the data receiver RX may further include a decompression unit 250 for decompressing the first data from the decoder 230 and sending the decompressed first data to the DMA unit 240.
  • Figs. 1-3A show a data transmitter according to an exemplary embodiment, which can be used in the chip 100 shown in Fig. 1-1.
  • the data transmitter TX may include a transmission decoder 310, a data reordering buffer 320, a transmission buffer 330, and a second serial interface 340.
  • the transmission decoder 310 is configured to package the received second task information into second header information, and send the second header information to the transmission buffer 330.
  • the transmission decoder 310 may also send data read request information to the data reordering buffer 320 according to the second task information.
  • the transmit decoder 310 obtains the addresses, sizes, etc. of the operands and the operation codes between the operands according to the task information, and disassembles the operands into specific memory access requests, so as to obtain the corresponding data from the memory 120 through the data bus 110.
  • the data reordering buffer 320 is configured to obtain and send second data through the data bus 110 according to the data read request information, the second data including at least part of the first data and/or the operation processing result of the operation processing unit 130.
  • the data reordering buffer 320 is required to preserve the order of the received data. According to some embodiments, after the data reordering buffer 320 receives the data, it shifts the data according to the source address and the destination address of the data. When the data in the two data reordering buffers 320 have been shifted into alignment, the data is sent, for example, to the sending buffer 330.
  • the data reordering buffer 320 obtains the second data from the memory 120.
  • the sending buffer 330 is configured to buffer the received data and send the buffered data according to the format of the second serial interface 340.
  • the sending buffer 330 is configured to receive the second header information and receive and buffer the second data, and send the third data according to the format of the second serial interface 340, and the third data includes the second data.
  • the second serial interface 340 is configured to receive and transmit third data.
  • the second serial interface may include SERDES.
  • after the data is buffered by the sending buffer 330, it is integrated into a data stream and then divided into corresponding packets and/or bursts for transmission according to the format accepted by the second serial interface 340.
  • after the downstream node forms back pressure through the second serial interface 340, the sending buffer 330 will continue to absorb the data transmitted from upstream for a short time, so as to avoid forming back pressure on the data bus 110 and blocking data transmission between other units.
  • when the second serial interface 340 releases the back pressure, new data needs to be obtained through the data bus 110 again: a request is sent again, the request reaches the memory 120 through the data bus 110, and the memory 120 returns the data through the data bus 110.
  • during this interval, the sending buffer 330 uses the data it has already stored to prevent the data output to the second serial interface from being interrupted.
  • Figures 1-3B illustrate a data transmitter according to another example embodiment.
  • the data transmitter TX shown in Figure 1-3B is basically the same as that shown in Figure 1-3A, except that the data transmitter TX shown in Figure 1-3B also includes an ALU (Arithmetic Logic Unit) 350.
  • the arithmetic logic unit 350 is configured to perform an operation on at least part of the second data, and send a part or all of the obtained operation result and/or the second data to the sending buffer 330 as the fourth data.
  • the sending buffer 330 receives the second header information and receives and buffers the fourth data from the arithmetic logic unit 350, and sends the third data according to the format of the second serial interface 340, and the third data includes the fourth data.
  • the second serial interface 340 is configured to receive and transmit third data.
  • the ALU 350 performs corresponding addition and subtraction operations on the data transmitted from the data reordering buffer 320 according to the operation code transmitted from the transmit decoder 310 to obtain the data to be transmitted. After sending the second header information packaged according to the task information, the ALU 350 sequentially sends the data to be transmitted to the sending buffer 330.
  • an ALU 350 is added to the data transmitter TX, and a light-weight arithmetic operation is completed in the calculation process, which can improve the processing efficiency of the system and speed up the transmission process.
  • Figures 1-3C show a data transmitter according to another example embodiment.
  • the data transmitter TX shown in Figure 1-3C is basically the same as that shown in Figure 1-3A, except that the data transmitter TX shown in Figure 1-3C also includes a compression unit 360 .
  • the compression unit 360 is configured to compress the second data into fourth data and send to the transmission buffer 330.
  • the sending buffer 330 receives the second header information and receives and buffers the fourth data from the compression unit 360, and sends the third data according to the format of the second serial interface 340, and the third data includes the fourth data.
  • the second serial interface 340 receives and transmits the third data.
  • the compression unit 360 compresses data smaller than a preset threshold.
  • the preset threshold may be 0 by default or may be user-defined.
  • the compression module 360 may be arranged after the ALU 350, so that the ALU completes lightweight computing operations and improves efficiency.
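  • One plausible reading of this threshold-based compression is a simple sparse encoding that keeps only the elements whose magnitude exceeds the threshold; the sketch below illustrates that reading and is not the actual compression format of the compression unit 360:

```python
import numpy as np


def compress(data: np.ndarray, threshold: float = 0.0):
    """Keep only the elements whose magnitude exceeds the threshold, plus their
    positions, so the receiver can reconstruct the full tensor."""
    mask = np.abs(data) > threshold
    return data.shape, np.flatnonzero(mask), data[mask]


def decompress(shape, indices, values) -> np.ndarray:
    """Rebuild the tensor, filling the dropped positions with zeros."""
    out = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    out[indices] = values
    return out.reshape(shape)
```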
  • FIGS. 1-3C For other parts of the data transmitter TX shown in FIGS. 1-3C, please refer to FIGS. 1-3A, which will not be repeated here.
  • Figure 1-4 shows a merging module according to an example embodiment.
  • the merging module 400 can be used in the chip structure shown in FIG. 1-1.
  • the merging module 400 may be provided between the data bus 110 and the operation processing unit 130 or the data transmitter TX. As shown in Figure 1-4, the merging module 400 may include a merging mode unit 410, a task prefetching unit 420, and a task sending unit 430.
  • the merging module 400 arranged before the data transmitter TX is responsible for receiving messages sent by other units, acquiring tasks, and checking whether the corresponding tasks are executable.
  • the task may be disassembled according to the task information, the disassembled subtasks are sent to the transmission decoder 310 for execution, and the information is sent to other units according to the execution result and the task information.
  • the merge mode unit 410 receives and stores execution information of the other operation processing unit 130 and/or the data transmitter TX.
  • the merge mode unit 410 stores the received execution information of other units, and aggregates the execution information from other units, so that the task prefetch unit 420 can read information from it and process it.
  • the structure of the entries stored in the merge mode unit 410 is as shown in Table 1-1.
  • the entry includes three fields: Valid, Bit, and ID.
  • Table 1-1:
    Field | Bit width | Use
    Valid | 1  | Used to indicate whether the entry is valid
    Bit   | 64 | Used to store information about the execution status of each unit
    ID    | 16 | Used to distinguish table entries
  • Valid is used to identify whether the entry is available. If it is 0, it means that all information of the entry is unavailable.
  • whenever a unit sends information to the merge mode unit 410, a new entry is allocated for the information, and the Valid of the corresponding entry is set to 1.
  • when the task prefetch unit 420 decides to clear an entry, it sets the Valid of the corresponding entry to 0.
  • the Bit field can use a one-hot encoding to indicate the collected execution status of each unit: the bit of each unit whose information has been received is set to 1 by hardware, and is cleared to 0 by software through the task prefetch unit 420.
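  • The entry format of Table 1-1 can be modelled in software roughly as follows; the method names (record, all_reported, clear) are illustrative and not part of the specification:

```python
from dataclasses import dataclass


@dataclass
class MergeEntry:
    valid: int = 0      # 1 bit: whether the entry is in use
    bits: int = 0       # 64 bits: one-hot execution status reported by each unit
    entry_id: int = 0   # 16 bits: distinguishes table entries

    def record(self, unit_index: int) -> None:
        """Hardware sets the bit of a unit whose execution information has arrived."""
        self.bits |= 1 << unit_index

    def all_reported(self, required_mask: int) -> bool:
        """Task prefetch unit checks whether every required unit has reported."""
        return (self.bits & required_mask) == required_mask

    def clear(self) -> None:
        """Software, via the task prefetch unit, invalidates the entry."""
        self.valid = 0
        self.bits = 0
```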
  • the task prefetch unit 420 is configured to obtain the first task information from the memory 120 according to the register information configured by the software, process the execution information according to the first task information, and determine and send the configuration information and/or the second task information according to the processing result.
  • the task prefetch unit 420 first obtains task information from the memory 120 according to the software-configured registers TASK HEAD (task header), TASK SIZE (task size), and TASK TAIL (task tail), then processes the Bit field in the merge mode unit 410 according to the task information, and selects whether to send information or continue to wait according to the result.
  • the task information also contains bit clearing information, which can clear the entries corresponding to these IDs based on multiple IDs specified in the task information.
  • the task prefetching unit 420 is further configured to disassemble the corresponding task into multiple transmission subtasks according to the first task information, and send the second task information of the multiple transmission subtasks to the task sending unit 430 according to the execution information.
  • the task sending unit 430 is configured to receive the second task information from the task prefetch unit 420 and send it to the other arithmetic processing unit 130 and/or the data transmitter TX for processing.
  • the task sending unit 430 is configured to monitor the status of the arithmetic processing unit 130 or the data transmitter TX, and send to other arithmetic processing units and/or data transmitters according to the execution end status of the arithmetic processing unit 130 or the data transmitter TX. Send configuration information.
  • the task sending unit 430 monitors the status of the arithmetic processing unit 130 or the data transmitter TX, and when its execution ends normally, it first sends information over the configuration bus 140 to the remaining arithmetic processing units 130 and/or data transmitters TX according to the method recorded in the task information; at the same time, if there is a task that can be sent, a new task is issued for execution.
  • the chip according to the embodiment of the present application can be used to construct a multi-chip system, for example, a multi-chip system including at least one layout structure of a ring structure, a mesh structure, and a tree structure can be configured.
  • the chip according to the embodiment of the present application includes a data receiver, a data transmitter, and an arithmetic processing unit that can communicate with each other, so that it can be better used for multi-chip collaboration.
  • the multiple chips are constructed as a ring-shaped connection structure.
  • Figure 1-5A illustrates a ring connection structure based on a ring topology according to an example embodiment, and
  • Figure 1-5B illustrates a ring connection structure constructed in a 2D-MESH topology according to an example embodiment.
  • the chip or multi-chip system according to the embodiments of the present application can be applied to various electronic devices, including but not limited to supercomputers, cloud servers, smart phones, embedded systems, etc.
  • Figure 1-6 illustrates a method for computing nodes to transmit data according to an embodiment of the present application.
  • the method shown in Figure 1-6 can be executed using the chip or multi-chip system according to the embodiments of the present application, or applied to the chip or multi-chip system according to the embodiments of the present application, but the method of the present application is not limited to this.
  • the data transmission method shown in Figure 1-6 may be used in a system including multiple computing nodes.
  • a computing node may include a chip according to an embodiment of the present application. At least some of the multiple computing nodes execute the aforementioned method.
  • multiple computing nodes are constructed as a ring-shaped connection structure, see, for example, those shown in FIGS. 1-5A and 1-5B.
  • the first data is received through the data receiver RX of the aforementioned chip.
  • data is transmitted through the data transmitter TX of the aforementioned chip.
  • the data is processed by the arithmetic processing unit 130 of the aforementioned chip, and the data is transmitted by the data transmitter TX of the aforementioned chip.
  • the method shown in FIGS. 1-6 may be used to process the transmitted data. That is, after receiving a small portion of data, each computing node can immediately transmit data to the next node. In this mode, after receiving the transmitted data, the intermediate node processes and forwards the data while continuing to receive the data, which can significantly reduce the communication time.
  • FIGS. 1-8 show schematic diagrams of multi-node cooperatively performing convolution operations according to example embodiments.
  • a layer of convolution may be first split into 4 parts in the directions of H and W, which are scattered on 4 computing nodes, and each computing node loads an equal piece of data. Then, within the slice of each computing node, it is further divided into 4 subtasks, and each subtask has an equal load.
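  • The H/W split described above can be written as a simple tiling computation; the sketch below (including the halo rows/columns kept for the overlap exchange) is an assumed bookkeeping scheme for illustration, not the exact partitioning of the embodiment:

```python
def split_conv_layer(height, width, nodes_h=2, nodes_w=2, sub=2, halo=1):
    """Split an H x W feature map into nodes_h * nodes_w node slices and each
    slice into sub * sub subtasks; every region keeps a halo of overlapping
    rows/columns that must be exchanged with neighbouring nodes."""
    regions = []
    node_h, node_w = height // nodes_h, width // nodes_w
    sub_h, sub_w = node_h // sub, node_w // sub
    for ni in range(nodes_h):
        for nj in range(nodes_w):
            for si in range(sub):
                for sj in range(sub):
                    h0 = ni * node_h + si * sub_h
                    w0 = nj * node_w + sj * sub_w
                    region = (max(h0 - halo, 0), min(h0 + sub_h + halo, height),
                              max(w0 - halo, 0), min(w0 + sub_w + halo, width))
                    regions.append(((ni, nj), (si, sj), region))
    return regions
```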
  • the dark colored blocks are subtasks that have been executed, and the light colored blocks are subtasks waiting to be executed.
  • after the data receiver of a computing node receives the data transmitted from a neighboring computing node, it can notify the corresponding operation processing unit (deep learning processing unit) that the related subsequent tasks now have the conditions for execution. For example, after the second step is executed, the execution of the subtasks in the middle two columns ends and the overlapping data is transmitted to the corresponding computing nodes, so all the data required by the 4 subtasks on the upper and lower sides of the second layer is fully prepared and those subtasks have the conditions for execution. In this way, for each computing node, after the convolution calculation of the first layer is completed, the convolution calculation of the second layer can be started immediately.
  • each computing node gives priority to executing the subtasks adjacent to other chips, and as each such subtask is executed, the overlapping data can be sent to the corresponding adjacent computing node.
  • in this way, the corresponding split subtasks are executed in the same order and become sendable in sequence, which ensures that even if the calculation rates of two computing nodes do not match, the fast computing node can still execute continuously without waiting for the slow computing node to finish executing and transmitting data.
  • Figures 1-9 show schematic diagrams of multi-node cooperative execution of classification layer operations according to example embodiments.
  • the output data is divided into 8 groups, taking the fifth group of data as an example.
  • the input data is further divided into 12 groups and placed on 4 computing nodes, with the 3 groups having the same filling shape placed on the same node. That is, groups 0, 4, and 8 are placed on computing node 0 for calculation; groups 1, 5, and 9 on computing node 1; groups 2, 6, and 10 on computing node 2; and groups 3, 7, and 11 on computing node 3.
  • each computing node first calculates the 3 sets of input data loaded by itself, and obtains the partial sum corresponding to the fifth set of output data. Then start the merge and add transmission process. Each computing node adds up its own partial sum data with the received partial sum data, and then passes the result of the sum to the next computing node. At the same time, when each computing node is transmitting data, it can start to calculate the sixth group of output data. Therefore, at this time, the entire topology includes the mutual transmission process of the fifth group of partial sums and the calculation process of the sixth group of partial sums.
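  • The overlap between transmitting one output group and computing the next can be sketched with a single background worker; compute() and transmit() are caller-supplied placeholders and the scheduling below is only an illustration of the idea:

```python
from concurrent.futures import ThreadPoolExecutor


def pipeline_output_groups(compute, transmit, n_groups):
    """Overlap the ring transmission of output group g with the local computation
    of group g + 1; compute(g) returns the partial sums of group g and
    transmit(g, partial) performs the merge-and-forward for group g."""
    if n_groups <= 0:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        partial = compute(0)                      # first group has nothing to overlap with
        for g in range(n_groups):
            nxt = pool.submit(compute, g + 1) if g + 1 < n_groups else None
            transmit(g, partial)                  # communication hidden behind the next computation
            partial = nxt.result() if nxt is not None else None
```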
  • 4 computing nodes may be connected in a ring.
  • the merging process can be as follows: first, computing node 1 sends its partial sum to computing node 2; then computing node 2 adds the received data to its local partial sum data and transmits the result to computing node 3; afterwards, computing node 3 adds the received data to its local partial sum data and passes the result to computing node 0; finally, computing node 0 adds the received data and saves the result locally.
  • computing node 0 can directly start the merging process and send the data to computing node 1.
  • the transmission process still uses slice transmission, that is, as long as a computing node receives part of the data transmitted by the previous computing node, it can immediately add it to the local partial sum data (or perform other operations) and then immediately transmit the obtained partial result to the downstream computing node.
  • the bit setting operation can be performed on the corresponding data transmitter.
  • after receiving the data transmitted by the upstream node, the data receiver performs a bit setting operation on the corresponding data transmitter. Therefore, if the corresponding data transmitter finds through bit monitoring that the arithmetic processing unit has completed the corresponding subtask operation and the corresponding data receiver has also completed the data reception, it can fetch the locally calculated partial sum and the received data from the memory, add them, and then package and transmit the data to the downstream computing node. In this way, according to the exemplary embodiment, the problem that communication cannot be completely covered by calculation due to excessive communication overhead can be overcome, and calculation efficiency can be improved.
  • Figures 1-10 show schematic diagrams of multi-chip asynchronous parallel collaborative training according to example embodiments.
  • the starting computing node may include a parameter service node.
  • the filled computing nodes are group 1, and the unfilled computing nodes are group 2.
  • the purpose of dividing into two groups is to be able to synchronize only part of the computing nodes when the computing power of multiple computing nodes does not match, thereby reducing the waiting overhead between different computing nodes.
  • each computing node saves its data locally after completing the local batch training.
  • the control node notifies the initial computing node to initiate a request for adding the weight gradient data.
  • the initial computing node (parameter service node) sends a request for obtaining gradient data according to its historical state. The request contains not only the generation number of the update but also which nodes need to be merged. Since the initial computing node itself does not participate in the merging, it only sends the request to the next computing node. The first computing node that needs to participate in the merging sends its gradient data to the next computing node.
  • when a subsequent computing node receives the data and needs to participate in the merging, then upon receiving the first slice of data, if the first local slice of data is also ready, it immediately performs the addition locally and then transmits the slice to the next computing node.
  • when a computing node obtains the request, it calculates the difference between the generation number contained in the request and the generation number of the local weight gradient data. If the difference meets expectations, the weight gradient data of the computing node needs to be merged into this transmission, and once the local weight gradient data is ready, the data transmitter can start the corresponding subtask.
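  • This generation check can be expressed as a small predicate; the parameter names and the expected difference of 1 below are assumptions made for the example:

```python
def should_merge(request_generation: int,
                 local_generation: int,
                 local_gradient_ready: bool,
                 expected_diff: int = 1) -> bool:
    """Decide whether this node's weight gradient data should be merged into the
    current transmission and whether the transmitter subtask may be started."""
    return (request_generation - local_generation) == expected_diff and local_gradient_ready
```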
  • the corresponding data transmitter can obtain the data transmitted by the upstream computing node and the locally calculated weight gradient data from the DRAM memory, perform the addition to obtain the new weight gradient data, and then
  • pass the weight gradient data to the downstream node through the SERDES. As shown in Figure 1-10, all computing nodes in group 2 add up the data as they send it, integrating the local weight gradient data into the transmitted data.
  • the subsequent computing node When the subsequent computing node receives the data, if it does not need to participate in the merging, it will immediately transmit the slice to the next computing node when it receives the first slice of data. For example, all computing nodes in group 1 will transmit the data directly without processing.
  • the last computing node When the last computing node receives the data, it indicates that all nodes have completed the merging operation, thereby obtaining the final new weight. At this time, the initial computing node (parameter service node) starts the weight broadcast process. When broadcasting weight data, all computing nodes save and update the local weight backup and forward the weight data to the next computing node until the last computing node. At this point, all transmissions are completed.
  • when the initial computing node (parameter service node) receives the merged data transmitted back, it first updates the local copy. Then, the updated new weights are broadcast to all computing nodes through the ring topology; at the same time, a label is attached to the information to indicate the generation of the weight data. After a computing node receives the corresponding weight data, it updates the generation of its local weight data and then uses the new weight data in the next round of training; meanwhile, the weight gradient data obtained by its training carries the label attached to the new weight data.
  • the control node only needs to communicate with the initial computing node. Therefore, before transmission, each merging node does not need to communicate with the control node separately, saving a synchronization communication overhead.
  • the request can be initiated without waiting for each node to be ready, and each computing node can control it according to its local execution status.
  • since each computing node transmits asynchronously, the merging process of the second packet can be started before the first packet has been fully merged.
  • the merge and broadcast processes are combined. Therefore, this solution greatly reduces the overall cost.
  • Figures 1-11 show schematic diagrams of electronic devices according to exemplary embodiments of the present application.
  • the electronic device 1100 may include a central processing unit 1110, an acceleration module 1120, and a memory 1130.
  • the acceleration module 1120 is communicatively connected with the central processing unit 1110, and includes a plurality of chips 100 according to the present application.
  • the memory 1130 stores computer programs. When the computer program stored in the memory 1130 is executed by the central processing unit 1110, the central processing unit 1110 can obtain the result of the accelerated operation through the acceleration module 1120.
  • the chip and the multi-chip system, the electronic device and the data transmission method according to the embodiments of the present application have at least one or more of the following advantages.
  • the chip according to the embodiment of the present application includes a data receiver, a data transmitter, and an arithmetic processing unit that can communicate with each other, so that it can be better used for multi-chip collaboration.
  • the chip design according to the embodiment of the present application can be used for collaborative computing in a multi-chip system, and can at least partially overcome the problem of excessive communication overhead that makes communication unable to be completely covered by calculations, and improve computing efficiency and hardware resource utilization.
  • the communication overhead is transparent to the computing node and is almost imperceptible.
  • an ALU is added to the data transmitter to complete lightweight arithmetic operations during the calculation process, which can improve the processing efficiency of the system and speed up the transmission process.
  • the use of the chip and the multi-chip system of the present application can streamline calculation and transmission data, thereby covering transmission overhead and improving computing efficiency and hardware resource utilization.
  • a mechanism for triggering coordination among the data transmitter, the data receiver, and the arithmetic processing unit is added to the chip, so that a system using the chip can not only maximize the parallelism of calculation and communication, but also achieve an extremely high parallel speedup.
  • modules in the above-mentioned embodiments can be distributed in the device according to the description of the embodiment, or can be changed to be located in one or more devices different from this embodiment.
  • the modules in the above-mentioned embodiments can be combined into one module or further divided into multiple sub-modules.
  • Clause A1 A chip including a data bus and, connected to the data bus, a memory, a data receiver, an arithmetic processing unit, and a data transmitter, wherein the data receiver is configured to receive first data and header information from the outside, write the first data to the corresponding area of the memory through the data bus, and configure the corresponding arithmetic processing unit and/or data transmitter according to the header information; the arithmetic processing unit is configured to receive first task information, perform arithmetic processing according to the first task information, and perform configuration operations on the data transmitter; and the data transmitter is configured to obtain second task information and second data, and output third data to the outside based on at least part of the second data.
  • Clause A2 The chip according to clause A1, further comprising: a configuration bus, wherein the arithmetic processing unit, the data receiver, and the data transmitter are connected to the configuration bus so as to transmit configuration information to each other through the configuration bus.
  • Clause A3 The chip according to clause A1, wherein the data receiver is further configured to disassemble the first data according to the header information.
  • Clause A4 The chip according to clause A1, wherein the data receiver includes: a first serial interface; a data buffer for buffering the first data from the first serial interface; a decoder for parsing the format and storage address of the first data from the header information, segmenting the first data according to the format of the first data, and configuring the corresponding bits of the arithmetic processing unit and the data transmitter according to the header information; and a DMA unit for receiving the first data and the storage address from the decoder, so as to write the first data into the corresponding area of the memory through the data bus.
  • Clause A5 The chip according to clause A1, wherein the data receiver further includes: a decompression unit configured to decompress the first data from the decoder and send the decompressed first data to the DMA unit.
  • Clause A6 The chip according to clause A1, wherein the data transmitter includes a transmission decoder, a data reordering buffer, a transmission buffer, and a second serial interface, wherein the transmission decoder is configured to pack the received second task information into second header information, send the second header information to the sending buffer, and send data read request information to the data reordering buffer according to the second task information;
  • the data reordering buffer is configured to acquire and send the second data through the data bus according to the data read request information, the second data including at least part of the first data and/or the arithmetic processing result;
  • the sending buffer is configured to buffer the received data and send the buffered data according to the format of the second serial interface.
  • Clause A7 The chip according to clause A6, wherein the sending buffer is configured to receive the second header information, receive and buffer the second data, and send the third data according to the format of the second serial interface, the third data including the second data; and the second serial interface is configured to receive and send the third data.
  • Clause A8 The chip according to clause A6, wherein the data transmitter further includes an arithmetic logic unit, wherein the arithmetic logic unit is configured to perform an operation on at least part of the second data and send part or all of the obtained operation result and/or the second data to the sending buffer as fourth data; wherein the sending buffer is configured to receive the second header information, receive and buffer the fourth data from the arithmetic logic unit, and send the third data according to the format of the second serial interface, the third data including the fourth data; and wherein the second serial interface is configured to receive and send the third data.
  • Clause A9 The chip according to clause A6, wherein the data transmitter further includes a compression unit, wherein the compression unit is configured to compress the second data into fourth data and send it to the sending buffer; Wherein, the sending buffer is configured to receive the second header information and receive and buffer the fourth data from the compression unit, and send the third data according to the format of the second serial interface, the first The third data includes the fourth data; wherein the second serial interface is configured to receive and send the third data.
  • Clause A10 The chip according to clause A1, which further includes a merging module provided between the data bus and the arithmetic processing unit or the data transmitter, and the merging module includes a merging mode unit and task prefetching Unit and task sending unit, wherein the merging mode unit receives and stores execution information of other arithmetic processing units and/or data transmitters; wherein, the task prefetching unit is configured to read from the memory according to register information configured by software Acquire the first task information, process the execution information according to the first task information, and determine and send configuration information and/or the second task information according to the processing result; wherein, the task sending unit is configured to The second task information is received from the task prefetch unit and sent to other arithmetic processing units and/or data transmitters.
  • Clause A11 The chip according to clause A10, wherein the task prefetching unit is further configured to disassemble the corresponding task into multiple transmission subtasks according to the first task information, and to send the second task information of the multiple transmission subtasks to the task sending unit according to the execution information.
  • Clause A12 The chip according to clause A10, wherein the task sending unit is further configured to monitor the state of the arithmetic processing unit or the data transmitter, and to send configuration information to other arithmetic processing units and/or data transmitters according to the execution-end state of the arithmetic processing unit or the data transmitter.
  • Clause A13 The chip according to clause A1, wherein the data bus includes a NOC.
  • Clause A14 The chip according to clause A1, wherein the chip is an artificial intelligence chip, and the arithmetic processing unit is an artificial intelligence processing unit or a machine learning processing unit.
  • Clause A15 The chip according to clause A1, wherein the data receiver, the data transmitter, and the arithmetic processing unit transmit data to each other and access the memory via the data bus.
  • Clause A16 The chip according to clause A2, wherein the data receiver, the data transmitter, and the arithmetic processing unit transmit data to each other and access the memory via the data bus; the arithmetic processing unit, The data receiver and the data transmitter transmit configuration information to each other through the configuration bus.
  • Clause A17 A multi-chip system comprising a plurality of chips according to any one of clauses A1-A16.
  • Clause A18 The multi-chip system according to clause A17, wherein the plurality of chips are configured in a layout structure including at least one of a ring structure, a mesh structure, and a tree structure.
  • Clause A19 The multi-chip system according to clause A18, wherein the plurality of chips are constructed as a ring connection structure.
  • Clause A20 An electronic device comprising the chip according to any one of clauses A1-A16 or the multi-chip system according to any one of clauses A17-A19.
  • Clause A21 A method for a computing node to transmit data, including: starting to receive first data; after receiving a part of the first data, forwarding that part of the first data while continuing to receive the first data; and/or after receiving a part of the first data, processing that part of the first data and forwarding the processing result while continuing to receive the first data.
  • Clause A22 A data transmission method, comprising: using the chip according to any one of clauses A1-A16 to execute the method for computing node data transmission according to clause A21.
  • Clause A23 A data transmission method for a system including multiple computing nodes, wherein at least some of the multiple computing nodes perform the method according to clause A21 or A22.
  • Clause A24 The data transmission method according to clause A23, wherein the plurality of computing nodes are constructed as a ring connection structure.
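  • As an illustration of the receive-and-forward behavior described in clauses A21-A24, the following is a minimal Python sketch; it is not part of the claimed hardware, and the chunked `recv_chunks` source and the `forward`/`process` callables are hypothetical stand-ins. It shows how a computing node can forward, or process and forward, each received portion of the first data while reception of the remaining portions continues.

```python
from typing import Callable, Iterable, Optional

def receive_and_forward(recv_chunks: Iterable[bytes],
                        forward: Callable[[bytes], None],
                        process: Optional[Callable[[bytes], bytes]] = None) -> bytes:
    """Receive first data chunk by chunk; forward (and optionally process)
    each chunk as soon as it arrives instead of waiting for the whole payload."""
    received = bytearray()
    for chunk in recv_chunks:          # reception of the first data is still ongoing
        received.extend(chunk)         # keep accumulating the first data locally
        if process is not None:
            forward(process(chunk))    # process the received part and forward the result
        else:
            forward(chunk)             # or simply forward the received part as-is
    return bytes(received)

# Hypothetical usage: chunks arrive from a serial interface, results go to the next node.
if __name__ == "__main__":
    chunks = (bytes([i]) * 4 for i in range(3))
    receive_and_forward(chunks, forward=lambda b: print("forwarded", b.hex()))
```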
  • This application relates to the field of information processing technology, in particular to a neural network convolution operation method, device and related products.
  • The artificial neural network is one of the most common computation models among intelligent methods.
  • In the calculation process of each network layer of the neural network and in the process of neural network training, there are both communication time for data communication and calculation time for processing data.
  • the term “if” can be interpreted as “when” or “once” or “in response to determination” or “in response to detection” depending on the context.
  • the phrase “if determined” or “if [the described condition or event] is detected” can be interpreted, depending on the context, as “once determined”, “in response to determination”, “once [the described condition or event] is detected”, or “in response to detection of [the described condition or event]”.
  • Figure 2-1 provides a schematic diagram of a neural network architecture.
  • the neural network architecture can include a multi-layer structure, as shown in Figure 2-1, which can include: an input layer, convolutional layer 1, a batchnorm layer, convolutional layer 2, intermediate layers (there are different intermediate layers according to neural network architectures with different functions, and the intermediate layer can be at least one layer), convolutional layer n, a fully connected layer, and an activation (for example, activation function: softmax) layer.
  • In the neural network architecture, a layer with a large amount of calculation, such as a convolutional layer or a fully connected layer, can be called a calculation layer; of course, in practical applications, the calculation layer may also include other types of layers.
  • The neural network architecture in Figure 2-1 provided by this application is only for illustration, and the neural network in this application is not limited to the architecture shown in Figure 2-1.
  • Figure 2-2 provides a schematic diagram of a multi-core system according to an embodiment of the present application.
  • the multi-core system can be a neural network chip.
  • the multi-core system includes 16 cores (CORE) and 4 storage nodes.
  • the 16 cores are connected to the 4 DRAM storage nodes through a ring-shaped NOC.
  • the core of the multi-core system can be a computing core in a neural network chip, and the type of storage node can be any type of memory, such as dynamic random access memory (Dynamic Random Access Memory, DRAM), static random access memory (Static Random Access Memory, SRAM), etc.
  • In Figure 2-2, the multi-core system has 16 cores and 4 storage nodes as an example.
  • the multi-core system may include any number of cores and any number of storage nodes, which all fall within the scope of this application.
  • Figure 2-3 provides a schematic diagram of a convolution algorithm according to an embodiment of the present application.
  • the overall data is allocated to each computing node, and the input data of each computing node needs to be determined according to the number of computing nodes.
  • if the number of computing nodes in the artificial intelligence processor is N,
  • the overall data can be divided into N parts of data, and the N parts of data are respectively regarded as the input data of the N computing nodes.
  • the overall data can be divided into multiples of N (for example, 2N, 3N, etc.) partial data.
  • the overall data can also be divided into data with less than N parts.
  • the overall data can also be divided into data of any number of parts.
  • Each computing node may also store all of the weights, in which case only the input neurons are split among the computing nodes.
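  • The data-parallel split described above can be sketched in Python as follows; this is only an illustrative sketch, and the split axis and equal-sized parts are assumptions rather than details specified by this application.

```python
import numpy as np

def split_overall_data(overall: np.ndarray, num_nodes: int) -> list[np.ndarray]:
    """Split the overall input along the batch (first) dimension into one part
    per computing node; the split axis and equal-sized parts are assumptions."""
    return np.array_split(overall, num_nodes, axis=0)

# Hypothetical usage: 4 computing nodes, overall batch of 8 feature maps.
overall_data = np.random.rand(8, 3, 32, 32)   # (batch, channels, height, width)
node_inputs = split_overall_data(overall_data, num_nodes=4)
assert len(node_inputs) == 4
```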
  • As shown in Figure 2-3, there are 4 computing nodes: computing node 1, computing node 2, computing node 3, and computing node 4, which are respectively located in the upper left corner, upper right corner, lower left corner, and lower right corner.
  • the overall data to be processed is divided into 4 input data, and each input data is allocated to a computing node for processing.
  • the computing node 1, the computing node 2, the computing node 3, and the computing node 4 include a neural network chip, and/or a computing core in the neural network chip.
  • any topological structure may be adopted between the computing node 1, the computing node 2, the computing node 3, and the computing node 4, such as a ring, a mesh, a tree, or other structures including a ring.
  • the input data can be split into multiple groups of data to be calculated according to the principle of load balancing, or the input data can be split into multiple groups of data to be calculated along the height and/or width direction.
  • there may be other splitting methods for the input data which are all covered by this application.
  • the foregoing splitting of the input data may be performed after the computing node obtains the input data, or after the input data is split into multiple sets of data to be calculated, the computing node receives the split multiple sets of data to be calculated.
  • the input data of each of computing node 1, computing node 2, computing node 3, and computing node 4 is divided into 4 groups of data to be calculated, that is, the first group of data to be calculated, the second group of data to be calculated, the third group of data to be calculated, and the fourth group of data to be calculated. Then, the computing node performs convolution operations on the four groups of data to be calculated.
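  • The per-node split into groups can be sketched as follows; a 2x2 spatial split along the height and width directions is an assumption chosen for illustration, not the only split permitted by this application.

```python
import numpy as np

def split_into_groups(node_input: np.ndarray, rows: int = 2, cols: int = 2) -> list[np.ndarray]:
    """Split one computing node's input (height, width) into rows*cols groups
    of data to be calculated; a 2x2 spatial split is assumed for illustration."""
    groups = []
    for row_block in np.array_split(node_input, rows, axis=0):       # split along height
        groups.extend(np.array_split(row_block, cols, axis=1))       # then along width
    return groups   # e.g. [group1, group2, group3, group4]

node_input = np.arange(64).reshape(8, 8)
groups = split_into_groups(node_input)
assert len(groups) == 4 and groups[0].shape == (4, 4)
```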
  • Since the data framed by the sliding window may span multiple computing nodes, the overlapping part needs to be transmitted to the corresponding computing node; for example, it is necessary to send the calculation result indicated by the oblique lines of computing node 1 to computing node 2.
  • Computing node 1 sends the calculation result to computing node 2 during the process of performing the convolution calculation and obtaining the calculation result. In this way, the calculation result is sent while calculating, instead of being sent only after the calculation is completed, thereby reducing communication time.
  • other computing nodes that rely on the calculation result to perform the calculation can start the corresponding calculation faster after receiving the calculation result.
  • the data used by other computing nodes when performing the convolution operation of the subsequent convolution layer is called overlapping data.
  • In the calculation result of computing node 1, the part represented by the oblique lines is the data used by computing node 2 to perform the convolution operation of the subsequent convolution layer.
  • In the process of computing node 1 performing the operation on the second group of data to be calculated and obtaining the operation result, it can send the overlapping data, that is, the part represented by the diagonal lines, to computing node 2.
  • In the calculation result of computing node 1 for the fourth group of data to be calculated, the part represented by the oblique lines is the data used when computing node 2 performs the convolution operation of the subsequent convolution layer,
  • the part represented by the vertical lines is the data used when computing node 3 performs the convolution operation of the subsequent convolution layer,
  • and the shaded part is the data used when computing node 2, computing node 3, and computing node 4 perform the convolution operation of the subsequent convolution layer.
  • Accordingly, computing node 1 sends the slashed part to computing node 2, the vertical-line part to computing node 3, and the shaded part to each of computing node 2, computing node 3, and computing node 4.
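  • The following Python sketch illustrates this overlap (halo) transfer; the group layout, halo width, and neighbor mapping are assumptions made for a runnable example, not the claimed method. A computing node extracts the border regions of a just-computed result and sends each region to the neighbor that needs it while the remaining groups are still being computed.

```python
import numpy as np

def extract_overlaps(result: np.ndarray, halo: int) -> dict[str, np.ndarray]:
    """Extract the border regions of one group's result that neighboring
    computing nodes need for the next convolution layer (halo width is assumed)."""
    return {
        "right":        result[:, -halo:],          # e.g. slashed part -> computing node 2
        "bottom":       result[-halo:, :],          # e.g. vertical-line part -> computing node 3
        "bottom_right": result[-halo:, -halo:],     # e.g. shaded part -> nodes 2, 3 and 4
    }

def send_overlaps(result: np.ndarray, halo: int, send) -> None:
    """Send each overlapping region to its destination(s) as soon as the result exists."""
    overlaps = extract_overlaps(result, halo)
    send("node2", overlaps["right"])
    send("node3", overlaps["bottom"])
    for dst in ("node2", "node3", "node4"):
        send(dst, overlaps["bottom_right"])

# Hypothetical usage with a stub transport.
send_overlaps(np.ones((6, 6)), halo=1, send=lambda dst, data: print(dst, data.shape))
```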
  • any order can be used for the 4 groups of data to be calculated.
  • the computing node preferentially executes the convolution operation of a set of data to be operated whose operation result is used by other computing nodes.
  • Figure 2-4 provides a schematic diagram of a convolution algorithm according to another embodiment of the present application. As shown in Figure 2-4, the execution sequence is indicated by the different arrow styles (solid, dashed, and dotted arrows). The number on an arrow indicates which group of data is to be calculated; for example, 1 represents the first group of data to be calculated.
  • the execution sequence is: the second group of data to be calculated, the third group of data to be calculated, the fourth group of data to be calculated, and the first group of data to be calculated.
  • the execution sequence is: the first group of data to be calculated, the third group of data to be calculated, the fourth group of data to be calculated, and the second group of data to be calculated.
  • the execution sequence is: the fourth group of data to be calculated, the second group of data to be calculated, the first group of data to be calculated, and the third group of data to be calculated.
  • the execution sequence is: the third group of data to be calculated, the first group of data to be calculated, the second group of data to be calculated, and the fourth group of data to be calculated.
  • Figures 2-3 and 2-4 only illustrate an implementation of the execution sequence of multiple sets of data to be calculated. All other execution sequences of multiple sets of data to be calculated that can be imagined by those skilled in the art under the enlightenment of the above-mentioned embodiments fall within the scope of this application.
  • The data operated on by each computing node is divided into multiple groups of data to be calculated, and the convolution operation of the groups whose operation results are used by other computing nodes is performed first, so that the other computing nodes can obtain the data required for their convolution operations faster without waiting for the calculation of multiple groups of data to be completed.
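  • The scheduling idea can be sketched as follows; the `needed_by_others` flags and group IDs are illustrative data structures assumed for the example. Groups whose results other nodes depend on are convolved first, and their overlapping data is sent as soon as each result is available.

```python
def schedule_groups(groups: list[dict]) -> list[dict]:
    """Order the groups so that those whose results other nodes depend on run first."""
    return sorted(groups, key=lambda g: not g["needed_by_others"])

def run_node(groups, convolve, send_overlap):
    for group in schedule_groups(groups):
        result = convolve(group["data"])
        if group["needed_by_others"]:
            send_overlap(group["id"], result)   # send while later groups are still pending

# Hypothetical ordering for computing node 1 in Figure 2-4 (order: 2, 3, 4, then 1).
groups = [
    {"id": 1, "data": ..., "needed_by_others": False},
    {"id": 2, "data": ..., "needed_by_others": True},
    {"id": 3, "data": ..., "needed_by_others": True},
    {"id": 4, "data": ..., "needed_by_others": True},
]
print([g["id"] for g in schedule_groups(groups)])   # -> [2, 3, 4, 1]
```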
  • each computing node executes the corresponding subsequent neural network layer operation after completing the convolution operation of each data to be calculated.
  • After a computing node completes the convolution operations of its 4 groups of data to be calculated, it can execute the operation of the subsequent neural network layer without waiting for the other computing nodes to complete their respective convolution operations.
  • Subsequent neural network layer operations may be convolution operations, pooling layer operations, classification layer operations, and other operations on other network layers.
  • After each computing node has completed its own convolution operation, it can perform subsequent neural network layer operations without waiting for the calculation of the slowest computing node to complete, thereby improving computing efficiency.
  • Table 2-1 shows the process of computing node 1, computing node 2, computing node 3, and computing node 4 performing convolution operations.
  • Take compute node 1, compute node 2, compute node 3, and compute node 4 jointly performing a two-layer convolution operation as an example.
  • the topology of compute node 1, compute node 2, compute node 3, and compute node 4 is shown in Figure 2-5,
  • and computing node 1, computing node 2, computing node 3, and computing node 4 can send and receive data to and from each other.
  • The second-layer convolution on the second group of data to be calculated (ID10) needs to obtain the calculation result (ID2) of the first-layer convolution on the first group of data to be calculated on computing node 2.
  • ID2 is the calculation result of the first-layer convolution performed by computing node 2 on the first group of data to be calculated.
  • Similarly, the second-layer convolution on the fourth group of data to be calculated (ID11) for computing node 1 needs to obtain the calculation result of the first-layer convolution on the third group of data to be calculated on computing node 2.
  • For computing node 1, it does not need to wait for computing node 2, computing node 3, and computing node 4; even if its execution speed for the calculation of the 3 groups of data to be calculated is faster than that of computing node 2, computing node 3, and computing node 4, there is no need to reduce its execution speed.
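  • The cross-layer dependencies of this two-layer example can be sketched as a small dependency map; the task IDs and the mapping below are assumptions drawn from the example (the ID of computing node 2's third-group first-layer result is not given in the text and is labeled as assumed).

```python
# Hypothetical dependency map for the two-layer example: each second-layer task
# lists the first-layer results (possibly on other nodes) it must receive first.
dependencies = {
    ("node1", "ID10"): [("node2", "ID2")],   # 2nd-layer group 2 on node 1 needs node 2's ID2
    ("node1", "ID11"): [("node2", "ID3")],   # assumed ID for node 2's 3rd-group 1st-layer result
}

def ready(task, completed):
    """A second-layer task may start once all results it depends on have been received."""
    return all(dep in completed for dep in dependencies.get(task, []))

completed = {("node2", "ID2")}
print(ready(("node1", "ID10"), completed))   # True
print(ready(("node1", "ID11"), completed))   # False until node 2's 3rd-group result arrives
```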
  • the convolution operation method includes: Step 2-S601, performing a convolution operation according to the target data to obtain an operation result, the target data being any one of the multiple groups of data to be calculated.
  • the convolution operation to be performed by the computing node 1 includes 4 sets of data to be calculated.
  • Computing node 1 can perform the convolution operation on any group of data to be calculated in a predetermined order, such as the second group of data to be calculated, to obtain the operation result.
  • Step 2-S602: in the process of performing the convolution operation on the target data and obtaining the operation result, when it is determined that the operation result is used by other computing nodes, the operation result is sent to the corresponding other computing nodes.
  • the calculation result is sent while calculating, instead of sending the calculation result after the calculation is completed, thereby reducing the communication time.
  • other computing nodes that rely on the calculation result to perform the calculation can start the corresponding calculation faster after receiving the calculation result.
  • step 2-S602 includes the following sub-step: step 2-S6021, determining overlapping data in the operation result, where the overlapping data is the data used when the other computing nodes perform the convolution operation of the subsequent convolution layer.
  • The operation result includes the data used when computing node 2 performs the convolution operation of the subsequent convolution layer, that is, the overlapping data (indicated by diagonal lines in Figure 2-3).
  • Step 2-S6022 sending the overlapping data to the corresponding other computing nodes.
  • the computing node 1 needs to send it to the computing node 2.
  • step 2-S6022 includes the following sub-steps: sending the overlapping data to the corresponding one or more other computing nodes.
  • The operation result includes the data used when computing node 2, computing node 3, and computing node 4 perform the convolution operation of the subsequent convolution layer,
  • that is, the overlapping data (represented by diagonal lines, vertical lines, and shading in Figure 2-3).
  • the computing node 1 needs to send it to each of the computing node 2, the computing node 3, and the computing node 4 accordingly.
  • step 2-S601 includes the following sub-steps: step 2-S6011, preferentially execute the convolution operation of the target data whose operation result is used by the other computing nodes.
  • the computing node preferentially executes the convolution operation of a group of data to be operated whose operation results are used by other computing nodes.
  • the execution sequence is: the second group of data to be calculated, the third group of data to be calculated, the fourth group of data to be calculated, and the first group of data to be calculated.
  • The data operated on by each computing node is divided into multiple groups of data to be calculated, and the convolution operation of the groups whose operation results are used by other computing nodes is performed first, so that the other computing nodes can obtain the data required for their convolution operations faster without waiting for the calculation of multiple groups of data to be completed.
  • the convolution operation method further includes: step 2-S603, determining the data to be calculated and/or the input data of each computing node according to the number of computing nodes in the artificial intelligence processor.
  • the overall data can be divided into N parts of data, and these N parts of data are used as the input data of the N computing nodes.
  • the overall data can be divided into multiples of N (for example, 2N, 3N, etc.) partial data.
  • the overall data can also be divided into data with less than N parts.
  • the overall data can also be divided into data of any number of parts.
  • the convolution operation method further includes: Step 2-S604: Split the input data into multiple groups of data to be operated on.
  • the input data can be split into multiple groups of data to be calculated according to the principle of load balancing, or the input data can be split along the height direction and/or width direction into multiple groups of data to be calculated.
  • there may be other splitting methods for the input data which are all covered by this application.
  • the foregoing splitting of the input data may be performed after the computing node obtains the input data, or after the input data is split into multiple sets of data to be calculated, the computing node receives the split multiple sets of data to be calculated.
  • the input data of each computing node is divided into 4 groups of data to be calculated, namely the first group of data to be calculated, the second group of data to be calculated, the third group of data to be calculated, and the fourth group of data to be calculated. Then, the computing node performs convolution operations on the four groups of data to be calculated.
  • the convolution operation method further includes: step 2-S605, receiving the multiple sets of data to be calculated.
  • In some embodiments, before the computing node obtains the input data, the input data has already been split into multiple groups of data to be calculated.
  • the computing node receives the split multiple sets of data to be calculated.
  • the convolution operation method further includes: Step 2-S606, after completing the convolution operation of each data to be operated, execute the corresponding subsequent neural network layer operation.
  • After a computing node shown in Figure 2-3 completes the convolution operations of its 4 groups of data to be calculated, it can execute the subsequent neural network layer operation without waiting for the other computing nodes to complete their respective convolution operations. Subsequent neural network layer operations may be convolution operations, pooling layer operations, classification layer operations, and other operations in other network layers.
  • After each computing node has completed its own convolution operation, it can perform subsequent neural network layer operations without having to wait for the calculation of the slowest computing node to complete, thereby improving computing efficiency.
  • the convolution operation method further includes: step 2-S607, when the data to be calculated includes receiving operation results of other computing nodes, determining whether the reception of the operation results of the other computing nodes has been completed.
  • Step 2-S608 in a case where it is determined that the reception of the operation result of the other computing node is completed, perform a convolution operation according to the target data.
  • the computing node 1 determines that it has completed the reception of the computing node 2's operation result of the first group of data to be operated on, and can perform the operation on the second group of data to be operated on.
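  • A minimal Python sketch of this wait-for-reception check follows; the `ResultMailbox` class and its methods are hypothetical constructs assumed for illustration of steps 2-S607 and 2-S608, not part of the claimed device.

```python
import threading

class ResultMailbox:
    """Tracks an operation result received from another computing node (illustrative only)."""
    def __init__(self):
        self._event = threading.Event()
        self.result = None

    def deliver(self, result):        # called when the other node's result finishes arriving
        self.result = result
        self._event.set()

    def wait_until_received(self, timeout=None) -> bool:
        return self._event.wait(timeout)

def run_dependent_convolution(mailbox: ResultMailbox, convolve, own_data):
    # Steps 2-S607/2-S608: only start once the other node's result is fully received.
    if mailbox.wait_until_received():
        return convolve(own_data, mailbox.result)

# Hypothetical usage: the other node's result arrives, then the dependent operation runs.
mbox = ResultMailbox()
mbox.deliver([1, 2, 3])
print(run_dependent_convolution(mbox, lambda own, other: own + other, [0, 0, 0]))
```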
  • the above description mainly focuses on the actions performed by the computing node 1.
  • Those skilled in the art should note that the above descriptions of the actions performed by computing node 1 are also applicable to computing node 2, computing node 3, and computing node 4.
  • Although four computing nodes are used here, those skilled in the art can understand that the number of computing nodes can be arbitrary according to actual application requirements.
  • the computing node 1, the computing node 2, the computing node 3, and the computing node 4 include a neural network chip, and/or a computing core in the neural network chip.
  • any topological structure may be adopted between the computing node 1, the computing node 2, the computing node 3, and the computing node 4, such as a ring, a mesh, a tree, or other structures including a ring.
  • In this way, the operation result is sent to the corresponding other computing nodes that need it while the calculation is still in progress, instead of being sent only after the calculation is completed, which reduces communication time; in addition, the data calculated by each computing node is divided into multiple groups of data to be calculated, and priority is given to the convolution operation of the groups whose results are used by other computing nodes.
  • As a result, a computing node can obtain the data required for its convolution operation faster without waiting for all groups of data to be calculated; furthermore, each computing node can execute the subsequent neural network layer operation after completing its own convolution operation, without waiting for the calculation of the slowest computing node to complete, thereby improving calculation efficiency.
  • steps in the flowcharts of FIGS. 2-6A to 2-6G are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in sequence in the order indicated by the arrows. Unless there is a clear description in this article, there is no strict order for the execution of these steps, and these steps can be executed in other orders. Moreover, at least part of the steps in FIGS. 2-6A to 2-6G may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times. The order of execution of these sub-steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with other steps or at least a part of the sub-steps or stages of other steps.
  • the present invention also provides a neural network convolution operation device.
  • the neural network convolution operation device includes:
  • the first execution unit 2-701 is configured to perform a convolution operation according to target data to obtain an operation result, and the target data is any one of a plurality of groups of data to be calculated.
  • the first execution unit 2-701 is configured to: preferentially execute the convolution operation of the target data whose operation result is used by the other computing nodes.
  • The sending unit 2-702 is configured to, in the process of performing the convolution operation on the target data and obtaining the operation result, send the operation result to the corresponding other computing nodes when it is determined that the operation result is used by the other computing nodes.
  • the sending unit 2-702 is configured to: determine overlapping data in the operation result, where the overlapping data is the data used when the other computing nodes execute the convolution operation of the subsequent convolution layer; and send the overlapping data to the corresponding other computing nodes.
  • the sending unit 2-702 is configured to send the overlapping data to the corresponding one or more other computing nodes.
  • the convolution operation device further includes: a first determining unit 2-703, configured to determine the data to be calculated and/or the input data of each computing node according to the number of computing nodes in the artificial intelligence processor.
  • the convolution operation device further includes: a splitting unit 2-704, configured to split the input data into multiple groups of data to be calculated.
  • the input data can be split into multiple groups of data to be calculated according to the principle of load balancing, or the input data can be split along the height direction and/or width direction into multiple groups of data to be calculated.
  • there may be other splitting methods for the input data which are all covered by this application.
  • the foregoing splitting of the input data may be performed after the computing node obtains the input data, or after the input data is split into multiple sets of data to be calculated, the computing node receives the split multiple sets of data to be calculated.
  • the convolution operation device further includes: a receiving unit 2-705, configured to receive the multiple sets of data to be operated on.
  • In some embodiments, before the computing node obtains the input data, the input data has already been split into multiple groups of data to be calculated.
  • the computing node receives the split multiple sets of data to be calculated.
  • the convolution operation device further includes: a second execution unit 2-706, configured to execute the corresponding subsequent neural network layer operation after completing the convolution operation of each data to be operated.
  • the convolution operation device further includes: a second determining unit 2-707, configured to determine, when the data to be calculated includes receiving the operation results of other computing nodes, whether the reception of the operation results of the other computing nodes has been completed.
  • the third execution unit 2-708 is configured to perform a convolution operation according to the target data when it is determined that the reception of the operation result of the other computing node is completed.
  • The calculation result is sent to the corresponding other computing nodes that need it during the process of performing the convolution operation and obtaining the calculation result, that is, the calculation result is sent while calculating instead of only after the calculation is completed, which reduces communication time; and the data calculated by each computing node is divided into multiple groups of data to be calculated, with priority given to the convolution operation of the groups whose results are used by other computing nodes.
  • As a result, a computing node can obtain the data required for its convolution operation faster without waiting for all groups of data to be calculated; in addition, each computing node can execute the subsequent neural network layer operation after completing its own convolution operation, without waiting for the calculation of the slowest computing node to complete, thereby improving calculation efficiency.
  • Figure 2-8 provides an electronic device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor.
  • When the processor executes the computer program, it implements the methods and detailed solutions shown in Figures 2-6A to 2-6G.
  • the above device embodiments are only illustrative, and the device of the present disclosure may also be implemented in other ways.
  • the division of units/modules in the above-mentioned embodiments is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units, modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together.
  • the above-mentioned integrated unit/module can be implemented in the form of hardware or software program module.
  • the hardware may be a digital circuit, an analog circuit, and so on.
  • the physical realization of the hardware structure includes but is not limited to transistors, memristors and so on.
  • the processor or chip may be any appropriate hardware processor, such as CPU, GPU, FPGA, DSP, ASIC, and so on.
  • the on-chip cache, off-chip memory, and storage can be any suitable magnetic storage medium or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random-access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), and so on.
  • If the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
  • Based on this understanding, the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes: U disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.
  • An embodiment of the present application also provides a computer-readable storage medium that stores a computer program for electronic data exchange, where the computer program causes the computer to execute the methods and detailed solutions shown in Figures 2-6A to 2-6G.
  • the embodiment of the present application also provides a computer program product.
  • the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the methods and detailed solutions shown in Figures 2-6A to 2-6G.
  • Clause B1 A convolution operation method, characterized in that the method is applied to an artificial intelligence processor including multiple computing nodes, the method including: performing a convolution operation according to target data to obtain an operation result, the target data being any one of multiple groups of data to be calculated; and in the process of performing the convolution operation on the target data and obtaining the operation result, when it is determined that the operation result is used by other computing nodes, sending the operation result to the corresponding other computing nodes.
  • Clause B2 The method according to clause B1, characterized in that the sending of the operation result to the corresponding other computing nodes includes: determining overlapping data in the operation result, the overlapping data being the data used when the other computing nodes execute the convolution operation of the subsequent convolution layer; and sending the overlapping data to the corresponding other computing nodes.
  • Clause B3 The method according to clause B2, wherein the sending of the operation result to the corresponding other computing nodes includes: sending the overlapping data to the corresponding one or more other computing nodes.
  • Clause B4 The method according to clause B1, characterized in that the performing of the convolution operation according to the target data to obtain the operation result includes: preferentially executing the convolution operation of the target data whose operation result is used by the other computing nodes.
  • Clause B5 the method described in clause B1, characterized in that the method further comprises: splitting the input data into the multiple sets of data to be calculated.
  • Clause B6 The method according to clause B5, characterized in that the splitting of the input data into the multiple groups of data to be calculated includes: splitting the input data into the multiple groups of data to be calculated according to the principle of load balancing.
  • Clause B7 The method according to clause B5, characterized in that the splitting of the input data into the multiple groups of data to be calculated includes: splitting the input data along the height direction and/or the width direction into the multiple groups of data to be calculated.
  • Clause B8 the method according to clause B5, characterized in that the method further comprises: receiving the multiple sets of data to be calculated.
  • Clause B9 The method according to clause B1 or B5, characterized in that the method further comprises: determining the data to be calculated and/or the input data according to the number of the computing nodes in the artificial intelligence processor.
  • Clause B10 the method described in clause B1, characterized in that the method further comprises: after completing the convolution operation of each data to be operated, performing the operation of the corresponding subsequent neural network layer.
  • Clause B11 The method described in clause B1, characterized in that the method further includes: when the data to be calculated includes receiving the operation result of another computing node, determining whether the reception of the operation result of the other computing node has been completed; and in the case where it is determined that the reception of the operation result of the other computing node is completed, performing the convolution operation according to the target data.
  • Clause B12 the method according to any one of clauses B1 to B11, characterized in that the topological structure formed by the multiple computing nodes includes a ring, a mesh, a tree, or other structures including a ring.
  • Clause B13 the method according to any one of clauses B1 to B12, wherein the computing node includes a neural network chip, and/or a computing core in the neural network chip.
  • Clause B14 A convolution operation device, characterized in that the device is applied to an artificial intelligence processor that includes multiple computing nodes, the device including: a first execution unit, configured to perform a convolution operation according to target data to obtain an operation result, the target data being any one of multiple groups of data to be calculated; and a sending unit, configured to, in the process of performing the convolution operation on the target data and obtaining the operation result, send the operation result to the corresponding other computing nodes when it is determined that the operation result is used by the other computing nodes.
  • Clause B15 The device according to clause B14, characterized in that the sending unit is configured to: determine overlapping data in the operation result, the overlapping data being the data used when the other computing nodes execute the convolution operation of the subsequent convolution layer; and send the overlapping data to the corresponding other computing nodes.
  • Clause B16 the device according to clause B14, characterized in that the sending unit is configured to send the overlapping data to the corresponding one or more other computing nodes.
  • Clause B17 the device according to clause B14, characterized in that the first execution unit is configured to preferentially execute the convolution operation of the target data whose operation result is used by the other computing node.
  • Clause B18 The device according to clause B14, characterized in that the device further includes: a splitting unit, configured to split the input data into the multiple sets of data to be calculated.
  • Clause B19 the device according to clause B18, characterized in that the splitting unit is configured to split the input data into the multiple sets of data to be calculated according to the principle of load balancing.
  • Clause B20 the device according to clause B18, characterized in that the splitting unit is configured to split the input data into the multiple sets of data to be calculated along the height direction and/or the width direction.
  • Clause B21 The device according to clause B14, characterized in that the device further comprises: a receiving unit, configured to receive the multiple sets of data to be calculated.
  • Clause B22 The device according to clause B14 or B18, characterized in that the device further comprises: a first determining unit, configured to determine the data to be calculated and/or the input data according to the number of computing nodes in the artificial intelligence processor.
  • Clause B23 The device according to clause B14, characterized in that the device further includes: a second execution unit, configured to execute the corresponding subsequent neural network layer operation after completing the convolution operation of each data to be calculated.
  • Clause B24 The device according to clause B14, characterized in that the device further includes: a second determining unit, configured to determine, when the data to be calculated includes receiving the operation result of another computing node, whether the reception of the operation result of the other computing node has been completed; and a third execution unit, configured to perform the convolution operation according to the target data when it is determined that the reception of the operation result of the other computing node is completed.
  • Clause B25 The device according to any one of clauses B14 to B24, wherein the topological structure formed by the multiple computing nodes includes a ring shape, a mesh shape, a tree shape, or other structures including a ring shape.
  • Clause B26 the device according to any one of clauses B14 to B25, wherein the computing node includes a neural network chip, and/or a computing core in the neural network chip.
  • Clause B27 An electronic device, characterized in that it includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein when the processor executes the computer program, it implements the method described in any one of clauses B1-B13.
  • Clause B28 a computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method described in any one of clauses B1-B13.
  • Clause B29 a computer program product, characterized in that the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute any one of clauses B1-B13 The method described in the item.
  • This application relates to the field of information processing technology, in particular to a neural network fully connected layer operation method, device and related products.
  • The artificial neural network is one of the most common computation models among intelligent methods.
  • In the calculation process of each network layer of the neural network and in the process of neural network training, there are both communication time for data communication and calculation time for processing data.
  • the term “if” can be interpreted as “when” or “once” or “in response to determination” or “in response to detection” depending on the context.
  • the phrase “if determined” or “if [the described condition or event] is detected” can be interpreted, depending on the context, as “once determined”, “in response to determination”, “once [the described condition or event] is detected”, or “in response to detection of [the described condition or event]”.
  • Figure 3-1 provides a schematic diagram of a neural network architecture.
  • the neural network architecture can include a multi-layer structure, as shown in Figure 3-1, which can include: an input layer, convolutional layer 1, a batchnorm layer, convolutional layer 2, intermediate layers (there are different intermediate layers according to neural network architectures with different functions, and the intermediate layer can be at least one layer), convolutional layer n, a fully connected layer, and an activation (for example, activation function: softmax) layer.
  • A layer with a large amount of calculation, such as a convolutional layer or a fully connected layer, can be called a calculation layer; of course, in practical applications, the above-mentioned calculation layer may also include other types of layers.
  • The neural network architecture in Figure 3-1 provided by this application is only for illustration, and the neural network in this application is not limited to the architecture shown in Figure 3-1.
  • Figure 3-2 provides a schematic diagram of a multi-core system according to an embodiment of the present application.
  • the multi-core system can be a neural network chip.
  • the multi-core system includes 16 cores (CORE) and 4 storage nodes.
  • the 16 cores are connected to the 4 DRAM storage nodes through a ring-shaped NOC.
  • the core of the multi-core system can be a computing core in a neural network chip, and the type of storage node can be any type of memory, such as dynamic random access memory (Dynamic Random Access Memory, DRAM), static random access memory (Static Random Access Memory, SRAM), etc.
  • a multi-core system has 16 cores and 4 storage nodes.
  • the multi-core system may include any number of cores and any number of storage nodes, which all fall within the scope of this application.
  • Figure 3-3 provides a schematic diagram of a fully connected layer algorithm according to an embodiment of the present application.
  • The number of outputs and the number of computing nodes shown in Figure 3-3 is a specific example for ease of description; inspired by this embodiment, those skilled in the art can think of other numbers of outputs and computing nodes, which are all covered by this application. In addition, Figure 3-3 shows 4 computing nodes cooperating in the fully connected layer operation for the fifth output; those skilled in the art can understand that the 4 computing nodes can also cooperate in the fully connected layer operation for other outputs.
  • The grouping of the input data and the way the input data is distributed to each computing node shown in Figure 3-3 are also a specific example for convenience of description; this application does not limit the grouping of the input data or the way the input data is distributed to each computing node.
  • input data can be divided into 20 groups, and 5 consecutive input groups can be assigned to one computing node.
  • the input data packets are not evenly distributed to multiple computing nodes, that is, the number of input packets allocated to each computing node may be different, and so on.
  • After each computing node obtains its input groups, it can perform calculations. According to the grouping of input data shown in Figure 3-3 and the way the input data is allocated to each computing node, for the fifth output, computing node 1 performs fully connected layer calculations on the 1st, 5th, and 9th groups of input data, computing node 2 on the 2nd, 6th, and 10th groups, computing node 3 on the 3rd, 7th, and 11th groups, and computing node 4 on the 4th, 8th, and 12th groups, and the calculated results are partial sums for the fifth output. Then, the four computing nodes start the merging, adding, and transmitting process: each computing node adds its own partial sum data to the received partial sum data, and then sends the addition result to the next computing node.
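  • Purely as an illustration of this partial-sum accumulation (the group-to-node mapping, weights, and ring order below are assumptions made for a runnable sketch, not the claimed implementation), the following Python code computes each node's partial sum for one output and then accumulates the partial sums around a ring of nodes.

```python
import numpy as np

def partial_sum(groups: list[np.ndarray], weights: list[np.ndarray]) -> float:
    """One node's contribution to a single fully connected output:
    the dot products of its assigned input groups with the matching weights."""
    return float(sum(np.dot(g, w) for g, w in zip(groups, weights)))

def ring_accumulate(partials: list[float]) -> float:
    """Pass the running sum around the ring; the last node holds the final result."""
    running = partials[0]
    for p in partials[1:]:
        running += p           # node i adds its own partial sum to the received sum
    return running             # stored on the designated final node (e.g. node 4)

rng = np.random.default_rng(0)
groups = [[rng.random(8) for _ in range(3)] for _ in range(4)]   # 4 nodes x 3 groups each
weights = [[rng.random(8) for _ in range(3)] for _ in range(4)]
partials = [partial_sum(g, w) for g, w in zip(groups, weights)]
print(ring_accumulate(partials))   # final value for this output
```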
  • Figures 3-4 provide a schematic diagram of a topological structure between computing nodes according to an embodiment of the present application.
  • computing node 1, computing node 2, computing node 3, and computing node 4 form a ring topology.
  • the calculation node 4 is designated as the calculation node that obtains the final addition result.
  • Compute node 1 transmits the first result of its fully-connected layer calculation to computing node 2.
  • After computing node 2 receives the first result from computing node 1, it adds the first result to the second result obtained by computing node 2's own fully connected layer calculation and obtains the first addition result.
  • The first addition result is sent to computing node 3, and computing node 3 adds the first addition result to the third result of its own fully connected layer operation to obtain the second addition result and sends the second addition result to computing node 4; computing node 4 adds the second addition result to the fourth result of its own fully connected layer operation to obtain the third addition result, and finally the third addition result is stored as the final calculation result for the fifth output.
  • For computing node 2, in the process of receiving the first result from computing node 1, it adds the first result and the second result; that is, it performs the addition of the first result and the second result while still receiving the first result, starting the addition as soon as a part of the data of the first result has been received, so it adds while receiving.
  • Likewise, computing node 2 sends the first addition result to computing node 3 while performing the addition of the first result and the second result; that is, as soon as a part of the data of the first addition result has been produced by the addition operation, it starts sending the first addition result, so it sends while adding.
  • The foregoing process of computing while receiving and sending while computing is also applicable to the other computing nodes, namely computing node 1, computing node 3, and computing node 4.
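  • This chunk-level pipelining can be sketched as follows; generators stand in for the serial links, and the chunk sizes and node roles are assumptions. Each received partial-sum chunk is added to the locally computed chunk as soon as it arrives, and the summed chunk is forwarded immediately, so receiving, adding, and sending overlap in time.

```python
from typing import Iterable, Iterator
import numpy as np

def add_and_forward(incoming: Iterable[np.ndarray],
                    local_chunks: Iterable[np.ndarray]) -> Iterator[np.ndarray]:
    """Yield (received chunk + local chunk) as each received chunk arrives,
    so the next node can start consuming before the whole result is received."""
    for received, local in zip(incoming, local_chunks):
        yield received + local        # add while receiving, send while adding

# Hypothetical three-node chain: node1 -> node2 -> node3 (final result at node3).
node1_out = (np.full(4, i, dtype=float) for i in range(3))        # node 1's result, in chunks
node2_local = (np.ones(4) for _ in range(3))                      # node 2's own partial sums
node2_out = add_and_forward(node1_out, node2_local)
node3_local = (np.ones(4) for _ in range(3))
final_chunks = list(add_and_forward(node2_out, node3_local))      # accumulated at node 3
print(np.concatenate(final_chunks))
```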
  • the calculation node 4 is designated as the calculation node for obtaining the final addition result.
  • any other calculation node can also be designated as the calculation node for obtaining the final addition result.
  • Moreover, for different outputs, the computing node that obtains the final addition result may be different. For example, for the fifth output, computing node 4 is designated as the computing node that obtains the final addition result, while for the sixth output, computing node 3 can be designated.
  • Figure 3-4 shows that compute node 1, compute node 2, compute node 3, and compute node 4 form a ring topology.
  • the topological structure formed by multiple computing nodes includes ring, mesh, tree, or other structures including ring, and so on.
  • the computing node 1, the computing node 2, the computing node 3, and the computing node 4 include a neural network chip, and/or a computing core in the neural network chip.
  • After computing node 1, computing node 2, computing node 3, and computing node 4 complete the calculation for the fifth output or the addition operation for the fifth output, they can perform subsequent calculations.
  • the operation for the sixth output can be performed. It can be understood that, after the calculation node completes the calculation for the current output or the addition operation for the current output, the next output operation performed and the current output may not be in the same fully connected layer.
  • In addition, after a computing node completes the calculation for the current output or the addition operation for the current output, it can also perform operations of other neural network layers, such as convolutional layers, pooling layers, and so on.
  • After each computing node has completed its own fully connected layer calculation for the current output, it can perform subsequent fully connected layer calculations for other outputs or other neural network layer calculations without waiting for the slowest computing node to finish, thereby improving calculation efficiency.
  • the fully connected layer calculation method includes: step S501, performing calculation based on input calculation data for the first output to obtain a first result.
  • the computing node 2 performs operations on the second, sixth, and tenth groups of the 12 groups of input data, and obtains an operation result, which is called the second result.
  • Step S502 If it is determined that there is a second result sent by the second computing node for the first output, receive the second result sent by the second computing node.
  • the computing node 1 sends the first result obtained by its calculation to the computing node 2, and the computing node 2 receives the first result from the computing node 1.
  • Step S503 In the process of receiving the second result, perform an addition operation on the first result and the second result to obtain a third result.
  • the computing node 2 adds the second result and the first result to obtain the first addition result.
  • For computing node 2, in the process of receiving the first result from computing node 1, it adds the first result and the second result; that is, while receiving the first result, it performs the addition of the first result and the second result.
  • In some embodiments, the fully connected layer operation method further includes the following step: step S504, in the case where it is determined that the third result is used by a third computing node, sending the third result to the third computing node in the process of performing the addition operation on the first result and the second result to obtain the third result.
  • If computing node 3 needs the first addition result from computing node 2 for subsequent calculations, computing node 2 sends the first addition result to computing node 3.
  • The first addition result is sent to computing node 3 while the addition of the first result and the second result is being performed; that is, as soon as a part of the data of the first addition result has been produced by the addition operation, computing node 2 starts sending the first addition result, sending while adding.
  • In some embodiments, the fully connected layer operation method further includes the following step: step S505, in the case where it is determined that the third result is not used by a third computing node, storing the third result as the final result of the first output.
  • For computing node 4, which is designated as the computing node that obtains the final addition result, the third addition result obtained by its addition operation is stored in computing node 4 as the final calculation result for the fifth output.
  • In some embodiments, the fully connected layer operation method further includes the following step: step S506, in the case where it is determined that there is no second result sent from a second computing node for the first output, sending the first result.
  • For computing node 1, there is no calculation result sent from other computing nodes for the fifth output, so computing node 1 sends the first result to computing node 2.
  • the fully connected layer calculation method further includes the following steps: Step S507, receiving input calculation data for the first output.
  • the input data has 12 groups.
  • the input data can also include other group numbers, which are all covered by this application.
  • the fully connected layer operation method further includes the following steps: Step S508, grouping the received input calculation data for the first output.
  • the received 12 groups of input data can be grouped as shown in Figure 3-3: the 1st, 5th, and 9th groups are assigned to computing node 1, the 2nd, 6th, and 10th groups to computing node 2, the 3rd, 7th, and 11th groups to computing node 3, and the 4th, 8th, and 12th groups to computing node 4.
  • The grouping of the input data and the way the input data is distributed to each computing node shown in Figure 3-3 are a specific example for convenience of description; this application does not limit the grouping of the input data or the way the input data is distributed to each computing node.
  • input data can be divided into 20 groups, and 5 consecutive input groups can be assigned to one computing node.
  • the input data groups may also not be evenly distributed among the multiple computing nodes, that is, the number of input groups allocated to each computing node may be different, and so on.
  • the data groups allocated to each computing node can be spaced at the same interval (in Figure 3-3, the data groups obtained by each computing node are spaced 4 groups apart), or spaced at different intervals.
  • the data groups obtained by each computing node can be separated from each other or contiguous; the number of data groups obtained by each computing node can be the same or different, and so on.
  • Those skilled in the art can adopt any suitable grouping method according to actual needs and specific application scenarios, which all fall within the scope of this application.
  • among the N data groups into which all the input data for the first output is split, the computing node may receive one data group every a groups to form its input calculation data for the first output, where a represents the number of computing nodes and N is an integer multiple of a. In this way, the input data can be more evenly distributed to the computing nodes, so that the amount of calculation data borne by each computing node is closer.
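  • A rough illustration of this interleaved allocation is sketched below; the group count N = 12 and node count a = 4 follow the example of Figure 3-3, and the zero-based indexing is an assumption of the sketch.

```python
def assign_groups(num_groups, num_nodes):
    """Return {node_index: [group indices]}, interleaving groups every num_nodes."""
    assert num_groups % num_nodes == 0, "N must be an integer multiple of a"
    return {node: list(range(node, num_groups, num_nodes)) for node in range(num_nodes)}

# With 12 groups and 4 computing nodes (groups numbered from 0):
# node 0 -> groups 0, 4, 8   (1st, 5th, 9th)
# node 1 -> groups 1, 5, 9   (2nd, 6th, 10th)
# node 2 -> groups 2, 6, 10  (3rd, 7th, 11th)
# node 3 -> groups 3, 7, 11  (4th, 8th, 12th)
print(assign_groups(12, 4))
```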
  • the fully connected layer calculation method further includes the following steps: Step S509, after completing the sum operation of the first result and the second result to obtain the third result, performing a subsequent operation for the second output.
  • the computing node 2 can then perform an operation for the sixth output. It is understandable that after the computing node completes the sum operation for the current output, the operation it performs for the next output may not belong to the same fully connected layer as the current output. In addition, after the computing node completes the calculation for the current output or the addition operation for the current output, it can also perform operations of other neural network layers, such as convolutional layers, pooling layers, and so on.
  • the fully connected layer calculation method further includes the following steps: Step S510, after completing the calculation based on the input calculation data for the first output, perform subsequent calculations for the second output.
  • the computing node 1 can then perform an operation for the sixth output. It is understandable that after the computing node completes the calculation for the current output, the operation it performs for the next output may not belong to the same fully connected layer as the current output. In addition, after the computing node completes the calculation for the current output, it can also perform operations of other neural network layers, such as convolutional layers, pooling layers, and so on.
  • After each computing node has completed its own fully connected layer calculation for the current output, it can perform subsequent fully connected layer calculations for other outputs or calculations of other neural network layers without waiting for the slowest computing node to finish, thereby improving the calculation efficiency.
  • the computing node 1, the computing node 2, the computing node 3, and the computing node 4 include a neural network chip, and/or a computing core in the neural network chip.
  • any topological structure may be adopted between the computing node 1, the computing node 2, the computing node 3, and the computing node 4, such as a ring, a mesh, a tree, or other structures including a ring.
  • multiple computing nodes perform coordinated calculations for one output, and each computing node can perform the summation while receiving the calculation results of other computing nodes and send the sum result while it is being computed; that is, a part of the data is processed as soon as it is received, and a part of the calculation result is sent as soon as it is computed.
  • Although the steps in the flowcharts of FIGS. 3-5A to 3-5H are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and these steps can be executed in other orders. Moreover, at least a part of the steps in FIGS. 3-5A to 3-5H may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
  • the present invention also provides a neural network fully connected layer computing device.
  • the neural network fully connected layer computing device includes: a first computing unit 3-601 configured to perform computing based on input computing data for the first output to obtain a first result.
  • the computing node 2 performs operations on the second, sixth, and tenth groups of the 12 groups of input data, and obtains an operation result, which is called the second result.
  • the first receiving unit 3-602 is configured to receive the second result sent by the second computing node when it is determined that there is a second result sent by the second computing node for the first output
  • the computing node 1 sends the first result obtained by its calculation to the computing node 2, and the computing node 2 receives the first result from the computing node 1.
  • the summation unit 3-603 is configured to perform a summation operation on the first result and the second result in the process of receiving the second result to obtain a third result.
  • the computing node 2 adds the second result and the first result to obtain the first addition result.
  • For computing node 2, in the process of receiving the first result from computing node 1, it adds the first result and the second result; that is, it performs the addition of the first result and the second result while the first result is still being received.
  • the fully connected layer arithmetic device further includes: a first sending unit 3-604, configured to, in the case of determining that the third result is used by a third computing node, send the third result to the third computing node in the process of adding the first result and the second result to obtain the third result.
  • computing node 3 needs the first addition result from computing node 2 for subsequent calculations, then computing node 2 sends the first addition result to computing node 3.
  • That is, in the process of performing the addition of the first result and the second result, as soon as part of the data of the first addition result is obtained, computing node 2 starts sending the first addition result to computing node 3; it sends while performing the addition operation.
  • the fully connected layer computing device further includes: a storage unit 3-605, configured to store the third result as the final result of the first output when it is determined that the third result is not used by a third computing node.
  • For example, the computing node 4 is designated as the computing node that obtains the final addition result, and the third addition result obtained by its addition operation is stored in computing node 4 as the final calculation result of the fifth output.
  • the fully connected layer arithmetic device further includes: a second sending unit 3-606, configured to send the first result when it is determined that there is no second result sent from the second computing node for the first output.
  • For computing node 1, there is no calculation result sent from other computing nodes for the fifth output, so computing node 1 sends the first result to computing node 2.
  • the fully connected layer computing device further includes: a second receiving unit 3-607, configured to receive input calculation data for the first output.
  • the input data has 12 groups.
  • the input data can also be divided into other numbers of groups, which are all covered by this application.
  • the fully connected layer computing device further includes: a splitting unit 3-608, configured to group the received input calculation data for the first output.
  • the received 12 groups of input data can be grouped as shown in Figure 3-3: the 1st, 5th, and 9th groups are allocated to computing node 1, the 2nd, 6th, and 10th groups are allocated to computing node 2, the 3rd, 7th, and 11th groups are allocated to computing node 3, and the 4th, 8th, and 12th groups are allocated to computing node 4.
  • the grouping of input data and the way of distributing input data to each computing node shown in Figure 3-3 are a specific example for the convenience of description. This application does not limit the grouping of input data or the way the input data is distributed to each computing node.
  • other input data groupings and other ways of allocating input data to the computing nodes may also be adopted, which all fall within the scope of this application. For example, the input data can be divided into 20 groups, and 5 consecutive input groups can be assigned to one computing node.
  • the input data groups may also not be evenly distributed among the multiple computing nodes, that is, the number of input groups allocated to each computing node may be different, and so on.
  • the data groups allocated to each computing node can be spaced at the same interval (in Figure 3-3, the data groups obtained by each computing node are spaced 4 groups apart), or spaced at different intervals.
  • the data groups obtained by each computing node can be separated from each other or contiguous; the number of data groups obtained by each computing node can be the same or different, and so on.
  • Those skilled in the art can adopt any suitable grouping method according to actual needs and specific application scenarios, which all fall within the scope of this application.
  • among the N data groups into which all the input data for the first output is split, the computing node may receive one data group every a groups to form its input calculation data for the first output, where a represents the number of computing nodes and N is an integer multiple of a. In this way, the input data can be more evenly distributed to the computing nodes, so that the amount of calculation data borne by each computing node is closer.
  • the fully connected layer arithmetic device further includes: a second arithmetic unit 3-609, configured to perform a subsequent operation for the second output after the addition of the first result and the second result to obtain the third result is completed.
  • the computing node 2 can then perform an operation for the sixth output. It is understandable that after the computing node completes the sum operation for the current output, the operation it performs for the next output may not belong to the same fully connected layer as the current output. In addition, after the computing node completes the calculation for the current output or the addition operation for the current output, it can also perform operations of other neural network layers, such as convolutional layers, pooling layers, and so on.
  • the fully connected layer computing device further includes: a third computing unit 3-610, configured to perform subsequent computing for the second output after completing the computing based on the input computing data for the first output .
  • the computing node 1 can then perform an operation for the sixth output. It is understandable that after the computing node completes the calculation for the current output, the operation it performs for the next output may not belong to the same fully connected layer as the current output. In addition, after the computing node completes the calculation for the current output, it can also perform operations of other neural network layers, such as convolutional layers, pooling layers, and so on.
  • After each computing node has completed its own fully connected layer calculation for the current output, it can perform subsequent fully connected layer calculations for other outputs or calculations of other neural network layers without waiting for the slowest computing node to finish, thereby improving the calculation efficiency.
  • the computing node 1, the computing node 2, the computing node 3, and the computing node 4 include a neural network chip, and/or a computing core in the neural network chip.
  • any topological structure may be adopted between the computing node 1, the computing node 2, the computing node 3, and the computing node 4, such as a ring, a mesh, a tree, or other structures including a ring.
  • multiple computing nodes perform coordinated computing for one output, and each computing node can perform the summation while receiving the computing results of other computing nodes and send the sum result while it is being computed; that is, a part of the data is processed as soon as it is received, and a part of the calculation result is sent as soon as it is computed.
  • the communication time is thereby greatly reduced.
  • Moreover, after each computing node has completed its own fully connected layer calculation for the current output, it can perform subsequent fully connected layer calculations for other outputs or calculations of other neural network layers without waiting for the slowest computing node to finish, thereby improving the calculation efficiency.
  • Figure 3-7 provides an electronic device, including a memory, a processor, and a computer program stored on the memory and running on the processor.
  • when the processor executes the computer program, it implements the methods and detailed schemes shown in Figures 3-5A to 3-5H.
  • the above device embodiments are only illustrative, and the device of the present disclosure may also be implemented in other ways.
  • the division of units/modules in the above-mentioned embodiments is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units, modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together.
  • the above-mentioned integrated unit/module can be implemented in the form of hardware or software program module.
  • the hardware may be a digital circuit, an analog circuit, and so on.
  • the physical realization of the hardware structure includes but is not limited to transistors, memristors and so on.
  • the processor or chip may be any appropriate hardware processor, such as CPU, GPU, FPGA, DSP, ASIC, and so on.
  • the on-chip cache, off-chip memory, and storage can be any suitable magnetic storage medium or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM) ( Dynamic Random Access Memory), Static Random-Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High-Bandwidth Memory HBM (High-Bandwidth Memory), Hybrid Storage Cube HMC (Hybrid Memory Cube) and so on.
  • If the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
  • the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes: U disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.
  • An embodiment of the present application also provides a computer-readable storage medium that stores a computer program for electronic data exchange, wherein the computer program causes the computer to execute the methods and detailed schemes shown in FIGS. 3-5A to 3-5H.
  • the embodiments of the present application also provide a computer program product.
  • the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the methods and detailed schemes shown in FIGS. 3-5A to 3-5H.
  • a fully connected layer computing method which is applied to an artificial intelligence processor including multiple computing nodes.
  • the method includes: performing computing based on input calculation data for the first output to obtain a first result; in the case where it is determined that there is a second result sent from a second computing node for the first output, receiving the second result sent by the second computing node; and in the process of receiving the second result, adding the first result and the second result to obtain a third result.
  • Clause C2, the method described in Clause C1, further includes: in the case where it is determined that the third result is used by a third computing node, in the process of performing a sum operation on the first result and the second result to obtain the third result, sending the third result to the third computing node.
  • Clause C3 the method described in Clause C1, further includes: in a case where it is determined that the third result is not used by a third computing node, storing the third result as the final result of the first output.
  • Clause C4 the method of clause C1, further includes: sending the first result in a case where it is determined that there is no second result sent from the second computing node for the first output.
  • Clause C5 the method described in any one of clauses C1 to C4, further includes: receiving input calculation data for the first output.
  • Clause C6 the method of clause C5, further includes grouping the received input calculation data for the first output.
  • Clause C7, the method according to clause C6, wherein the receiving input calculation data for the first output includes: among the N data groups into which all the input data for the first output is split, receiving one data group every a groups to form the input data for the first output, where a represents the number of computing nodes, and N is an integer multiple of a.
  • Clause C8 the method described in Clause C1 or C2, further includes: performing a subsequent operation on the second output after completing the sum operation of the first result and the second result to obtain the third result.
  • Clause C9 the method described in Clause C4, further includes: after completing the calculation based on the input calculation data for the first output, performing subsequent calculations for the second output.
  • Clause C10 is the method according to any one of clauses C1 to C9, wherein the topological structure formed by the plurality of computing nodes includes a ring, a mesh, a tree, or other structures including a ring.
  • Clause C11 the method according to any one of clauses C1 to C10, wherein the computing node includes a neural network chip or a computing core in the neural network chip.
  • a fully connected layer computing device which is applied to an artificial intelligence processor including multiple computing nodes.
  • the device includes: a first computing unit, configured to perform computing based on input calculation data for the first output to obtain a first result; a first receiving unit, configured to receive the second result sent by the second computing node in the case where it is determined that there is a second result sent by the second computing node for the first output; and an addition unit, configured to perform an addition operation on the first result and the second result in the process of receiving the second result to obtain a third result.
  • the device further includes: a first sending unit, configured to, when it is determined that the third result is used by the third computing node, send the third result to the third computing node in the process of adding the first result and the second result to obtain the third result.
  • the device according to clause C12 further includes: a storage unit, configured to store the third result as the final result of the first output when it is determined that the third result is not used by a third computing node.
  • the device according to clause C12 further includes: a second sending unit, configured to send the first result if it is determined that there is no second result sent from the second computing node for the first output.
  • the device further includes: a second receiving unit, configured to receive input calculation data for the first output.
  • the device according to clause C16 further includes: a splitting unit, configured to group the received input calculation data for the first output.
  • Clause C18, the device according to clause C17, wherein the second receiving unit is configured to: among the N data groups into which all the input data for the first output is split, receive one data group every a groups to form the input data for the first output, where a represents the number of computing nodes, and N is an integer multiple of a.
  • the device further includes: a second calculation unit, configured to perform a subsequent operation for the second output after the addition of the first result and the second result to obtain the third result is completed.
  • Clause C20 the device of Clause C15, further includes: a third calculation unit configured to perform subsequent calculations on the second output after completing calculations based on the input calculation data for the first output.
  • the device according to any one of clauses C12 to C20, wherein the topological structure formed by the plurality of computing nodes includes a ring, a mesh, a tree, or other structures including a ring.
  • Clause C22 the device according to any one of clauses C12 to C21, wherein the computing node includes a neural network chip or a computing core in the neural network chip.
  • Clause C23 an electronic device, characterized in that it includes a memory, a processor, and a computer program stored in the memory and running on the processor.
  • when the processor executes the computer program, it implements the method described in any one of clauses C1-C11.
  • Clause C24 a computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method described in any one of clauses C1-C11.
  • Clause C25 a computer program product, characterized in that the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute any one of clauses C1-C11 The method described in the item.
  • This application relates to the field of information processing technology, and in particular to a neural network collaborative training method, device and related products.
  • Artificial neural networks are among the most common computing models in intelligent methods.
  • In the calculation process of each network layer of a neural network and in the process of neural network training, there are both communication time for data communication and calculation time for processing data.
  • the term “if” can be interpreted as “when” or “once” or “in response to determination” or “in response to detection” depending on the context.
  • the phrase “if determined” or “if detected [described condition or event]” can be interpreted as meaning “once determined” or “in response to determination” or “once detected [described condition or event]” depending on the context ]” or “in response to detection of [condition or event described]”.
  • Figure 4-1 provides a schematic diagram of a neural network architecture.
  • the neural network architecture can include a multi-layer structure, as shown in Figure 4-1, which can include: an input layer, convolutional layer 1, a batchnorm layer, convolutional layer 2, intermediate layers (there are different intermediate layers according to neural network architectures with different functions, and there can be at least one intermediate layer), convolutional layer n, a fully connected layer, and an activation (for example, activation function: softmax) layer.
  • the layer with a large amount of calculation can be called a calculation layer, such as a convolutional layer, a fully connected layer, etc., of course, in practical applications, the above-mentioned calculation layer may also include other types of layers.
  • the neural network architecture in Figure 4-1 provided in this application is only for illustration, and the neural network in this application is not limited to the architecture shown in Figure 4-1.
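  • Purely as an illustration of such a layer sequence, the following sketch uses PyTorch as a stand-in framework; the channel sizes, kernel sizes, output width, and the use of torch.nn itself are assumptions of the sketch, not limitations of the architecture of Figure 4-1.

```python
import torch.nn as nn

# A minimal stand-in for the architecture of Figure 4-1: input -> conv1 ->
# batchnorm -> conv2 -> (intermediate layers) -> conv n -> fully connected -> activation.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer 1
    nn.BatchNorm2d(16),                           # batchnorm layer
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # convolutional layer 2
    # ... intermediate layers (at least one, depending on the network's function)
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # convolutional layer n
    nn.Flatten(),
    nn.LazyLinear(10),                            # fully connected layer
    nn.Softmax(dim=1),                            # activation layer (e.g. softmax)
)
```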
  • Figure 4-2 provides a schematic diagram of a multi-core system according to an embodiment of the present application.
  • The multi-core system can be a neural network chip.
  • As shown in Figure 4-2, the multi-core system includes 16 cores (CORE) and 4 storage nodes, and the 16 cores are connected to the 4 DRAM storage nodes through a ring-shaped NOC.
  • The cores of the multi-core system can be computing cores in a neural network chip, and the storage nodes can be any type of memory, such as dynamic random access memory (Dynamic Random Access Memory, DRAM), static random access memory (Static Random Access Memory, SRAM), etc.
  • a multi-core system has 16 cores and 4 storage nodes.
  • the multi-core system may include any number of cores and any number of storage nodes, which all fall within the scope of this application.
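  • A hypothetical sketch of such a ring layout is given below; the placement of the storage nodes on the ring and the grouping of cores are assumptions made only for illustration, and this application does not fix them.

```python
# 16 computing cores and 4 DRAM storage nodes placed on one ring-shaped NOC.
# Here each storage node is assumed to follow every 4 cores; any placement and
# any counts of cores/storage nodes are equally possible.
cores = [f"core{i}" for i in range(16)]
drams = [f"dram{j}" for j in range(4)]

ring = []
for j in range(4):
    ring.extend(cores[4 * j:4 * (j + 1)])
    ring.append(drams[j])

# Each node forwards to its clockwise neighbour on the ring.
next_hop = {node: ring[(k + 1) % len(ring)] for k, node in enumerate(ring)}
print(next_hop["core3"])   # -> "dram0" under this illustrative placement
```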
  • Figure 4-3 provides a schematic diagram of the topological structure of the collaborative training system according to an embodiment of the present application.
  • the collaborative training system includes a control node and multiple computing nodes. Data can be transferred between the control node and multiple computing nodes.
  • the number of control nodes and computing nodes are 1 and 8, respectively.
  • The numbers of control nodes and computing nodes can be arbitrary.
  • Although Fig. 4-3 shows the control node and the computing nodes adopting a ring topology, this is only to facilitate the description of a specific implementation manner of the solution of this application. It should be noted that, according to actual needs and specific applications, any topological structure can be adopted between the control node and the computing nodes, such as a ring, a mesh, a tree, or other structures including a ring.
  • the control node includes a parameter service node.
  • the control node and the computing node include a neural network chip or a computing core in the neural network chip.
  • Figure 4-4 provides a schematic diagram of collaborative training according to an embodiment of the present application.
  • the control node sends a gradient update data signal to all computing nodes.
  • the gradient update data signal may include the identifications of the computing nodes whose weight gradient data is required. For example, the control node wishes to obtain the weight gradient data of computing node 1, computing node 2, computing node 4, and computing node 5. Then, after receiving the gradient update data signal, each computing node confirms whether it meets the conditions of the gradient update data signal.
  • the gradient update data signal may further include an algebraic (generation) identification of the updated weight gradient data.
  • the computing node compares the algebra of the updated weight gradient data with the algebra identified by its local weight gradient data; if the difference between the two is in line with expectations, the computing node merges the local weight gradient data into this training transmission.
  • For example, the algebraic identification of the updated weight gradient data is 8, the predetermined algebraic difference is 3, and the algebraic identification of the local weight gradient data is 5; the algebraic difference meets expectations, so the computing node merges its local weight gradient data into this training transmission.
  • the gradient update data signal may include both the computing node identifications that require the weight gradient data of the related computing nodes and the algebraic identification of the updated weight gradient data.
  • If a computing node meets both the computing node identification requirement and the algebraic identification requirement of the gradient update data signal, its local weight gradient data needs to be merged into this training transmission.
  • the control node sends the gradient update data signal.
  • In the process in which each computing node determines whether it meets the requirements of the gradient update data signal, the computing nodes automatically form groups, so that when the computing power of the computing nodes does not match, only some of the computing nodes need to be synchronized, thereby reducing the waiting overhead between different computing nodes and improving computing efficiency.
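  • A minimal sketch of this merge decision is given below, assuming the signal carries a set of requested node identifiers and a generation ("algebra") number, and that the expected generation difference is a fixed constant; all field names and the constant are illustrative assumptions.

```python
def should_merge(signal, node_id, local_generation, expected_diff=3):
    """Decide whether this node merges its local weight gradient data
    into the current training transmission."""
    id_ok = node_id in signal["requested_node_ids"]                      # node identification matches
    gen_ok = (signal["generation"] - local_generation) == expected_diff  # algebraic difference as expected
    return id_ok and gen_ok

# Example from the text: signal generation 8, local generation 5,
# expected difference 3 -> the node merges its local gradients.
signal = {"requested_node_ids": {1, 2, 4, 5}, "generation": 8}
assert should_merge(signal, node_id=2, local_generation=5)
assert not should_merge(signal, node_id=3, local_generation=5)  # node 3 was not requested
```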
  • In this case, the local weight gradient data needs to be merged into this training transmission.
  • For computing node 1, it only needs to send its obtained weight gradient data 1 to computing node 2.
  • the weight gradient data 1 from the computing node 1 and the locally obtained weight gradient data 2 are added, and the result of the addition is sent to the computing node 3.
  • For computing node 3, since it does not meet the conditions of the gradient update data signal, there is no need to merge its local weight gradient data into this training transmission; computing node 3 only needs to send out the weight gradient data received from computing node 2 (direct transmission).
  • The process in which computing node 2 sums the weight gradient data 1 from computing node 1 and the locally obtained weight gradient data 2 and sends the result of the addition to computing node 3 is performed while sending; that is, as soon as a part of the sum result is computed, that part is sent, instead of sending the whole calculation result after the calculation is completed.
  • Similarly, the process in which computing node 3 sends the weight gradient data received from computing node 2 is to send the data while receiving it, that is, to send a part of the data as soon as it is received, instead of waiting until reception is complete before sending. Therefore, the above-mentioned methods of sending while calculating and sending while receiving can effectively reduce the communication time.
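  • The direct transmission (forward while receiving) used by computing node 3 can be sketched as follows; the chunked interface is an assumption of the sketch.

```python
def forward_while_receiving(recv_chunks, send_chunk):
    """Forward data downstream chunk by chunk as it arrives, without waiting
    for the whole transfer to finish (direct transmission)."""
    for incoming in recv_chunks:
        send_chunk(incoming)   # each received chunk is sent out immediately
```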
  • the computing node 4 and the computing node 5 adopt a method similar to that of the computing node 2 for processing and sending data
  • the computing node 6, the computing node 7 and the computing node 8 adopt a method similar to that of the computing node 3 for processing and sending data.
  • When the control node receives the transmitted merged weight gradient data, it updates the weight data and broadcasts the updated weight data to all computing nodes; at the same time, the information is marked with a label indicating the algebra (generation) of the updated weight data. As shown in Figure 4-4, each computing node saves the updated weight data after receiving it, updates its local weight data, and uses the updated weight data for training in the next training round; the weight gradient data obtained then is marked with the label attached to the updated weight data.
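  • The control node behaviour described above could be sketched roughly as follows; the SGD-style update rule, the learning rate, and the message format are illustrative assumptions and not prescribed by this application.

```python
def control_node_step(weights, merged_gradients, generation, broadcast, lr=0.01):
    """Update the weights with the merged weight gradient data, then broadcast
    the updated weights to all computing nodes, tagged with their generation."""
    new_weights = weights - lr * merged_gradients     # simple gradient-descent update (assumption)
    generation += 1                                   # advance the generation ("algebra") label
    broadcast({"weights": new_weights, "generation": generation})
    return new_weights, generation
```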
  • each computing node sends the weight data to the next computing node if there is a next computing node that receives the weight data.
  • computing node 1 sends weight data to computing node 2
  • computing node 2 sends weight data to computing node 3
  • computing node 7 sends weight data to computing node 8.
  • The computing node can receive and send the weight data at the same time; that is, when receiving and forwarding the weight data, it can send a part of the data as soon as that part is received, instead of sending only after the reception is completed.
  • When a computing node transmits its local weight gradient data, the time at which the weight gradient data was generated is attached to the data and passed back to the control node.
  • The control node compares the timestamps returned by the computing nodes of each group, and exchanges the overlapping members of the groups until the timestamps returned by the computing nodes in each group are completely separated from those of the other groups.
  • Figure 4-5 provides a schematic diagram of dynamically adjusting the grouping of computing nodes according to an embodiment of the present application.
  • the original grouping is: computing node 1, computing node 2, computing node 4, and computing node 5 form one group, while computing node 3, computing node 6, computing node 7, and computing node 8 form another group.
  • the control node compares the timestamps returned by computing node 1 to computing node 8; in order to prevent the timestamps of the two groups from overlapping in time, the positions of computing node 3 and computing node 5 need to be exchanged, so the control node exchanges the positions of computing node 3 and computing node 5, realizing the dynamic adjustment of the computing node grouping.
  • the adjusted grouping is: computing node 1, computing node 2, computing node 3, and computing node 4 form one group, and computing node 5, computing node 6, computing node 7, and computing node 8 form another group.
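  • One possible way to detect the overlap and perform the exchange is sketched below; sorting all nodes by timestamp and splitting them is only one simplified interpretation of the regrouping policy, not the only one covered by this application.

```python
def regroup_by_timestamp(group_a, group_b, timestamps):
    """Exchange members between two groups so that their timestamp ranges no
    longer overlap: the first group keeps the earliest nodes, the second the latest."""
    nodes = sorted(group_a + group_b, key=lambda n: timestamps[n])
    return nodes[:len(group_a)], nodes[len(group_a):]

# Example in the spirit of Figure 4-5: if node 5 reports an earlier timestamp than
# node 3, sorting the eight nodes moves node 5 into the first group and node 3 into
# the second, i.e. the two groups exchange these nodes.
ts = {1: 10, 2: 11, 3: 30, 4: 12, 5: 13, 6: 31, 7: 32, 8: 33}
print(regroup_by_timestamp([1, 2, 4, 5], [3, 6, 7, 8], ts))
```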
  • Figure 4-5 shows an example of 8 computing nodes and two groups.
  • the above-mentioned dynamic grouping method can be applied to any other number of computing nodes and any other number of groups.
  • other dynamic grouping methods thought of by those skilled in the art under the inspiration of the above-mentioned embodiments all fall within the scope of this application.
  • the collaborative training method includes:
  • Step 4-S601 Obtain the first weight gradient data.
  • each computing node obtains locally acquired weight gradient data after training.
  • Step 4-S602: in the case where there is second weight gradient data from a second computing node among the multiple computing nodes, in the process of performing an addition operation on the second weight gradient data from the second computing node and the first weight gradient data to obtain updated weight gradient data, sending the updated weight gradient data.
  • For example, for computing node 2, there is weight gradient data 1 from computing node 1; computing node 2 adds the weight gradient data 1 from computing node 1 and the locally obtained weight gradient data 2 to obtain updated weight gradient data, and sends the updated weight gradient data to computing node 3.
  • The process in which computing node 2 adds the weight gradient data 1 from computing node 1 and the locally obtained weight gradient data 2 and sends the result of the addition to computing node 3 is performed while sending; that is, a part of the sum result is sent as soon as it is computed, instead of sending the whole calculation result after the calculation is completed, which can effectively reduce the communication time.
  • the collaborative training method further includes: step 4-S603, sending the first weight gradient data if there is no weight gradient data from the second computing node.
  • the collaborative training method further includes: step 4-S604, receiving and acquiring a gradient update data signal.
  • control node sends a gradient update data signal to all computing nodes.
  • the gradient update data signal may include the identifications of the computing nodes whose weight gradient data is required.
  • For example, the control node wishes to obtain the weight gradient data of computing node 1, computing node 2, computing node 4, and computing node 5; then each computing node, after receiving the gradient update data signal, confirms whether it meets the conditions of the gradient update data signal.
  • the gradient update data signal may further include an algebraic (generation) identification of the updated weight gradient data.
  • the computing node compares the algebra of the updated weight gradient data with the algebra identified by its local weight gradient data; if the difference between the two is in line with expectations, the computing node merges the local weight gradient data into this training transmission.
  • For example, the algebraic identification of the updated weight gradient data is 8, the predetermined algebraic difference is 3, and the algebraic identification of the local weight gradient data is 5; the algebraic difference meets expectations, so the computing node merges its local weight gradient data into this training transmission.
  • the gradient update data signal may include both the computing node identifications that require the weight gradient data of the related computing nodes and the algebraic identification of the updated weight gradient data.
  • If a computing node meets both the computing node identification requirement and the algebraic identification requirement of the gradient update data signal, its local weight gradient data needs to be merged into this training transmission.
  • the control node sends the gradient update data signal.
  • In the process in which each computing node determines whether it meets the requirements of the gradient update data signal, the computing nodes automatically form groups, so that when the computing power of the computing nodes does not match, only some of the computing nodes need to be synchronized, thereby reducing the waiting overhead between different computing nodes and improving computing efficiency.
  • the collaborative training method further includes: step 4-S605, if the requirements for obtaining the gradient update data signal are met, step 4-S602 or step 4-S603 is executed.
  • computing node 1, computing node 2, computing node 4, and computing node 5 meet the requirements of the gradient update data signal, and their local weight gradient data needs to be merged into this training transmission; the merging of the local weight gradient data into this training transmission is implemented through step 4-S602 or step 4-S603.
  • the collaborative training method further includes: step 4-S606, in the case where the requirements of the gradient update data signal are not met and there is second weight gradient data from the second computing node, sending the second weight gradient data in the process of receiving the second weight gradient data.
  • For example, computing node 3 does not meet the requirements of the gradient update data signal and there is weight gradient data from computing node 2; then computing node 3 only needs to send out the weight gradient data received from computing node 2 (direct transmission).
  • The process in which computing node 3 sends the weight gradient data received from computing node 2 is to send the data while receiving it, that is, to send a part of the data as soon as it is received, instead of sending only after the reception is completed. Therefore, the communication time can be effectively reduced.
  • the collaborative training method further includes: step 4-S607, receiving weight data broadcast by the control node.
  • When the control node receives the transmitted merged weight gradient data, it updates the weight data and broadcasts the updated weight data to all computing nodes; at the same time, the information is marked with a label indicating the algebra (generation) of the updated weight data.
  • the collaborative training method further includes: step 4-S608, saving the weight data.
  • each computing node saves the updated weight data after receiving it, updates the local weight data, and uses the updated weight data for training during the next training.
  • the obtained weight gradient data is marked with the label attached to the updated weight data.
  • the collaborative training method further includes: step 4-S609, in the case where there is a third computing node that receives the weight data, sending the weight data to the third computing node in the process of receiving the weight data.
  • each computing node in the process of receiving the updated weight data, each computing node sends the weight data to the next computing node if there is a next computing node that receives the weight data.
  • computing node 1 sends weight data to computing node 2
  • computing node 2 sends weight data to computing node 3
  • ... computing node 7 sends weight data to computing node 8.
  • The computing node can receive and send the weight data at the same time; that is, when receiving and forwarding the weight data, it can send a part of the data as soon as that part is received, instead of sending only after the reception is completed.
  • the collaborative training method further includes: step 4-S610, sending a timestamp for acquiring the first weight gradient data.
  • When each computing node transmits its local weight gradient data, it attaches the time when the weight gradient data was generated to the data and transmits it back to the control node.
  • The control node dynamically adjusts the grouping of the computing nodes according to the timestamps passed back by the computing nodes. For example, in the embodiment shown in Figure 4-5, the control node exchanges the positions of computing node 3 and computing node 5, adjusting the groups to which computing node 3 and computing node 5 belong.
  • In this application, a computing node that meets the requirements of the gradient update data signal adds its local weight gradient data to the weight gradient data from another computing node and sends the sum while it is being computed, instead of sending the calculation result after the calculation is completed; a computing node that does not meet the requirements of the gradient update data signal sends the received weight gradient data while still receiving it from other computing nodes, instead of sending it after the reception is completed. Thus, sending while calculating and sending while receiving can effectively reduce the communication time. Moreover, in the training process, multiple computing nodes are grouped, so that when the computing power of multiple computing nodes does not match, only part of the computing nodes need to be synchronized, thereby reducing the waiting overhead between different computing nodes and improving computing efficiency.
  • the present invention also provides a cooperative training device.
  • the cooperative training device includes: an acquiring unit 4-701, configured to acquire first weight gradient data.
  • each computing node obtains locally acquired weight gradient data after training.
  • the first sending unit 4-702 is configured to, when there is second weight gradient data from a second computing node of the plurality of computing nodes, send the updated weight gradient data in the process of adding the second weight gradient data and the first weight gradient data to obtain the updated weight gradient data.
  • For example, for computing node 2, there is weight gradient data 1 from computing node 1; computing node 2 adds the weight gradient data 1 from computing node 1 and the locally obtained weight gradient data 2 to obtain updated weight gradient data, and sends the updated weight gradient data to computing node 3.
  • The process in which computing node 2 adds the weight gradient data 1 from computing node 1 and the locally obtained weight gradient data 2 and sends the result of the addition to computing node 3 is performed while sending; that is, a part of the sum result is sent as soon as it is computed, instead of sending the whole calculation result after the calculation is completed, which can effectively reduce the communication time.
  • the collaborative training device further includes:
  • the second sending unit 4-703 is configured to send the first weight gradient data when there is no weight gradient data from the second computing node.
  • the collaborative training device further includes:
  • the first receiving unit 4-704 is configured to receive the gradient update data signal.
  • control node sends a gradient update data signal to all computing nodes.
  • the gradient update data signal may include the identifications of the computing nodes whose weight gradient data is required.
  • For example, the control node wishes to obtain the weight gradient data of computing node 1, computing node 2, computing node 4, and computing node 5; then each computing node, after receiving the gradient update data signal, confirms whether it meets the conditions of the gradient update data signal.
  • the gradient update data signal may further include an algebraic (generation) identification of the updated weight gradient data.
  • the computing node compares the algebra of the updated weight gradient data with the algebra identified by its local weight gradient data; if the difference between the two is in line with expectations, the computing node merges the local weight gradient data into this training transmission.
  • For example, the algebraic identification of the updated weight gradient data is 8, the predetermined algebraic difference is 3, and the algebraic identification of the local weight gradient data is 5; the algebraic difference meets expectations, so the computing node merges its local weight gradient data into this training transmission.
  • the gradient update data signal may include both the computing node identifications that require the weight gradient data of the related computing nodes and the algebraic identification of the updated weight gradient data.
  • If a computing node meets both the computing node identification requirement and the algebraic identification requirement of the gradient update data signal, its local weight gradient data needs to be merged into this training transmission.
  • the control node sends the gradient update data signal.
  • In the process in which each computing node determines whether it meets the requirements of the gradient update data signal, the computing nodes automatically form groups, so that when the computing power of the computing nodes does not match, only some of the computing nodes need to be synchronized, thereby reducing the waiting overhead between different computing nodes and improving computing efficiency.
  • the cooperative training device further includes: an executing unit 4-705, configured to execute the first sending unit 4-702 or the second sending unit 4-703 when the requirements of the gradient update data signal are met.
  • As shown in Figure 4-4, computing node 1, computing node 2, computing node 4, and computing node 5 meet the requirements of the gradient update data signal, and their local weight gradient data needs to be merged into this training transmission; the merging of the local weight gradient data into this training transmission is implemented through step 4-S602 or step 4-S603.
  • the collaborative training device further includes: a third sending unit 4-706, configured to, when the requirements of the gradient update data signal are not met and there is second weight gradient data from the second computing node, send the second weight gradient data in the process of receiving the second weight gradient data.
  • For example, computing node 3 does not meet the requirements of the gradient update data signal and there is weight gradient data from computing node 2; then computing node 3 only needs to send out the weight gradient data received from computing node 2 (direct transmission).
  • The process in which computing node 3 sends the weight gradient data received from computing node 2 is to send the data while receiving it, that is, to send a part of the data as soon as it is received, instead of sending only after the reception is completed. Therefore, the communication time can be effectively reduced.
  • the cooperative training device further includes: a second receiving unit 4-707, configured to receive weight data broadcast by the control node.
  • When the control node receives the transmitted merged weight gradient data, it updates the weight data and broadcasts the updated weight data to all computing nodes; at the same time, the information is marked with a label indicating the algebra (generation) of the updated weight data.
  • the collaborative training device further includes: a storage unit 4-708, configured to store the weight data.
  • each computing node saves the updated weight data after receiving it, updates the local weight data, and uses the updated weight data for training during the next training.
  • the obtained weight gradient data is marked with the label attached to the updated weight data.
  • the cooperative training device further includes: a fourth sending unit 4-709, configured to, when there is a third computing node that receives the weight data, send the weight data to the third computing node in the process of receiving the weight data.
  • each computing node in the process of receiving the updated weight data, each computing node sends the weight data to the next computing node if there is a next computing node that receives the weight data.
  • computing node 1 sends weight data to computing node 2
  • computing node 2 sends weight data to computing node 3
  • ... computing node 7 sends weight data to computing node 8.
  • The computing node can receive and send the weight data at the same time; that is, when receiving and forwarding the weight data, it can send a part of the data as soon as that part is received, instead of sending only after the reception is completed.
  • the collaborative training device further includes: a fifth sending unit 4-710, configured to send a timestamp for acquiring the first weight gradient data.
  • When each computing node transmits its local weight gradient data, it attaches the time when the weight gradient data was generated to the data and transmits it back to the control node.
  • The control node dynamically adjusts the grouping of the computing nodes according to the timestamps passed back by the computing nodes. For example, in the embodiment shown in Figure 4-5, the control node exchanges the positions of computing node 3 and computing node 5, adjusting the groups to which computing node 3 and computing node 5 belong.
  • In this application, a computing node that meets the requirements of the gradient update data signal sums its local weight gradient data with the weight gradient data from another computing node and sends the sum while it is being computed, instead of sending the calculation result after the calculation is completed; a computing node that does not meet the requirements of the gradient update data signal sends the received weight gradient data while still receiving it from other computing nodes, instead of sending it after the reception is completed. Thus, sending while calculating and sending while receiving can effectively reduce the communication time. Moreover, in the training process, multiple computing nodes are grouped, so that when the computing power of multiple computing nodes does not match, only part of the computing nodes need to be synchronized, thereby reducing the waiting overhead between different computing nodes and improving computing efficiency.
  • Figures 4-8 provide an electronic device including a memory, a processor, and a computer program stored in the memory and running on the processor.
  • when the processor executes the computer program, it implements the methods and detailed schemes shown in Figures 4-6A to 4-6I.
  • the above device embodiments are only illustrative, and the device of the present disclosure may also be implemented in other ways.
  • the division of units/modules in the above-mentioned embodiments is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units, modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together.
  • the above-mentioned integrated unit/module can be implemented in the form of hardware or software program module.
  • the hardware may be a digital circuit, an analog circuit, and so on.
  • the physical realization of the hardware structure includes but is not limited to transistors, memristors and so on.
  • the processor or chip may be any appropriate hardware processor, such as CPU, GPU, FPGA, DSP, ASIC, and so on.
  • the on-chip cache, off-chip memory, and storage can be any suitable magnetic storage medium or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM) ( Dynamic Random Access Memory), Static Random-Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High-Bandwidth Memory HBM (High-Bandwidth Memory), Hybrid Storage Cube HMC (Hybrid Memory Cube) and so on.
If the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned memory includes media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

An embodiment of the present application also provides a computer-readable storage medium that stores a computer program for electronic data exchange, where the computer program causes a computer to execute the methods and detailed schemes shown in Figures 4-6A to 4-6I.

An embodiment of the present application also provides a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the methods and detailed schemes shown in Figures 4-6A to 4-6I.
The foregoing can be better understood in accordance with the following clauses:

Clause D1: A collaborative training method, applied to an artificial intelligence processor including a plurality of nodes, the plurality of nodes including a control node and a plurality of computing nodes; for any computing node among the plurality of computing nodes, the method includes the following steps: acquiring first weight gradient data; and, in the case where there is second weight gradient data from a second computing node among the plurality of computing nodes, sending the updated weight gradient data during the process of adding the second weight gradient data from the second computing node and the first weight gradient data to obtain the updated weight gradient data.

Clause D2: The method described in Clause D1 further includes: sending the first weight gradient data in the case where there is no weight gradient data from the second computing node.

Clause D3: The method described in Clause D2 further includes: receiving a gradient update data signal; and, if the requirements of the gradient update data signal are met, performing one of the following: in the case where there is second weight gradient data from the second computing node among the plurality of computing nodes, sending the updated weight gradient data during the process of adding the second weight gradient data from the second computing node and the first weight gradient data to obtain the updated weight gradient data; or sending the first weight gradient data when there is no weight gradient data from the second computing node.

Clause D4: The method described in Clause D3 further includes: in the case that the requirements of the gradient update data signal are not met and there is second weight gradient data from the second computing node, sending the second weight gradient data during the process of receiving the second weight gradient data.

Clause D5: The method described in Clause D3 or D4, wherein the gradient update data signal includes a computing node identifier indicating the computing nodes whose weight gradient data is required and/or a generation identifier of the updated weight gradient data.

Clause D6: The method described in Clause D5, wherein the requirements of the gradient update data signal include: belonging to the computing nodes indicated by the computing node identifier; and/or the difference between the generation of the first weight gradient data and the generation of the updated weight gradient data satisfying a preset value.

Clause D7: The method described in any one of Clauses D1 to D6 further includes: receiving weight data broadcast by the control node; saving the weight data, wherein the weight data is used for training; and, in the case where there is a third computing node that receives the weight data, sending the weight data to the third computing node during the process of receiving the weight data.

Clause D8: The method described in any one of Clauses D1 to D7 further includes: sending a time stamp of obtaining the first weight gradient data, wherein the time stamp is used to dynamically group the multiple computing nodes.

Clause D9: The method according to any one of Clauses D1 to D8, wherein the control node includes a parameter service node.

Clause D10: The method according to any one of Clauses D1 to D9, wherein the topological structure formed by the plurality of nodes includes a ring, a mesh, a tree, or another structure that includes a ring.

Clause D11: The method according to any one of Clauses D1 to D10, wherein a node includes a neural network chip or a computing core in the neural network chip.
Clause D12: A collaborative training device, applied to an artificial intelligence processor including a plurality of nodes, the plurality of nodes including a control node and a plurality of computing nodes; for any computing node among the plurality of computing nodes, the device includes: an acquiring unit configured to acquire first weight gradient data; and a first sending unit configured to, in the case where there is second weight gradient data from a second computing node among the plurality of computing nodes, send the updated weight gradient data during the process of adding the second weight gradient data from the second computing node and the first weight gradient data to obtain the updated weight gradient data.

Clause D13: The device according to Clause D12 further includes: a second sending unit configured to send the first weight gradient data when there is no weight gradient data from the second computing node.

Clause D14: The device according to Clause D13 further includes: a first receiving unit configured to receive a gradient update data signal; and an execution unit configured to perform one of the following if the requirements of the gradient update data signal are met: in the case where there is second weight gradient data from the second computing node among the plurality of computing nodes, sending the updated weight gradient data during the process of adding the second weight gradient data from the second computing node and the first weight gradient data to obtain the updated weight gradient data; or sending the first weight gradient data when there is no weight gradient data from the second computing node.

Clause D15: The device according to Clause D14 further includes: a third sending unit configured to, in the case that the requirements of the gradient update data signal are not met and there is second weight gradient data from the second computing node, send the second weight gradient data during the process of receiving the second weight gradient data.

Clause D16: The device according to Clause D14 or D15, wherein the gradient update data signal includes a computing node identifier indicating the computing nodes whose weight gradient data is required and/or a generation identifier of the updated weight gradient data.

Clause D17: The device according to Clause D16, wherein the requirements of the gradient update data signal include: belonging to the computing nodes indicated by the computing node identifier; and/or the difference between the generation of the first weight gradient data and the generation of the updated weight gradient data satisfying a preset value.

Clause D18: The device according to any one of Clauses D12 to D17 further includes: a second receiving unit configured to receive weight data broadcast by the control node; a storage unit configured to save the weight data, wherein the weight data is used for training; and a fourth sending unit configured to, when there is a third computing node that receives the weight data, send the weight data to the third computing node during the process of receiving the weight data.
Clause D19: The device described in any one of Clauses D12 to D18 further includes: a fifth sending unit configured to send a time stamp of acquiring the first weight gradient data, wherein the time stamp is used to dynamically group the multiple computing nodes.
Clause D20: The device according to any one of Clauses D12 to D19, wherein the control node includes a parameter service node.

Clause D21: The device according to any one of Clauses D12 to D20, wherein the topological structure formed by the plurality of nodes includes a ring, a mesh, a tree, or another structure that includes a ring.

Clause D22: The device according to any one of Clauses D12 to D21, wherein a node includes a neural network chip or a computing core in the neural network chip.
Clause D23: An electronic device, characterized in that it includes a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the method described in any one of Clauses D1 to D11 is implemented.
Clause D24: A computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method described in any one of Clauses D1 to D11.

Clause D25: A computer program product, characterized in that the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the method described in any one of Clauses D1 to D11.
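Clauses D5 and D6 (and their device counterparts D16 and D17) specify when a computing node should fold its local gradient into the data passing through it. The fragment below is a minimal sketch of that eligibility test, assuming a dictionary-shaped signal; the field names `node_ids` and `generation` and the `max_staleness` parameter are illustrative stand-ins rather than terms defined by the embodiments, and the sketch combines the two conditions with a logical AND although the clauses allow either or both.

```python
def meets_update_requirements(signal, node_id, local_generation, max_staleness=0):
    """Eligibility test sketched from Clauses D5-D6.

    signal           : {"node_ids": set of nodes whose gradients are requested,
                        "generation": generation carried by the update signal}
    node_id          : identifier of this computing node
    local_generation : generation tag of the locally produced weight gradient
    max_staleness    : preset value for the allowed generation difference
    """
    in_requested_group = node_id in signal["node_ids"]
    fresh_enough = abs(signal["generation"] - local_generation) <= max_staleness
    return in_requested_group and fresh_enough
```

A node for which this test fails simply relays the received weight gradient data while it is still arriving, as Clause D4 requires.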

Abstract

A chip, a multi-chip system, an electronic device, and a data transmission method. The electronic device may include a central processing unit, an acceleration module, and a memory. The acceleration module is communicatively connected to the central processing unit and includes a plurality of the chips.

Description

数据传输方法及相关设备
相关申请的交叉引用
本申请要求于2019年8月31日申请的,申请号为201910819946.3,名称为“芯片和多芯片系统及电子设备和数据传输方法”;申请号为201910819940.6,名称为“一种神经网络卷积运算方法、装置以及相关产品”;申请号为201910819939.3,名称为“一种神经网络全连接层运算方法、装置以及相关产品”;申请号为201910819947.8,名称为“一种神经网络协同训练方法、装置以及相关产品”的中国专利申请的优先权,在此将其全文引入作为参考。
技术领域
本申请涉及芯片技术领域,具体而言,涉及一种芯片、多芯片系统、电子设备和数据传输方法。
背景技术
计算任务的爆炸式增长对芯片设计提出越来越高的要求。以图像识别领域的imagenet挑战赛为例,自从使用深度学习网络之后,图像识别的错误率飞速下降,并且在ResNet网络出来之后,超越了人类的识别精度。但是,与之对应的是,这些深度学习网络的网络规模动辄几百兆字节,其训练的图片数据集动辄上百万个,因此对计算能力的需求飞速膨胀。
为了解决算力的问题,获得更高的性能,更低的功耗,以及量产之后更加低廉的成本,研究者们在努力开发多节点协同解决方案的同时,也在努力设计开发新的芯片结构,希望实现高运算效率和硬件资源的高利用率。
在所述背景技术部分公开的上述信息仅用于加强对本申请的背景的理解,因此它可以包括不构成对本领域普通技术人员已知的现有技术的信息。
发明内容
201910819946.3本申请旨在提供一种芯片和多芯片系统及电子设备和数据传输方法,能够提高运算效率。
本申请的用户特性和优点将通过下面的详细描述变得显然,或部分地通过本申请的实践而习得。
根据本申请的一方面,提供一种芯片,包括数据总线以及与所述数据总线连接的存储器、数据接收器、运算处理单元、数据发送器,其中,所述数据接收器配置为接收来自外部的第一数据和头信息,将所述第一数据通过所述数据总线写入到所述存储器的对应区域,以及根据所述头信息配置对应的运算处理单元和/或数据发送器;所述运算处理单元配置为接收第一任务信息,根据所述第一任务信息执行运算处理并对所述数据发送器执行配置操作;所述数据发送器配置为获取第二任务信息以及第二数据,并基于至少部分所述第二数据向外输出第三数据。
根据本申请的另一方面,提供一种多芯片系统,包括根据本申请的所述芯片。
根据本申请的另一方面,还提供一种电子设备,包括根据本申请的所述芯片或多芯片系统。
根据本申请的另一方面,还提供一种用于计算节点传输数据的方法,包括:开始接收第一数据;在接收到所述第一数据的一部分之后,在继续接收所述第一数据的同时,转发所述第一数据的所述一部分;和/或在接收到所述第一数据的一部分之后,在继续接收所述第一数据的同时,对所述第一数据的所述一部分进行处理并转发处理结果。
根据本申请的另一方面,还提供一种数据传输方法,包括利用根据本申请的芯片执行前述用于计算节点传输数据的方法。
根据本申请的另一方面,还提供一种数据传输方法,包括利用根据本申请的多芯片系统执行前述方法。
根据本申请一些实施例,提供一种芯片结构,克服了随着协同工作的芯片数量的提升,多芯片之间的通信量迅速增大的缺陷。通过在芯片中增加了数据发送器、数据接收器以及运算处理单元之间相互触发协同的机制,可以使得计算和传输数据流水起来,从而能够覆盖传输开销,提高运算效率和硬件资源利用率。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性的,并不能限制本申请。201910819946.3
201910819940.6本申请实施例提供了一种神经网络卷积运算方法、装置及相关产品,可减少数据通信时间,使得通信过程被计算过程覆盖,并提高卷积运算的效率。
第一方面,提供一种卷积运算方法,所述方法应用于包括多个计算节点的人工智能处理器,对于任一计算节点,所述方法包括:根据目标数据执行卷积运算,得到运算结果,所述目标数据为多组待运算数据中的任一组;在对所述目标数据执行卷积运算并得到运算结果的过程中,在确定所述运算结果被其他计算节点使用的情况下,将所述运算结果发送至对应的所述其他计算节点。
第二方面,提供一种卷积运算装置,所述装置应用于包括多个计算节点的人工智能处理器,对于任一计算节点,所述装置包括:第一执行单元,用于根据目标数据执行卷积运算,得到运算结果,所述目标数据为多组待运算数据中的任一组;发送单元,用于在对所述目标数据执行卷积运算并得到运算结果的过程中,在确定所述运算结果被其他计算节点使用的情况下,将所述运算结果发送至对应的所述其他计算节点。
第三方面,提供一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现第一方面的方法。
第四方面,提供一种计算机可读存储介质,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行第一方面提供的方法。
第五方面,提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行第一方面提供的方法。
本申请提供的技术方案在卷积运算的过程中,通过在执行卷积运算并得到运算结果的过程中,将运算结果发送至对应的需要使用该计算结果的其他计算节点,一边计算一边发送计算结果,而不是待计算完成后再发送计算结果,从而减少通信时间;并且,将每个计算节点运算的数据分为多组待运算数据,优先执行运算结果被其他计算节点使用的一组待运算数据的卷积运算,这样计算节点就能更快获得卷积运算所需的数据,而无需等待多组待运算数据全部计算完毕;此外,每个计算节点在计算完自己的卷积运算后,即可执行后续的神经网络层的运算,而无须等待计算最慢的计算节点计算完成,从而提高了运算效率。201910819940.6
201910819939.3本申请实施例提供了一种神经网络全连接层运算方法、装置及相关产品,可减少数据通信时间,使得通信过程被计算过程覆盖,并提高全连接层运算的效率。
第一方面,提供一种全连接层运算方法,所述方法应用于包括多个计算节点的人工智能处理器,对于任一计算节点,所述方法包括:基于针对第一输出的输入计算数据进行运算,得到第一结果;在确定存在来自第二计算节点针对所述第一输出发送的第二结果的情况下,接收所述第二计算节点发送的所述第二结果;以及在接收所述第二结果的过程中,将所述第一结果与所述第二结果进行加和运算得到第三结果。
第二方面,提供一种全连接层运算装置,所述装置应用于包括多个计算节点的人工智能处理器,对于任一计算节点,所述装置包括:第一运算单元,用于基于针对第一输出的输入计算数据进行运算,得到第一结果;第一接收单元,用于在确定存在来自第二计算节点针对所述第一输出发送的第二结果的情况下,接收所述第二计算节点发送的所述第二结果;以及加和单元,用于在接收所述第二结果的过程中,将所述第一结果与所述第二结果进行加和运算得到第三结果。
第三方面,提供一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现第一方面的方法。
第四方面,提供一种计算机可读存储介质,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行第一方面提供的方法。
第五方面,提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行第一方面提供的方法。
本申请提供的技术方案在全连接层运算的过程中,多个计算节点针对一个输出协同运算,每个计算节点能够在接收其他计算节点的运算结果的过程中进行加和,并且在加和获得结果的过程中发送加和的结果,即接收一部分数据就处理一部分数据,计算获得一部分计算结果就发送一部分计算结果,而不是待接收完成后再计算,也不是待计算完成后再发送计算结果,从而大大减少通信时间。此外,每个计算节点在计算完自己的针对当前输出的全连接层运算后,即可执行后续的针对其他输出的全连接层运算或其他神经网络层的运算,而无须等待计算最慢的计算节点计算完成,从而提高了运算效率。201910819939.3
201910819947.8本申请实施例提供了一种神经网络协同训练方法、装置及相关产品,可减少数据通信时间,使得通信过程被计算过程覆盖,并提高协同训练的效率。
第一方面,提供一种协同训练的方法,所述方法应用于包括多个节点的人工智能处理器,所述多个节点包括控制节点以及多个计算节点,对于所述多个计算节点中的任一计算节点,所述方法包括如下步骤:获取第一权值梯度数据;在存在来自所述多个计算节点中的第二计算节点的第二权值梯度数据的情况下,在将来自所述第二计算节点的所述第二权值梯度数据与所述第一权值梯度数据进行加和运算得到更新的权值梯度数据的过程中,发送所述更新的权值梯度数据。
第二方面,提供一种协同训练的装置,所述装置应用于包括多个节点的人工智能处理器,所述多个节点包括控制节点以及多个计算节点,对于所述多个计算节点中的任一计算节点,所述装置包括:获取单元,用于获取第一权值梯度数据;第一发送单元,用于在存在来自所述多个计算节点中的第二计算节点的第二权值梯度数据的情况下,在将来自所述第二计算节点的所述第二权值梯度数据与所述第一权值梯度数据进行加和运算得到更新 的权值梯度数据的过程中,发送所述更新的权值梯度数据。
第三方面,提供一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现第一方面的方法。
第四方面,提供一种计算机可读存储介质,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行第一方面提供的方法。
第五方面,提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行第一方面提供的方法。
本申请提供的技术方案在协同训练的过程中,符合获取梯度更新数据信号的要求计算节点将本地的权值梯度数据与来自另外的计算节点的权值梯度数据进行加和,在加和的过程中,发送加和的结果,即一边计算一边发送计算结果,而不是待计算完成后再发送计算结果;不符合获取梯度更新数据信号的要求计算节点在接收其他计算节点的权值梯度数据的过程中发送所接收的权值梯度数据,在接收过程中发送数据,即一边接收数据一边发送数据,而不是待接收完成后再发送;从而,边计算边发送以及边接收边发送,能够有效减少通信时间;并且,在训练的过程中,对多个计算节点进行分组,从而在多个计算节点算力不匹配的时候,可以只同步部分计算节点,从而减少了不同计算节点之间的等待开销,提高运算效率。201910819947.8
附图说明
通过参考附图阅读下文的详细描述,本公开示例性实施方式的上述以及其他目的、特征和优点将变得易于理解。在附图中,以示例性而非限制性的方式示出了本公开的若干实施方式,并且相同或对应的标号表示相同或对应的部分其中:
图1-1示出根据本申请一示例性实施例的芯片结构。
图1-2A示出根据本申请一示例性实施例的数据接收器。
图1-2B示出根据本申请另一示例性实施例的数据接收器。
图1-3A示出根据本申请一示例性实施例的数据发送器。
图1-3B示出根据本申请另一示例性实施例的数据发送器。
图1-3C示出根据本申请另一示例性实施例的数据发送器。
图1-4示出根据本申请示例实施例的归并模块。
图1-5A示出根据本申请示例实施例的基于环状拓扑的环形连接结构。
图1-5B示出根据本申请示例实施例的在2D-MESH拓扑结构中构建的环形连接结构。
图1-6示出根据本申请实施例的一种用于计算节点传输数据的方法。
图1-7A示出现有技术的数据传输过程的示例。
图1-7B示出图1-6所示方法的数据传输过程的示例。
图1-8示出根据本申请示例实施例的多节点协同执行卷积运算的示意图
图1-9示出根据本申请示例实施例的多节点协同执行分类层运算的示意图。
图1-10示出根据本申请示例实施例的多芯片异步并行协同训练的示意图。
图1-11示出根据本申请示例实施例的电子设备的示意图。
图2-1为一种神经网络构架的结构示意图。
图2-2提供了根据本申请一个实施例的多核系统的示意图。
图2-3提供了根据本申请一个实施例的卷积算法的示意图。
图2-4提供了根据本申请另一个实施例的卷积算法的示意图。
图2-5提供了根据本申请一个实施例的计算节点之间的拓扑结构的示意图。
图2-6A至图2-6G是根据本申请实施例的卷积运算方法的流程图。
图2-7A至图2-7G是根据本申请实施例的卷积运算装置的示意图。
图2-8是本申请实施例提供的一种电子设备的结构图。
图3-1为一种神经网络构架的结构示意图。
图3-2提供了根据本申请一个实施例的多核系统的示意图。
图3-3提供了根据本申请一个实施例的全连接层算法的示意图。
图3-4提供了根据本申请一个实施例的计算节点之间的拓扑结构的示意图。
图3-5A至图3-5H是根据本申请实施例的全连接层运算方法的流程图。
图3-6A至图3-6H是根据本申请实施例的全连接层运算装置的示意图。
图3-7是本申请实施例提供的一种电子设备的结构图。
图4-1为一种神经网络构架的结构示意图。
图4-2提供了根据本申请一个实施例的多核系统的示意图。
图4-3提供了根据本申请一个实施例的协同训练系统的拓扑结果的示意图。
图4-4提供了根据本申请一个实施例的协同训练的示意图。
图4-5提供了根据本申请一个实施例的动态调整计算节点分组的示意图。
图4-6A至图4-6I是根据本申请实施例的协同训练方法的流程图。
图4-7A至图4-7I是根据本申请实施例的协同训练装置的示意图。
图4-8是本申请实施例提供的一种电子设备的结构图。
具体实施方式
201910819946.3
现在将参考附图更全面地描述示例实施例。然而,示例实施例能够以多种形式实施,且不应被理解为限于在此阐述的实施例;相反,提供这些实施例使得本申请将全面和完整,并将示例实施例的构思全面地传达给本领域的技术人员。在图中相同的附图标记表示相同或类似的部分,因而将省略对它们的重复描述。
此外,所描述的特征、结构或特性可以以任何合适的方式结合在一个或更多实施例中。在下面的描述中,提供许多具体细节从而给出对本申请的实施例的充分理解。然而,本领域技术人员将意识到,可以实践本申请的技术方案而没有特定细节中的一个或更多,或者可以采用其它的方法、组元、装置、步骤等。在其它情况下,不详细示出或描述公知方法、装置、实现或者操作以避免模糊本申请的各方面。
附图中所示的方框图仅仅是功能实体,不一定必须与物理上独立的实体相对应。即,可以采用软件形式来实现这些功能实体,或在一个或多个硬件模块或集成电路中实现这些功能实体,或在不同网络和/或处理器装置和/或微控制器装置中实现这些功能实体。
附图中所示的流程图仅是示例性说明,不是必须包括所有的内容和操作/步骤,也不是必须按所描述的顺序执行。例如,有的操作/步骤还可以分解,而有的操作/步骤可以合并或部分合并,因此实际执行的顺序有可能根据实际情况改变。
应理解,虽然本文中可能使用术语第一、第二、第三等来描述各种组件,但这些组件不应受这些术语限制。这些术语乃用以区分一组件与另一组件。因此,下文论述的第一组 件可称为第二组件而不偏离本申请概念的教示。如本文中所使用,术语“及/或”包括相关联的列出项目中的任一个及一或多者的所有组合。
本领域技术人员可以理解,附图只是示例实施例的示意图,附图中的模块或流程并不一定是实施本申请所必须的,因此不能用于限制本申请的保护范围。
发明人发现,在多芯片协同系统中,例如用于深度学习的多芯片系统中,虽然现在单节点的计算峰值得到了指数性的扩展,但是,多芯片之间的通信带宽却扩展有限。随着协同工作的芯片数量的提升,多芯片之间的通信量迅速增大。这样,在多芯片协同系统中,多芯片通信的瓶颈日益显著,导致增加芯片所带来的边际收益迅速减少。
本申请提出一种芯片设计结构,能够用于多芯片系统的协同计算,可以至少部分克服通信开销过大而使得通信无法被计算完全覆盖的问题,提高运算效率和硬件资源利用率。
下面对根据本申请实施例的芯片以及多芯片系统进行详细说明。
图1-1示出根据本申请一示例性实施例的芯片结构。图1-1所示芯片可以用来构建多芯片系统执行深度学习协同计算等运算任务,例如该芯片可以是人工智能芯片。
如图1-1所示,根据示例实施例的芯片100包括数据总线110以及与数据总线110连接的存储器120、数据接收器RX、运算处理单元130、数据发送器TX。
根据示例实施例,数据总线110可以包括NOC(network-on-chip,片上网络),但本申请不限于此。
参照1-1,数据接收器RX配置为接收来自外部的第一数据和头信息,将第一数据通过数据总线110写入到存储器120的对应区域,以及根据头信息配置对应的运算处理单元130和/或数据发送器TX。存储器120可以为例如DRAM存储器,但本申请不限于此。
根据示例实施例,数据接收器RX可根据头信息对第一数据进行拆解。
根据一些实施例,数据接收器RX可包括SERDES接口、接收数据缓冲器、解码器及DMA单元等,如后面参照图1-2A或图1-2B所描述的,但本申请不限于此。可选地,数据接收器RX可包括解压单元。
参照1-1,运算处理单元130配置为接收第一任务信息,根据第一任务信息执行运算处理并对数据发送器TX执行配置操作。
根据一些实施例,运算处理单元130可以是人工智能处理单元或机器学习处理单元。
根据示例实施例,运算处理单元130配置为在存储器120中存储运算处理结果。
参照图1-1,数据发送器TX配置为获取第二任务信息以及第二数据,并基于至少部分第二数据向外输出第三数据。
如后面参照附图所说明的,根据一些实施例,数据发送器TX可包括发送解码器、数据重排序缓冲器、串行接口及发送缓冲器。根据有一些实施例,数据发送器TX还可包括算术逻辑单元和/或压缩器。
根据示例实施例,如图1-1所示,芯片100还可包括配置总线140,从而运算处理单元130、数据接收器RX、数据发送器TX与配置总线140连接且通过配置总线140相互传输配置信息。
这样,根据本申请实施例,数据接收器RX、数据发送器TX和运算处理单元130可通过数据总线110相互传输数据和/或访问存储器。另外,运算处理单元130、数据接收器RX、数据发送器TX可通过配置总线140相互传输配置信息,从而根据本申请实施例的芯片100可有利地用于多芯片协同计算。
图1-2A示出根据一示例性实施例的数据接收器,其可用于图1-1所示芯片100。
如图1-2A所示,根据示例实施例的数据接收器RX可包括第一串行接口210、数据缓冲器220、解码器230以及DMA单元240。
参见图1-2A,数据接收器RX可通过第一串行接口210接收来自外部例如上游计算节点传输过来的第一数据和头信息。第一串行接口210可采用SERDES接口,SERDES是SERializer(串行器)/DESerializer(解串器)的简称。SERDES包括时分多路复用(TDM)、点对点(P2P)的串行通信技术。在发送端多路低速并行信号被转换成高速串行信号,在接收端高速串行信号被重新转换成低速并行信号。这种点对点的串行通信技术充分利用传输媒体的信道容量,提升信号的传输速度,从而大大降低通信成本。
参见图1-2A,数据缓冲器220用于缓存来自第一串行接口210的第一数据。
根据一些实施例,当需要反压上游的数据发送器时,数据缓冲器220能够容纳整个链路上的过冲数据。这样,可以避免由于过冲数据的存在,导致过冲数据无法被接收而丢失的问题。另外,数据缓冲器220也可以在反压消失之后,给后续模块提供数据,直到接收到上游传输过来的新的数据为止。
解码器230用于从头信息解析用于后续接收到的第一数据的格式和存放地址,从而根据解析出来的格式切分后续收到的第一数据。另外,解码器230可根据头信息配置运算处理单元130和数据发送器TX的对应位。根据示例实施例,解码器230还将地址信息发送给DMA单元240。
根据一些实施例,头信息中还包含在数据传输结束之后所需启动的运算处理单元和数据发送器的信息,从而当解码器230将接收到的第一数据通过数据总线110写入到存储器120之后,根据头信息配置的运算处理单元和/或数据发送器对应的位为1。
DMA单元240用于接收来自解码器230的第一数据和存放地址,从而将第一数据通过数据总线110写入到存储器120的对应区域。
根据一些实施例,DMA单元240将地址信息解析为AXI协议等,然后将数据通过数据总线110写入存储器120。同时,在一个包的所有数据全部成功写入到存储器120之后,通知解码器230执行后续行为。
根据一些实施例,如图1-2B所示,数据接收器RX还可包括解压单元250,用于对来自解码器230的第一数据进行解压,并将解压后的第一数据发送给DMA单元240。
图1-3A示出根据一示例性实施例的数据发送器,其可用于图1-1所示的芯片100。
如图1-3A所示,根据示例实施例的数据发送器TX可包括发送解码器310、数据重排序缓冲器320、发送缓冲器330和第二串行接口340。
参见图1-3A,发送解码器310配置为将接收到的第二任务信息打包为第二头信息,并将第二头信息发送至发送缓冲器330。另外,发送解码器310还可根据第二任务信息向数据重排序缓冲器320发送数据读取请求信息。
根据一些实施例,发送解码器310依据任务信息获取操作数的地址、大小等以及操作数之间的操作码,并将操作数拆解为具体的访存请求,以通过数据总线110从存储器120获取对应的数据。
数据重排序缓冲器320配置为根据数据读取请求信息通过数据总线110获取并发送第二数据,第二数据包括至少部分第一数据和/或运算处理单元130的运算处理结果。
由于数据总线110在传输数据的时候,各个数据传输过程会发生超车,因此,需要 数据重排序缓冲器320对收到的数据进行保序。根据一些实施例,数据重排序缓冲器320接收到数据之后,依据数据的源地址和目的地址,对数据进行移位。当两个数据重排序缓冲器320中的数据都进行移位对齐之后发送数据,例如发送到缓冲器330。
根据一些实施例,数据重排序缓冲器320从存储器120获取第二数据。
发送缓冲器330配置为对接收的数据进行缓存,并按照第二串行接口340的格式发送缓存的数据。
根据一些实施例,发送缓冲器330配置为接收第二头信息以及接收并缓存第二数据,以及按照第二串行接口340的格式发送第三数据,第三数据包括第二数据。
第二串行接口340配置为接收并发送第三数据。如前,第二串行接口可以包括SERDES。
根据示例实施例,发送缓冲器330缓存数据之后,将数据整合成一个数据流,然后按照第二串行接口340接受的格式,切分成对应的包(package)和/或突发(burst)进行传输。另外,发送缓冲器330会在下游节点通过第二串行接口340形成反压之后,短时负载上游传递下来的数据,避免对数据总线110形成反压,阻塞其他单元之间传递数据。在第二串行接口340解除反压之后,由于需要重新通过数据总线110获取新的数据,其再发送请求,请求通过数据总线110到达存储器120,存储器120返回数据,数据通过数据总线110返回之前,发送缓冲器330利用自身已经存储的数据来避免向第二串行接口输出的数据形成断流。
图1-3B示出根据另一示例实施例的数据发送器。
如图1-3B所示,图1-3B所示的数据发送器TX与图1-3A所示的基本相同,区别仅在于在图1-3B所示的数据发送器TX还包括ALU(算术逻辑单元)350。
根据示例实施例,算术逻辑单元350配置为对至少部分第二数据进行运算,并将所得到的运算结果和/或第二数据的部分或全部作为第四数据发送给发送缓冲器330。发送缓冲器330接收第二头信息以及接收并缓存来自算术逻辑单元350的第四数据,按照第二串行接口340的格式发送第三数据,第三数据包括第四数据。第二串行接口340配置为接收并发送第三数据。
根据一些实施例,ALU 350根据发送解码器310传输过来的操作码,将数据重排序缓冲器320传输过来的数据进行对应的加减法运算之后,得到需要传输的数据。在根据任务信息打包成的第二头信息发送之后,ALU 350依次将要传输的数据发送给发送缓冲器330。
根据示例实施例,在数据发送器TX增加了ALU 350,在运算过程中完成处理轻量级的运算操作,能够提高系统处理效率,并可加速传输过程。
图1-3B所示的数据发送器TX的其他部分可参考图1-3A,此处不再赘述。
图1-3C示出根据另一示例实施例的数据发送器。
如图1-3C所示,图1-3C所示的数据发送器TX与图1-3A所示的基本相同,区别仅在于在图1-3C所示的数据发送器TX还包括压缩单元360。
根据示例实施例,压缩单元360配置为将第二数据压缩为第四数据并发送给发送缓冲器330。发送缓冲器330接收第二头信息以及接收并缓存来自压缩单元360的第四数据,按照第二串行接口340的格式发送第三数据,第三数据包括第四数据。第二串行接口340接收并发送第三数据。
根据一些实施例,压缩单元360将小于预设阈值的数据压缩,预设阈值可默认为0,也可以用户定义。
根据一些实施例,压缩模块360可设置在ALU 350之后,从而ALU完成轻量级的运算操作,提高效率。
图1-3C所示的数据发送器TX的其他部分可参考图1-3A,此处不再赘述。
图1-4示出根据示例实施例的归并模块。归并模块400可用于图1-1所示的芯片结构。
根据示例实施例,归并模块400可设置在数据总线110与运算处理单元130或数据发送器TX之间。如图1-4所示,归并模块400可包括归并模式单元410、任务预取单元420和任务发送单元430。
例如,设置在数据发送器TX之前的归并模块400负责接收其他单元发送过来的消息、获取任务、查验对应的任务是否可执行。另外,可将任务依据任务信息进行拆解,将拆解得到的子任务发送给发送解码器310进行执行,并依据执行结果和任务信息将信息发送给其他的单元。
根据实施例,归并模式单元410接收并存储其他运算处理单元130和/或数据发送器TX的执行信息。
例如,归并模式单元410存储接收到的其他单元的执行信息,对来自于其他单元的执行信息进行汇总,以便于任务预取单元420从中读取信息并进行处理。
根据一些实施例,归并模式单元410中存储的表项内的结构如表1-1所示。参见表1-1,表项包括Valid、Bit、ID三个字段。
表1-1
名称 位宽 用途
Valid 1 用以表示该表项是否有效
Bit 64 用以存放各个单元执行状态的信息
ID 16 用以区分表项
Valid用以标识该表项是否可用,如果其为0,则表示该表项所有信息不可用。每当由一个单元发送信息过来之后,新分配表项,例如每当由一个单元发送信息至归并模式单元410后,为该信息新分配表项,则将对应表项的Valid置1。每当任务预取单元420决定清除表项的时候,将对应表项的Valid置0。Bit可使用onehot(独热码)形式,表示收集到的各个单元的执行状态。由硬件接收各个单元的信息置1,由软件通过任务预取单元420进行清0操作。例如,每当一个单元发送一个64bit的ID为In的配置信息Bn过来之后,如果在所有已存的表项之中,没有能匹配上对应ID的表项,则将Bn存入表项之中。如果In在已存的表项中有对应的匹配项,则将已存的信息B和Bn进行或操作,再存入表项,即B=Bn|B。
任务预取单元420配置为根据软件配置的寄存器信息从存储器120获取第一任务信息,根据第一任务信息对执行信息进行处理并根据处理结果确定并发送配置信息和/或第二任务信息。
例如,任务预取单元420依据软件配置的寄存器TASK HEAD(任务头)、TASK SIZE(任务大小)和TASK TAIL(任务尾)信息,首先从存储器120获取任务信息,然后依据任务信息对归并模式单元410中的Bit进行处理,依据结果选择是否发送还是继续等待信息。在任务信息中包含64bit的MASK(掩码信息)以及所需要归并的多个ID。然后依据 需要归并的ID,从归并模式单元410中取出对应ID的Bit信息,进行归并,得到结果记为Br。最后,将归并的结果与MASK进行或操作,R=Mask|Br。如果R全为1,则该任务可以发送;否则,重新获取各个ID的对应的bit信息,重新进行查询操作。在任务信息中,还包含bit清除信息,其可以依据任务信息中指定的多个ID,清除这些ID对应的表项。
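The bookkeeping just described, i.e. the Valid/Bit/ID entries of Table 1-1 and the readiness test R = MASK | Br, can be summarised with a short software model. This is only an illustrative sketch under the assumption of a dictionary-based table; the class and helper names are invented for the example and do not correspond to hardware signals.

```python
ALL_ONES = (1 << 64) - 1  # width of the Bit field in Table 1-1

class MergeTable:
    """Software model of the merge-mode unit's Valid/Bit/ID entries."""

    def __init__(self):
        self.entries = {}  # ID -> 64-bit status word (absent ID = invalid entry)

    def report(self, entry_id, bits):
        # New status bits are OR-ed into an existing entry (B = Bn | B);
        # a fresh entry is allocated when the ID has no match.
        self.entries[entry_id] = self.entries.get(entry_id, 0) | bits

    def clear(self, entry_ids):
        # Modelled after the bit-clear information carried in the task.
        for entry_id in entry_ids:
            self.entries.pop(entry_id, None)

def task_ready(table, mask, merge_ids):
    """Readiness test of the task prefetch unit: all ones in R = MASK | Br."""
    br = 0
    for entry_id in merge_ids:
        br |= table.entries.get(entry_id, 0)
    return (mask | br) == ALL_ONES
```

When `task_ready` returns False, the prefetch unit re-reads the Bit information of the relevant IDs and repeats the query, as described above.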
根据一些实施例,任务预取单元420还配置为根据第一任务信息将相应任务拆解为多个传输子任务,并根据执行信息发送多个传输子任务的第二任务信息给任务发送单元430。
任务发送单元430配置为从任务预取单元420接收第二任务信息并发送给其他运算处理单元130和/或者数据发送器TX进行处理。
根据一些实施例,任务发送单元430配置为监听运算处理单元130或数据发送器TX的状态,并根据运算处理单元130或数据发送器TX的执行结束状态向其他运算处理单元和/或数据发送器发送配置信息。
例如,任务发送单元430监听运算处理单元130或数据发送器TX的状态,如果其正常执行结束,则首先依据任务信息中记载的方式,通过配置总线140去向其余的运算处理单元130和/或发送数据单元TX发送信息,同时如果有任务已经可以发送,则发送新的任务进行执行。
根据本申请实施例的芯片可用于构建多芯片系统,例如可配置包括环状、网状、树状结构中的至少一种的布局结构的多芯片系统。根据本申请实施例的芯片包括能够相互通信的数据接收器、数据发送器和运算处理单元,从而能够更好地用于多芯片协同。
根据一些实施例,多个芯片构建为环形连接结构。图1-5A示出根据示例实施例的基于环状拓扑的环形连接结构,图1-5B示出根据示例实施例的在2D-MESH拓扑结构中构建的环形连接结构。
根据本申请实施例的芯片或多芯片系统可应用于各种电子设备,包括但不限于超级计算机、云服务器、智能手机、嵌入式系统等。
图1-6示出根据本申请实施例的一种用于计算节点传输数据的方法。
根据一些实施例,图1-6所示的方法可利用根据本申请实施例的芯片或多芯片系统执行,或应用于根据本申请实施例的芯片或多芯片系统,但本申请的方法不限于此。
根据一些实施例,图1-6所示的数据传输方法可用于包括多个计算节点的系统,例如,计算节点可包括根据本申请实施例的芯片。多个计算节点中的至少部分节点执行前述方法。可选地,多个计算节点构建为环形连接结构,参见例如图1-5A和1-5B所示出的。
参见图1-6,在1-S610,开始接收第一数据。
根据示例实施例,通过前述芯片的数据接收器RX接收第一数据。
在1-S620,在接收到第一数据的一部分之后,在继续接收第一数据的同时,转发第一数据的一部分。
根据示例实施例,通过前述芯片的数据发送器TX发送数据。
在1-S630,在接收到第一数据的一部分之后,在继续接收第一数据的同时,对第一数据的一部分进行处理并转发处理结果。
根据示例实施例,通过前述芯片的运算处理单元130处理数据,通过前述芯片的数据发送器TX发送数据。
下面结合图1-7A和1-7B对图1-6所示方法做进一步详细说明。
参见图1-7A,在现有传输数据的过程中,每次都从一个节点向另一个节点发送数据,然后下游节点接收全部数据之后,向其之后的节点发送数据。
参见图1-7B,在根据本申请的实施例中,为了加速传输数据的速度,可采用图1-6所示的方法对传输数据进行处理。即,每个计算节点在接受到一小部分数据之后,可以立刻向下一个节点传输数据。在这种模式下,中间节点在接收到传输的数据之后,在继续接收数据的同时进行处理和转发数据,可以显著减少通信时间。
下面对根据申请实施例的芯片和多芯片系统的一些应用进行举例说明。
图1-8示出根据示例实施例的多节点协同执行卷积运算的示意图。
参见图1-8,当多个计算节点按照数据并行的方式协同执行卷积运算的时候,其涉及到输入和输出在特征图方向的一个拆分。由于滑动窗口的存在,其所框的数据有可能会横跨多个芯片。那么,多个芯片之间,需要将所重叠的部分传输给对应的相邻节点。在一般的做法中,第一层完成之后,需要等待该层所有的计算节点全部都运算结束之后,再开始数据传输过程,传输结束之后,再启动第二层的运算过程。
根据示例实施例,可以将一层卷积首先在H和W的方向上拆分为4个部分,分散在4个计算节点上,每个计算节点负载等量的一块数据。然后,在每个计算节点的片内部,进一步切分为4个子任务,每个子任务负载相等。图1-8中深色色块为已经执行的子任务,浅色色块为等待执行的子任务。首先计算与其他计算节点相邻的子任务,在该子任务计算结束之后,启动与对应芯片相连的数据发送器,将计算得到的重叠的数据块发送给对应的计算节点。当一个计算节点的数据接收器接收到相邻计算节点传输过来的数据之后,即可通知对应的运算处理单元(深度学习处理单元),相关的后续任务具备发送条件。例如在第二步执行完之后,中间两列的子任务执行结束,并且重叠的数据传输给对应的计算节点之后,其第二层的上下两边的4个子任务所需要的所有数据即可全部准备完备,因此具备了可执行的条件。这样,对每个计算节点,在第一层的卷积计算结束之后,可以立刻开始第二层的卷积计算。
当更多计算节点协同运算执行更大量的数据时,在H和W的方向上数据拆分得更加细致之后,其每个计算节点优先执行与其他芯片相连的子任务,每执行完一个子任务,即可将重叠的数据发送给对应相邻的计算节点。这样,对于下一层的计算来说,其对应拆分出来的子任务,也会按照同样的顺序,依次处于可以发送的状态,从而保证即使两个计算节点之间的计算速率不够,执行快的计算节点仍然可以连续执行,而不需要等待执行慢的计算节点执行结束并传输数据。
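The scheduling rule described in the last two paragraphs, i.e. run the sub-tasks that border other computing nodes first and ship their overlapping (halo) data as soon as each sub-task finishes, can be sketched as follows. The tile representation and the `compute`/`send_halo` callbacks are assumptions made for the illustration; they are not part of the embodiment.

```python
def run_boundary_first(subtasks, compute, send_halo):
    """Order sub-tasks so neighbours receive their overlap data as early as possible.

    subtasks  : list of dicts like {"data": ..., "neighbors": [node_id, ...]},
                where "neighbors" lists the nodes that need this tile's overlap
    compute   : callable(data) -> (result, overlap) running the convolution
    send_halo : callable(node_id, overlap) forwarding overlap data to a neighbour
    """
    # Sub-tasks whose results other nodes depend on are executed first.
    ordered = sorted(subtasks, key=lambda t: len(t["neighbors"]), reverse=True)
    results = []
    for task in ordered:
        result, overlap = compute(task["data"])
        results.append(result)
        for neighbor in task["neighbors"]:
            # The halo leaves immediately, so the neighbour's next-layer
            # sub-tasks become ready without waiting for this whole layer.
            send_halo(neighbor, overlap)
    return results
```

With this ordering, a fast node keeps executing its interior sub-tasks while slower neighbours catch up, which is the behaviour described above for finer splits along H and W.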
图1-9示出根据示例实施例的多节点协同执行分类层运算的示意图。
参见图1-9,在处理分类层的时候,首先可以将多个输出进行分组,然后多个计算节点可以协同运算同一个输出结果。此时,数据结果等效于一次归并操作。如图1-9所示,将输出数据分为8组,以协同运算第5组数据作为示例。进一步将输入数据分为12组,放于4个计算节点之中,其相同填充形状的3组放于同一节点。即,0、4、8放于计算节点0进行运算;1、5、9放于计算节点1进行运算;2、6、10放于计算节点2进行运算;3、7、11放于计算节点3进行运算。
在计算的时候,每个计算节点首先计算自身所负载的3组输入数据,得到第5组输出数据对应的部分和。然后启动归并加和传输过程,每个计算节点将自身所得的部分和数 据与接收到的部分和数据进行加和,然后将加和的结果传递给下一个计算节点。同时,各个计算节点在传输数据的时候,可以开始计算第6组输出数据。因此,此时整个拓扑结构中同时包含着第5组部分和的相互传输过程,以及第6组部分和的计算过程。
根据示例实施例,4个计算节点可成环状连接。当做第5组运算的时候,其归并过程可为:首先计算节点1向计算节点2发送部分和;然后计算节点2将接收到的数据与本地部分和数据进行加和之后,传输给计算节点3;再后,计算节点3将接收到的数据与本地部分和数据进行加和之后,传递给计算节点0;最后,计算节点0将接收到的数据进行加和之后留存在本地。此时,如果第6组输出运算已经完成,则由于第6组输出数据存放于计算节点3,同时计算节点0和计算节点1之间的通路未被占用,则计算节点0可以直接启动归并过程,将数据发送给计算节点1。传输过程仍使用切片传输,即,每个计算节点只要接收到上一个计算节点传输过来的部分数据,即可立刻与本地的部分和数据进行加和(或其他运算),然后立刻将该部分结果传输给下游计算节点。
对于单个节点内部来说,首先是运算处理单元(例如,深度学习处理单元)执行完一次子任务之后,即可对相应的数据发送器执行bit置1操作。然后,数据接收器在接收到上游节点传输过来的数据之后,向对应的数据发送器执行bit置1操作。因此,如果对应的数据发送器通过bit监测发现运算处理单元已经完成对应的子任务运算,同时对应的数据接收器也完成了数据的接收工作,则可以从存储器中获取本地计算得到的部分和以及接收的数据,进行加和运算,然后将数据打包传输给下游计算节点。这样,根据示例实施例,可以克服通信开销过大而使得通信无法被计算完全覆盖的问题,提高运算效率。
图1-10示出根据示例实施例的多芯片异步并行协同训练的示意图。
参见图1-10,当考虑一个训练体系的多芯片进行异步协同训练的时候,其数据传输主要用来更新权值梯度数据。如图1-10所示,起始计算节点可包括参数服务节点,有填充的计算节点为分组1,无填充的计算节点为分组2。分两个分组的用意是为了能够在多个计算节点算力不匹配的时候,可以只同步部分计算节点,从而减少不同计算节点之间的等待开销。
在这种结构中,每个计算节点在完成本地batch的训练之后,将其数据保存在本地。控制节点通知起始计算节点发起加和权值梯度数据的请求。起始计算节点(参数服务节点)依据其历史状态,发送获取梯度数据的请求。在这个请求之中,不仅包含更新的代数(第几代),同时还包含有哪些节点需要进行归并操作。由于第一个计算节点本身不参与归并,因此只发送请求给下一个计算节点。第一个需要参与归并的计算节点发送其梯度数据给下一个计算节点。
当后续的计算节点接收到数据之后,如果其需要参与归并,在接收到第一个切片数据的时候,若本地第一个切片的数据也已经准备好,则立刻在本地进行加和操作,然后将该切片传输给下一个计算节点。
例如,当计算节点获取到该请求后,依据其蕴含的更新代数及本地的权值梯度数据所标识的代数,计算其差值。如果差值符合预期,同时该计算节点的权值梯度数据需要归并到这次传输之中,且本地权值梯度数据也已经准备好,则数据发送器可以启动对应的子任务。对应的数据发送器可以从DRAM存储器中获取上游计算节点传输过来的数据,以及本地计算得到的权值梯度数据,进行加和运算,得到新的权值梯度数据,然后通过SERDES,将权值梯度数据传递给下游节点。如图1-10所示,所有组2的计算节点都会在 其输出的时候,进行发送或加和操作,将本地的权值梯度数据整合在传输的数据之中。
当后续的计算节点接收到数据之后,如果其不需要参与归并,则在接收到第一个切片数据的时候,立刻将该切片传输给下一个计算节点即可。例如,所有处于组1的计算节点都会将数据直传下去,不做处理。
当最后一个计算节点收到数据之后,表示所有节点已经完成了归并操作,从而获得了最终的新的权值。此时起始计算节点(参数服务节点)开始权值广播过程。当广播权值数据的时候,所有计算节点都保存更新本地权值备份并且转发权值数据给下一个计算节点,直到最后一个计算节点。至此,完成所有传输。
例如,当起始计算节点(参数服务节点)接收到传输回来的归并之后的数据后,首先更新本地副本。然后,将更新之后的新的权值通过环状拓扑广播给所有计算节点;同时,在信息中标记标签,表示该权值数据的代数。此时,当计算节点接收到对应的权值数据之后,更新其本地的权值数据代数,然后在下次训练的时候使用新的权值数据进行训练。同时,其训练得到的权值梯度数据使用新的权值数据附带的标签标记。
根据示例实施例,控制节点只需要和起始计算节点通信即可。因此,在传输之前,不需要各个归并节点分别与控制节点通信,省去一个同步通信的开销。同时,不需要等到各个节点ready即可发起请求,由各个计算节点依据其本地执行状态,进行控制即可。此外,由于各个计算节点是异步传输过程,因此,可以在第一个分组未完成全部归并之前,即可开始第二个分组的归并过程。另外,将归并和广播过程合在一起。因此,该方案极大的缩减了整体的开销。
图1-11示出根据本申请示例实施例的电子设备的示意图。
如图1-11所示,该电子设备1100可包括中央处理器1110、加速模块1120和存储器1130。加速模块1120与中央处理器1110通信连接,并包括多个根据本申请的芯片100。存储器1130存储有计算机程序。当存储在存储器1130中的计算机程序被中央处理器1110执行时,中央处理器1110能够通过加速模块1120获得加速运算的结果。
以上对本申请实施例进行了详细描述和解释。应清楚地理解,本申请描述了如何形成和使用特定示例,但本申请不限于这些示例的任何细节。相反,基于本申请公开的内容的教导,这些原理能够应用于许多其它实施例。
通过对示例实施例的描述,本领域技术人员易于理解,根据本申请实施例的芯片和多芯片系统及电子设备和数据传输方法至少具有以下优点中的一个或多个。
根据本申请实施例的芯片包括能够相互通信的数据接收器、数据发送器和运算处理单元,从而能够更好地用于多芯片协同。
根据本申请实施例的芯片设计能够用于多芯片系统的协同计算,可以至少部分克服通信开销过大而使得通信无法被计算完全覆盖的问题,提高运算效率和硬件资源利用率。根据一些实施例,通信开销对于计算节点是透明的,几乎感知不到。
根据一些实施例,在数据发送器增加了ALU0,在运算过程中完成处理轻量级的运算操作,能够提高系统处理效率,并可加速传输过程。
根据一些实施例,利用本申请的芯片和多芯片系统,可以使得计算和传输数据流水起来,从而能够覆盖传输开销,提高运算效率和硬件资源利用率。
根据示例实施例,在芯片中增加了数据发送器、数据接收器以及运算处理单元之间相互触发协同的机制,从而利用该芯片的系统不仅可以最大限度的使得计算和通信并行, 同时可以获得极高的并行加速比。
本领域技术人员可以理解上述各模块可以按照实施例的描述分布于装置中,也可以进行相应变化位于不同于本实施例的一个或多个装置中。上述实施例的模块可以合并为一个模块,也可进一步拆分成多个子模块。
依据以下条款可更好地理解前述内容:
条款A1:一种芯片,包括数据总线以及与所述数据总线连接的存储器、数据接收器、运算处理单元、数据发送器,其中,所述数据接收器配置为接收来自外部的第一数据和头信息,将所述第一数据通过所述数据总线写入到所述存储器的对应区域,以及根据所述头信息配置对应的运算处理单元和/或数据发送器;所述运算处理单元配置为接收第一任务信息,根据所述第一任务信息执行运算处理并对所述数据发送器执行配置操作;所述数据发送器配置为获取第二任务信息以及第二数据,并基于至少部分所述第二数据向外输出第三数据。
条款A2:根据条款A1所述的芯片,还包括:配置总线,所述运算处理单元、所述数据接收器、所述数据发送器与所述配置总线连接从而通过所述配置总线相互传输配置信息。
条款A3:根据条款A1所述的芯片,其中,所述数据接收器还配置为根据所述头信息对所述第一数据进行拆解。
条款A4:根据条款A1所述的芯片,其中,所述数据接收器包括:第一串行接口;数据缓冲器,用于缓存来自所述第一串行接口的所述第一数据;解码器,用于从所述头信息解析所述第一数据的格式和存放地址,根据所述第一数据的格式切分所述第一数据,以及根据所述头信息配置所述运算处理单元和所述数据发送器的对应位;DMA单元,用于接收来自所述解码器的所述第一数据和所述存放地址,从而将所述第一数据通过所述数据总线写入到所述存储器的对应区域。
条款A5:根据条款A1所述的芯片,其中,所述数据接收器还包括:解压单元,用于对来自所述解码器的所述第一数据进行解压,并将解压后的第一数据发送给所述DMA单元。
条款A6:根据条款A1所述的芯片,所述数据发送器包括发送解码器、数据重排序缓冲器、发送缓冲器和第二串行接口,其中,所述发送解码器配置为:将所述第二任务信息打包第二为头信息并将所述第二头信息发送至所述发送缓冲器,以及根据所述第二任务信息向所述数据重排序缓冲器发送数据读取请求信息;所述数据重排序缓冲器配置为根据所述数据读取请求信息通过所述数据总线获取并发送所述第二数据,所述第二数据包括至少部分所述第一数据和/或所述运算处理结果;所述发送缓冲器配置为对接收的数据进行缓存,并按照所述第二串行接口的格式发送缓存的数据。
条款A7:根据条款A6所述的芯片,其中,所述发送缓冲器配置为接收所述第二头信息以及接收并缓存所述第二数据,以及按照所述第二串行接口的格式发送所述第三数据,所述第三数据包括所述第二数据;第二串行接口配置为接收并发送所述第三数据。
条款A8:根据条款A6所述的芯片,其中所述数据发送器还包括算术逻辑单元,其中,所述算术逻辑单元配置为对至少部分所述第二数据进行运算,并将所得到的运算结果和/或所述第二数据的部分或全部作为第四数据发送给所述发送缓冲器;其中,所述发送缓冲器配置为接收所述第二头信息以及接收并缓存来自所述算术逻辑单元的所述第四数 据,以及按照所述第二串行接口的格式发送所述第三数据,所述第三数据包括所述第四数据;第二串行接口配置为接收并发送所述第三数据。
条款A9:根据条款A6所述的芯片,其中所述数据发送器还包括压缩单元,其中,所述压缩单元配置为将所述第二数据压缩为第四数据并发送给所述发送缓冲器;其中,所述发送缓冲器配置为接收所述第二头信息以及接收并缓存来自所述压缩单元的第四数据,按照所述第二串行接口的格式发送所述第三数据,所述第三数据包括所述第四数据;其中,所述第二串行接口配置为接收并发送所述第三数据。
条款A10:根据条款A1所述的芯片,其中还包括设置在所述数据总线与所述运算处理单元或所述数据发送器之间的归并模块,所述归并模块包括归并模式单元、任务预取单元和任务发送单元,其中,所述归并模式单元接收并存储其他运算处理单元和/或数据发送器的执行信息;其中,所述任务预取单元配置为根据软件配置的寄存器信息从所述存储器获取所述第一任务信息,根据所述第一任务信息对所述执行信息进行处理并根据处理结果确定并发送配置信息和/或所述第二任务信息;其中,所述任务发送单元配置为从所述任务预取单元接收所述第二任务信息并发送给其他运算处理单元和/或者数据发送器。
条款A11:根据条款A10所述的芯片,其中所述任务预取单元还配置为根据所述第一任务信息将相应任务拆解为多个传输子任务,并根据所述执行信息发送多个传输子任务的所述第二任务信息给所述任务发送单元。
条款A12:根据条款A10所述的芯片,其中所述任务发送单元还配置为监听所述运算处理单元或所述数据发送器的状态,并根据所述运算处理单元或所述数据发送器的执行结束状态向其他运算处理单元和/或数据发送器发送配置信息。
条款A13:根据条款A1所述的芯片,其中,所述数据总线包括NOC。
条款A14:根据条款A1所述的芯片,其中,所述芯片为人工智能芯片,所述运算处理单元为人工智能处理单元或机器学习处理单元。
条款A15:根据条款A1所述的芯片,其中,所述数据接收器、所述数据发送器和所述运算处理单元通过所述数据总线相互传输数据以及访问所述存储器。
条款A16:根据条款A2所述的芯片,其中,所述数据接收器、所述数据发送器和所述运算处理单元通过所述数据总线相互传输数据以及访问所述存储器;所述运算处理单元、所述数据接收器、所述数据发送器通过所述配置总线相互传输配置信息。
条款A17:一种多芯片系统,包括多个根据条款A1-A16中任一项所述的芯片。
条款A18:根据条款A17所述的多芯片系统,其中所述多个芯片配置为包括环状、网状、树状结构中的至少一种的布局结构。
条款A19:根据条款A18所述的多芯片系统,其中所述多个芯片构建为环形连接结构。
条款A20:一种电子设备,包括根据条款A1-A16中任一项所述的芯片或根据条款A17-A19中任一项所述的多芯片系统。
条款A21:一种用于计算节点传输数据的方法,包括:开始接收第一数据;在接收到所述第一数据的一部分之后,在继续接收所述第一数据的同时,转发所述第一数据的所述一部分;和/或在接收到所述第一数据的一部分之后,在继续接收所述第一数据的同时,对所述第一数据的所述一部分进行处理并转发处理结果。
条款A22:一种数据传输方法,包括:利用根据条款A1-A16中任一项所述的芯片执 行根据条款A21所述的用于计算节点传输数据的方法。
条款A23:一种数据传输方法,用于包括多个计算节点的系统,其中所述多个计算节点中的至少部分节点执行根据条款A21或A22所述的方法。
条款A24:根据条款A23所述的数据传输方法,其中所述多个计算节点构建为环形连接结构。
以上具体地示出和描述了本申请的示例性实施例。应可理解的是,本申请不限于这里描述的详细结构、设置方式或实现方法;相反,本申请意图涵盖包含在所附权利要求的精神和范围内的各种修改和等效设置。201910819946.3
201910819940.6
本申请涉及信息处理技术领域,具体涉及一种神经网络卷积运算方法、装置以及相关产品。
目前,人工神经网络是所有智能方法中最常见的计算模型之一。在进行神经网络各个网络层的运算过程中以及神经网络训练的过程中,存在数据通信的通信时间以及处理数据的计算时间。
然而,现有技术中还没有有效减少通信时间,使得数据通信的时间被数据计算的时间覆盖的方案。为了改进性能,有必要采用各种手段来改进神经网络中的网络层运算。
为了解决上述的问题,下面将结合本披露实施例中的附图,对本披露实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本披露一部分实施例,而不是全部的实施例。基于本披露中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本披露保护的范围。
应当理解,本披露的权利要求、说明书及附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本披露的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。
参阅图2-1,图2-1提供了一种神经网络构架示意图,如图2-1所示,神经网络构架可以包括多层结构,该多层结构如图2-1所示,可以包括:输入层、卷积层1、batchnorm层、卷积层2、中间层(依据不同功能的神经网络构架具有不同的中间层,该中间层可以为至少一层)、卷积层n、全连接层、激活(例如激活函数:softmax)层。神经网络构架,对于计算量较大的层可以称为计算层,例如卷积层、全连接层等等,当然在实际应用中, 上述计算层还可以包含其他类型的层,另外,本申请提供的图2-1中的神经网络构架仅仅是为了举例说明,本申请中的神经网络并不局限如图2-1所示的构架。
图2-2提供了根据本申请一个实施例的多核系统的示意图。如图2-2所示,该核系统可以为一个神经网络芯片。该多核系统包括16个核(CORE)和4个存储节点,16个核通过一个环状的NOC与4个存储节点DRAM相连。需要注意的是,该多核系统的核可以为神经网络芯片中的计算核,存储节点的类型可以是任意类型的存储器,例如,动态随机存取存储器(Dynamic Random Access Memory,DRAM)、静态随机存取存储器(Static Random Access Memory,SRAM)等。
根据图2-2所示的示例,多核系统为16个核以及4的存储节点。然而,可以理解的是,多核系统可以包括任意数量的核以及任意数量的存储节点,这些都属于本申请覆盖的范围。
图2-3提供了根据本申请一个实施例的卷积算法的示意图。在深度学习所处理的数据量较大时,考虑多个芯片或多个计算核协同处理的总体数据。
首先,将总体数据分配给各个计算节点,需要根据计算节点的数量,确定每个计算节点的输入数据。在一个实施例中,如果人工智能处理器中计算节点的数量为N个,对于N个计算节点,可将总体数据分为N个部分的数据,这N个部分的数据分别作为N个计算节点的输入数据。在另一个可选的实施例中,对于N个计算节点,可将总体数据分为N的倍数(例如2N、3N等)个部分的数据。在又一个实施例中,对于N个计算节点,还可将总体数据分为小于N个部分的数据。在又一个可选实施例中,对于N个计算节点,还可将总体数据分为任意数个部分的数据。在一个实施例中,进行神经网络并行运算时,也可以每个计算节点存储全部权值,对输入神经元进行拆分。
在图2-3所示的实施例中,存在4个计算节点,计算节点1、计算节点2、计算节点3和计算节点4,分别分布在左上角、右上角、左下角和右下角。将需要处理的总体数据分成4个输入数据,每一个输入数据分配给一个计算节点处理。虽然采用了4个计算节点,本领域技术人员能够理解的是,根据实际应用的需要,计算节点的数目可以是任意的。另外,需要注意的是,计算节点1、计算节点2、计算节点3和计算节点4包括神经网络芯片,和/或神经网络芯片中的计算核。而且,计算节点1、计算节点2、计算节点3和计算节点4之间可以采用任意的拓扑结构,譬如环状、网状、树状,或者其他包括环状的结构。
其次,对于输入数据的拆分,可以根据负载均衡的原则将输入数据拆分为多组待运算数据,也可以将所述输入数据沿着高度方向和/或宽度方向拆分为多组待运算数据。当然,对于输入数据还可以存在其他拆分方式,这些都属于本申请覆盖的范围。
上述对输入数据的拆分,可以是计算节点获取输入数据后进行的拆分,也可以是将输入数据拆分为多组待运算数据后,计算节点接收拆分好的多组待运算数据。
如图2-3所示,将计算节点1、计算节点2、计算节点3和计算节点4中每个计算节点输入数据拆分为4组待运算数据,即第1组待运算数据,第2组待运算数据、第3组待运算数据和第4组待运算数据。然后,计算节点对这4组待运算数据分别执行卷积运算。
如图2-3所示,在执行卷积运算的过程中,由于滑动窗口的存在,滑动窗口框住的数据有可能跨越多个计算节点,那么,需要将重叠的部分传输给对应的计算节点,例如,需要将计算节点1的斜线部分表示的计算结果发送给计算节点2。在一个优选的实施例中,计算节点1在执行卷积运算并得到运算结果的过程中,将运算结果发送至计算节点2。这 样,一边计算一边发送运算结果,而不是待计算完成后再发送计算结果,从而减少通信时间。而且,依靠该运算结果执行运算的其他计算节点,在收到该运算结果后,能够更快启动相应的运算。
如图2-3所示,将为其他计算节点执行后续卷积层的卷积运算时所使用的数据称为重叠数据,例如,如图2-3中,计算节点1的运算结果中用斜线表示的部分为计算节点2执行后续卷积层的卷积运算时所使用的数据,那么,在一个优选的实施例中,计算节点1将针对第2组待运算数据的运算结果发送至计算节点2的过程中发送重叠数据,即可以发送斜线表示的部分。再如,如图2-3中,计算节点1的针对第4组待运算数据的计算结果中斜线表示的部分为计算节点2执行后续卷积层的卷积运算时所使用的数据,竖线表示的部分为计算节点3执行后续卷积层的卷积运算时所使用的数据,而阴影表示的部分为计算节点2、计算节点3和计算节点4执行后续卷积层的卷积运算时所使用的数据,那么,计算节点1将斜线表示的部分发送至计算节点2,将竖线表示的部分发送至计算节点3,将将阴影表示的部分发送至计算节点2、计算节点3和计算节点4中的每一者。
另外,对于4组待运算数据可以采用任意顺序。在一个优选的实施例中,计算节点优先执行运算结果被其他计算节点使用的一组待运算数据的卷积运算。图2-4提供了根据本申请另一个实施例的卷积算法的示意图。如图2-4所示,执行顺序为实线箭头、虚线箭头、点箭头和点线箭头,箭头上的数字表示第几组待运算数据,例如,1表示第1组待运算数据。
从而,对于计算节点1来说,执行顺序为:第2组待运算数据,第3组待运算数据、第4组待运算数据和第1组待运算数据。对于计算节点2来说,执行顺序为:第1组待运算数据,第3组待运算数据、第4组待运算数据和第2组待运算数据。对于计算节点3来说,执行顺序为:第4组待运算数据,第2组待运算数据、第1组待运算数据和第3组待运算数据。对于计算节点4来说,执行顺序为:第3组待运算数据,第1组待运算数据、第2组待运算数据和第4组待运算数据。
图2-3和图2-4只是举出了多组待运算数据的执行顺序的一种实现方式。本领域技术人员在上述实施例的启示下可以想到的其他所有的多组待运算数据的执行顺序,都属于本申请覆盖的范围。
这样,将每个计算节点运算的数据分为多组待运算数据,优先执行运算结果被其他计算节点使用的一组待运算数据的卷积运算,计算节点就能更快获得卷积运算所需的数据,而无需等待多组待运算数据全部计算完毕。
在另一个实施例中,每个计算节点在完成各待运算数据的卷积运算后,执行所对应的后续的神经网络层的运算。对于图2-3所示的实施例来说,一个计算节点在完成4组待运算数据的卷积运算后,无需等待其他计算节点的各自的卷积运算是否完成,即可执行后续的神经网络层的运算。后续的神经网络层的运算可能是卷积运算,也可能是池化层的运算、分类层的运算等其他网络层的运算。
从而,每个计算节点在计算完自己的卷积运算后,即可执行后续的神经网络层的运算,而无须等待计算最慢的计算节点计算完成,从而提高了运算效率。
根据图2-3和图2-4的实施例,表2-1展示了计算节点1、计算节点2、计算节点3和计算节点4执行卷积运算的过程,在表2-1中以计算节点1、计算节点2、计算节点3和计算节点4协同执行两层卷积运算为例,其中计算节点1、计算节点2、计算节点3和 计算节点4的拓扑结构如图2-5所示,计算节点1、计算节点2、计算节点3和计算节点4之间能够相互发送和接收数据。
需要注意的是,表2-1所示的协同执行两层卷积运算以及图2-5所示的计算节点之间的拓扑结构只是一种具体的实现方式,本领域技术人员在上述拓扑结构以及协同执行两层卷积运算的启发下,能够想到的其他拓扑结构以及协同执行卷积运算的方式,都属于本申请覆盖的范围。
表2-1
对于计算节点1来说,执行第二层卷积的第2组待运算数据(ID10),需要获取计算节点2执行第一层卷积的第1组待运算数据的运算结果(ID2),这两个执行动作之间相差了3组待运算数据的计算;计算节点1执行第二层卷积的第4组待运算数据(ID11),需要获取计算节点2执行第一层卷积的第3组待运算数据的运算结果(ID4)、计算节点3执行第一层卷积的第2组待运算数据的运算结果(ID5)以及计算节点4执行第一层卷积的第1组待运算数据的运算结果(ID6),这些执行动作之间相差了3组待运算数据的计算;计算节点1执行第二层卷积的第3组待运算数据(ID12),需要获取计算节点3执行第一层卷积的第1组待运算数据的运算结果(ID8),这两个执行动作之间相差了3组待运算数据的计算。
因此,对于计算节点1来说,其无需等待计算节点2、计算节点3和计算节点4,即使其执行速度比计算节点2、计算节点3和计算节点4快3个组待运算数据的计算,也无需降低其执行速度。
基于上述实施例,本申请提出一种卷积运算方法。如图2-6A至图2-6G所示,所述卷积运算方法包括:步骤2-S601,根据目标数据执行卷积运算,得到运算结果,所述目标数据为多组待运算数据中的任一组。
如图2-3所示,计算节点1所要执行的卷积运算包括4组待运算数据,计算节点1可以按照预定顺序,对任意一组待运算数据,例如第2组待运算数据,执行卷积运算,得到运算结果。
步骤2-S602,在对所述目标数据执行卷积运算并得到运算结果的过程中,在确定所述运算结果被其他计算节点使用的情况下,将所述运算结果发送至对应的所述其他计算节点。
如图2-3所述,例如,计算节点1对第2组待运算数据进行卷积运算的过程中,所得到的运算结果被计算节点2使用。那么,在确定所述运算结果被计算节点2使用的情况下,将所述运算结果发送至对应的计算节点2。再如,计算节点1对第3组待运算数据进行卷积运算的过程中,所得到的运算结果被计算节点2、计算节点3和计算节点4使用。那么,在确定所述运算结果被计算节点2使用的情况下,将所述运算结果发送至对应的计算节点2、计算节点3和计算节点4。
按照上述方法,一边计算一边发送计算结果,而不是待计算完成后再发送计算结果,从而减少通信时间。而且,依靠该运算结果执行运算的其他计算节点,在收到该运算结果后,能够更快启动相应的运算。
更为具体地,步骤2-S602包括如下子步骤:步骤2-S6021,在所述运算结果中确定重叠数据,所述重叠数据为所述其他计算节点执行后续卷积层的卷积运算时所使用的数据。
如图2-3所示,计算节点1在对第2组待运算数据进行卷积运算后,运算结果包括了为计算节点2执行后续卷积层的卷积运算时所使用的数据,即重叠数据(在图2-3中斜线用表示)。
步骤2-S6022,将所述重叠数据发送至对应的所述其他计算节点。
那么,对于该重叠数据,计算节点1需要将其发送给计算节点2。
更为具体地,步骤2-S6022包括如下子步骤:将所述重叠数据发送至对应的一个或 多个所述其他计算节点。
如上所述,在图2-3中,计算节点1在对第4组待运算数据进行卷积运算后,运算结果包括了为计算节点2、计算节点3和计算节点4执行后续卷积层的卷积运算时所使用的数据,即重叠数据(在图2-3中分别用斜线、竖线和阴影表示)。
那么,对于用斜线、竖线和阴影表示的数据部分,计算节点1需要相应地将其发送给计算节点2、计算节点3和计算节点4中的每一者。
更为具体地,步骤2-S601包括如下子步骤:步骤2-S6011,优先执行所述运算结果被所述其他计算节点使用的所述目标数据的卷积运算。
如图2-3所述,计算节点优先执行运算结果被其他计算节点使用的一组待运算数据的卷积运算。例如,对于计算节点1来说,执行顺序为:第2组待运算数据,第3组待运算数据、第4组待运算数据和第1组待运算数据。
这样,将每个计算节点运算的数据分为多组待运算数据,优先执行运算结果被其他计算节点使用的一组待运算数据的卷积运算,计算节点就能更快获得卷积运算所需的数据,而无需等待多组待运算数据全部计算完毕。
在进一步的实施例中,所述卷积运算方法还包括:步骤2-S603,根据人工智能处理器中计算节点的数量,确定每个计算节点的待运算数据和/或所述输入数据。
如上所述,如果人工智能处理器中计算节点的数量为N个,对于N个计算节点,可将总体数据分为N个部分的数据,这N个部分的数据分别作为N个计算节点的输入数据。在另一个可选的实施例中,对于N个计算节点,可将总体数据分为N的倍数(例如2N、3N等)个部分的数据。在又一个实施例中,对于N个计算节点,还可将总体数据分为小于N个部分的数据。在又一个可选实施例中,对于N个计算节点,还可将总体数据分为任意数个部分的数据。
在进一步的实施例中,所述卷积运算方法还包括:步骤2-S604,将输入数据拆分为多组待运算数据。
如上所述,对于输入数据的拆分,可以根据负载均衡的原则将输入数据拆分为多组待运算数据,也可以将所述输入数据沿着高度方向和/或宽度方向拆分为多组待运算数据。当然,对于输入数据还可以存在其他拆分方式,这些都属于本申请覆盖的范围。
上述对输入数据的拆分,可以是计算节点获取输入数据后进行的拆分,也可以是将输入数据拆分为多组待运算数据后,计算节点接收拆分好的多组待运算数据。
如图2-3所示,将每个计算节点的输入数据拆分为4组待运算数据,即第1组待运算数据,第2组待运算数据、第3组待运算数据和第4组待运算数据。然后,计算节点对这4组待运算数据分别执行卷积运算。
在进一步的实施例中,所述卷积运算方法还包括:步骤2-S605,接收所述多组待运算数据。
根据一个实施例中,在计算节点获取输入数据之前,输入数据已经被拆分为多组待运算数据。计算节点接收拆分好的多组待运算数据。
在进一步的实施例中,所述卷积运算方法还包括:步骤2-S606,在完成各待运算数据的卷积运算后,执行所对应的后续的神经网络层的运算。
如上所述,图2-3所示的一个计算节点在完成4组待运算数据的卷积运算后,无需等待其他计算节点的各自的卷积运算是否完成,即可执行后续的神经网络层的运算。后续 的神经网络层的运算可能是卷积运算,也可能是池化层的运算、分类层的运算等其他网络层的运算。
这样,每个计算节点在计算完自己的卷积运算后,即可执行后续的神经网络层的运算,而无须等待计算最慢的计算节点计算完成,从而提高了运算效率。
在进一步的实施例中,所述卷积运算方法还包括:步骤2-S607,当待运算数据包括接收其他计算节点的运算结果的情况下,确定是否已经完成对其他计算节点的运算结果的接收。
如图2-3所示,对于计算节点1执行对第2组待运算数据运算,需要获取计算节点2对其第1组待运算数据的运算结果。那么,计算节点1在执行对第2组待运算数据运算之前,需要确定是否已经完成对计算节点2对其第1组待运算数据的运算结果的接收。
步骤2-S608,在确定完成对所述其他计算节点的运算结果的接收的情况下,根据所述目标数据执行卷积运算。
计算节点1确定已经完成对计算节点2对其第1组待运算数据的运算结果的接收,就可以执行对其第2组待运算数据运算。
以上主要针对计算节点1所执行的动作做了相关描述,本领域技术人员需要注意的是,以上针对计算节点1所执行的动作所作的描述同样也适用于计算节点2、计算节点3和计算节点4。并且,虽然采用了4个计算节点,本领域技术人员能够理解的是,根据实际应用的需要,计算节点的数目可以是任意的。
另外,需要注意的是,计算节点1、计算节点2、计算节点3和计算节点4包括神经网络芯片,和/或神经网络芯片中的计算核。而且,计算节点1、计算节点2、计算节点3和计算节点4之间可以采用任意的拓扑结构,譬如环状、网状、树状,或者其他包括环状的结构。
根据上述卷积运算方法,通过在执行卷积运算并得到运算结果的过程中,将运算结果发送至对应的需要使用该计算结果的其他计算节点,一边计算一边发送计算结果,而不是待计算完成后再发送计算结果,从而减少通信时间;并且,将每个计算节点运算的数据分为多组待运算数据,优先执行运算结果被其他计算节点使用的一组待运算数据的卷积运算,这样计算节点就能更快获得卷积运算所需的数据,而无需等待多组待运算数据全部计算完毕;此外,每个计算节点在计算完自己的卷积运算后,即可执行后续的神经网络层的运算,而无须等待计算最慢的计算节点计算完成,从而提高了运算效率。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本披露并不受所描述的动作顺序的限制,因为依据本披露,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本披露所必须的。
进一步需要说明的是,虽然图2-6A至图2-6G的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图2-6A至图2-6G中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段 的至少一部分轮流或者交替地执行。
根据另一个实施例,本发明还提供一种神经网络卷积运算装置。如图2-7A至图2-7G所示,该神经网络卷积运算装置包括:
第一执行单元2-701,用于根据目标数据执行卷积运算,得到运算结果,所述目标数据为多组待运算数据中的任一组。
更为具体地,第一执行单元2-701用于:优先执行所述运算结果被所述其他计算节点使用的所述目标数据的卷积运算。
发送单元2-702,用于在对所述目标数据执行卷积运算并得到运算结果的过程中,在确定所述运算结果被其他计算节点使用的情况下,将所述运算结果发送至对应的所述其他计算节点。
更为具体地,发送单元2-702用于:在所述运算结果中确定重叠数据,所述重叠数据为所述其他计算节点执行后续卷积层的卷积运算时所使用的数据;将所述重叠数据发送至对应的所述其他计算节点。
更为具体地,发送单元2-702用于:将所述重叠数据发送至对应的一个或多个所述其他计算节点。
在进一步的实施例中,所述卷积运算装置还包括:第一确定单元2-703,用于根据人工智能处理器中计算节点的数量,确定每个计算节点的待运算数据和/或所述输入数据。
在进一步的实施例中,所述卷积运算装置还包括:拆分单元2-704,用于将输入数据拆分为多组待运算数据。
如上所述,对于输入数据的拆分,可以根据负载均衡的原则将输入数据拆分为多组待运算数据,也可以将所述输入数据沿着高度方向和/或宽度方向拆分为多组待运算数据。当然,对于输入数据还可以存在其他拆分方式,这些都属于本申请覆盖的范围。
上述对输入数据的拆分,可以是计算节点获取输入数据后进行的拆分,也可以是将输入数据拆分为多组待运算数据后,计算节点接收拆分好的多组待运算数据。
在进一步的实施例中,所述卷积运算装置还包括:接收单元2-705,用于接收所述多组待运算数据。
根据一个实施例中,在计算节点获取输入数据之前,输入数据已经被拆分为多组待运算数据。计算节点接收拆分好的多组待运算数据。
在进一步的实施例中,所述卷积运算装置还包括:第二执行单元2-706,用于在完成各待运算数据的卷积运算后,执行所对应的后续的神经网络层的运算。
在进一步的实施例中,所述卷积运算装置还包括:第二确定单元2-707,用于当待运算数据包括接收其他计算节点的运算结果的情况下,确定是否已经完成对其他计算节点的运算结果的接收;以及第三执行单元2-708,用于在确定完成对所述其他计算节点的运算结果的接收的情况下,根据所述目标数据执行卷积运算。
根据上述卷积运算装置,通过在执行卷积运算并得到运算结果的过程中,将运算结果发送至对应的需要使用该计算结果的其他计算节点,一边计算一边发送计算结果,而不是待计算完成后再发送计算结果,从而减少通信时间;并且,将每个计算节点运算的数据分为多组待运算数据,优先执行运算结果被其他计算节点使用的一组待运算数据的卷积运算,这样计算节点就能更快获得卷积运算所需的数据,而无需等待多组待运算数据全部计算完毕;此外,每个计算节点在计算完自己的卷积运算后,即可执行后续的神经网络层的 运算,而无须等待计算最慢的计算节点计算完成,从而提高了运算效率。
参阅图2-8,图2-8提供一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如图2-6A至图2-6G所示的方法以及细化方案。
应该理解,上述的装置实施例仅是示意性的,本披露的装置还可通过其它的方式实现。例如,上述实施例中所述单元/模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。例如,多个单元、模块或组件可以结合,或者可以集成到另一个系统,或一些特征可以忽略或不执行。
另外,若无特别说明,在本披露各个实施例中的各功能单元/模块可以集成在一个单元/模块中,也可以是各个单元/模块单独物理存在,也可以两个或两个以上单元/模块集成在一起。上述集成的单元/模块既可以采用硬件的形式实现,也可以采用软件程序模块的形式实现。
所述集成的单元/模块如果以硬件的形式实现时,该硬件可以是数字电路,模拟电路等等。硬件结构的物理实现包括但不局限于晶体管,忆阻器等等。若无特别说明,所述处理器或芯片可以是任何适当的硬件处理器,比如CPU、GPU、FPGA、DSP和ASIC等等。若无特别说明,所述片上缓存、片外内存、存储器可以是任何适当的磁存储介质或者磁光存储介质,比如,阻变式存储器RRAM(Resistive Random Access Memory)、动态随机存取存储器DRAM(Dynamic Random Access Memory)、静态随机存取存储器SRAM(Static Random-Access Memory)、增强动态随机存取存储器EDRAM(Enhanced Dynamic Random Access Memory)、高带宽内存HBM(High-Bandwidth Memory)、混合存储立方HMC(Hybrid Memory Cube)等等。
所述集成的单元/模块如果以软件程序模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,本披露的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本披露各个实施例所述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
本申请实施例还提供一种计算机可读存储介质,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如图2-6A至图2-6G所示的方法以及细化方案。
本申请实施例还提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如图2-6A至图2-6G所示的方法以及细化方案。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。上述实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
依据以下条款可更好地理解前述内容:
条款B1,一种卷积运算方法,其特征在于,所述方法应用于包括多个计算节点的人工智能处理器,对于任一计算节点,所述方法包括:根据目标数据执行卷积运算,得到运算结果,所述目标数据为多组待运算数据中的任一组;在对所述目标数据执行卷积运算并得到运算结果的过程中,在确定所述运算结果被其他计算节点使用的情况下,将所述运算结果发送至对应的所述其他计算节点。
条款B2,如条款B1所述的方法,其特征在于,所述将所述运算结果发送至对应的所述其他计算节点,包括:在所述运算结果中确定重叠数据,所述重叠数据为所述其他计算节点执行后续卷积层的卷积运算时所使用的数据;将所述重叠数据发送至对应的所述其他计算节点。
条款B3,如条款B2所述的方法,其特征在于,所述将所述运算结果发送至对应的所述其他计算节点,包括:将所述重叠数据发送至对应的一个或多个所述其他计算节点。
条款B4,如条款B1所述的方法,其特征在于,所述根据目标数据执行卷积运算,得到运算结果,包括:优先执行所述运算结果被所述其他计算节点使用的所述目标数据的卷积运算。
条款B5,如条款B1所述的方法,其特征在于,所述方法还包括:将输入数据拆分为所述多组待运算数据。
条款B6,如条款B5所述的方法,其特征在于,所述将输入数据拆分为所述多组待运算数据,包括:将所述输入数据根据负载均衡的原则拆分为所述多组待运算数据。
条款B7，如条款B5所述的方法，其特征在于，所述将输入数据拆分为所述多组待运算数据，包括：将所述输入数据沿着高度方向和/或宽度方向拆分为所述多组待运算数据。
条款B8,如条款B5所述的方法,其特征在于,所述方法还包括:接收所述多组待运算数据。
条款B9,如条款B1或B5所述的方法,其特征在于,所述方法还包括:根据所述人工智能处理器中所述计算节点的数量,确定所述待运算数据和/或所述输入数据。
条款B10,如条款B1所述的方法,其特征在于,所述方法还包括:在完成各待运算数据的卷积运算后,执行所对应的后续的神经网络层的运算。
条款B11,如条款B1所述的方法,其特征在于,所述方法还包括:当所述待运算数据包括接收其他计算节点的运算结果的情况下,确定是否已经完成对所述其他计算节点的运算结果的接收;在确定完成对所述其他计算节点的运算结果的接收的情况下,根据所述目标数据执行卷积运算。
条款B12,如条款B1至B11任一项所述的方法,其特征在于,所述多个计算节点所形成的拓扑结构包括环状、网状、树状,或者其他包括环状的结构。
条款B13,如条款B1至B12任一项所述的方法,其特征在于,所述计算节点包括:神经网络芯片,和/或所述神经网络芯片中的计算核。
条款B14,一种卷积运算装置,其特征在于,所述装置应用于包括多个计算节点的人工智能处理器,对于任一计算节点,所述装置包括:第一执行单元,用于根据目标数据执行卷积运算,得到运算结果,所述目标数据为多组待运算数据中的任一组;发送单元,用于在对所述目标数据执行卷积运算并得到运算结果的过程中,在确定所述运算结果被其他计算节点使用的情况下,将所述运算结果发送至对应的所述其他计算节点。
条款B15,如条款B14所述的装置,其特征在于,所述第一执行单元用于:在所述运算结果中确定重叠数据,所述重叠数据为所述其他计算节点执行后续卷积层的卷积运算时所使用的数据;将所述重叠数据发送至对应的所述其他计算节点。
条款B16,如条款B14所述的装置,其特征在于,所述发送单元用于:将所述重叠数据发送至对应的一个或多个所述其他计算节点。
条款B17,如条款B14所述的装置,其特征在于,所述第一执行单元用于:优先执行所述运算结果被所述其他计算节点使用的所述目标数据的卷积运算。
条款B18,如条款B14所述的装置,其特征在于,所述装置还包括:拆分单元,用于将输入数据拆分为所述多组待运算数据。
条款B19,如条款B18所述的装置,其特征在于,所述拆分单元用于:将所述输入数据根据负载均衡的原则拆分为所述多组待运算数据。
条款B20,如条款B18所述的装置,其特征在于,所述拆分单元用于:将所述输入数据沿着高度方向和/或宽度方向拆分为所述多组待运算数据。
条款B21,如条款B14所述的装置,其特征在于,所述装置还包括:接收单元,用于接收所述多组待运算数据。
条款B22,如条款B14或B18所述的装置,其特征在于,所述装置还包括:第一确定单元,用于根据所述人工智能处理器中所述计算节点的数量,确定所述待运算数据和/或所述输入数据。
条款B23,如条款B14所述的装置,其特征在于,所述装置还包括:第二执行单元,用于在完成各待运算数据的卷积运算后,执行所对应的后续的神经网络层的运算。
条款B24,如条款B14所述的装置,其特征在于,所述装置还包括:第二确定单元,用于当所述待运算数据包括接收其他计算节点的运算结果的情况下,确定是否已经完成对所述其他计算节点的运算结果的接收;第三执行单元,用于在确定完成对所述其他计算节点的运算结果的接收的情况下,根据所述目标数据执行卷积运算。
条款B25,如条款B14至B24任一项所述的装置,其特征在于,所述多个计算节点所形成的拓扑结构包括环状、网状、树状,或者其他包括环状的结构。
条款B26,如条款B14至B25任一项所述的装置,其特征在于,所述计算节点包括:神经网络芯片,和/或所述神经网络芯片中的计算核。
条款B27,一种电子设备,其特征在于,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如条款B1-B13任一所述的方法。
条款B28,一种计算机可读存储介质,其特征在于,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如条款B1-B13任一项所述的方法。
条款B29,一种计算机程序产品,其特征在于,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如条款B1-B13任一项所述的方法。
以上对本披露实施例进行了详细介绍,本文中应用了具体个例对本披露的原理及实施方式进行了阐述,以上实施例的说明仅用于帮助理解本披露的方法及其核心思想。同时,本领域技术人员依据本披露的思想,基于本披露的具体实施方式及应用范围上做出的改变或变形之处,都属于本披露保护的范围。综上所述,本说明书内容不应理解为对本披露的 限制。201910819940.6
201910819939.3
本申请涉及信息处理技术领域,具体涉及一种神经网络全连接层运算方法、装置以及相关产品。
目前,人工神经网络是所有智能方法中最常见的计算模型之一。在进行神经网络各个网络层的运算过程中以及神经网络训练的过程中,存在数据通信的通信时间以及处理数据的计算时间。
然而,现有技术中还没有有效减少通信时间,使得数据通信的时间被数据计算的时间覆盖的方案。为了改进性能,有必要采用各种手段来改进神经网络中的网络层运算。
为了解决上述问题,我们提出如下方案。
下面将结合本披露实施例中的附图,对本披露实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本披露一部分实施例,而不是全部的实施例。基于本披露中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本披露保护的范围。
应当理解,本披露的权利要求、说明书及附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本披露的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。
参阅图3-1,图3-1提供了一种神经网络构架示意图,如图3-1所示,神经网络构架可以包括多层结构,该多层结构如图3-1所示,可以包括:输入层、卷积层1、batchnorm层、卷积层2、中间层(依据不同功能的神经网络构架具有不同的中间层,该中间层可以为至少一层)、卷积层n、全连接层、激活(例如激活函数:softmax)层。对于神经网络构架,计算量较大的层可以称为计算层,例如卷积层、全连接层等等,当然在实际应用中,上述计算层还可以包含其他类型的层,另外,本申请提供的图3-1中的神经网络构架仅仅是为了举例说明,本申请中的神经网络并不局限如图3-1所示的构架。
图3-2提供了根据本申请一个实施例的多核系统的示意图。如图3-2所示,该核系统可以为一个神经网络芯片。该多核系统包括16个核(CORE)及4个存储节点,16个核通过一个环状的NOC与4个存储节点DRAM相连。需要注意的是,该多核系统的核可以 为神经网络芯片中的计算核,存储节点的类型可以是任意类型的存储器,例如,动态随机存取存储器(Dynamic Random Access Memory,DRAM)、静态随机存取存储器(Static Random Access Memory,SRAM)等。
根据图3-2所示的示例,多核系统为16个核以及4的存储节点。然而,可以理解的是,多核系统可以包括任意数量的核以及任意数量的存储节点,这些都属于本申请覆盖的范围。
图3-3提供了根据本申请一个实施例的全连接层算法的示意图。在深度学习所处理的数据量较大时,考虑多个芯片或多个计算核协同处理的总体数据。
首先,将总的输出数据分为多个输出,在图3-3中显示为8个输出,即第1输出至第8输出。对于每一个输出来说,多个计算节点协同计算该输出的值。如图3-3所示,4个计算节点(计算节点1、计算节点2、计算节点3和计算节点4)协同计算第5输出的值。
需要注意的是,图3-3所示的输出的数量以及计算节点的数量是为了便于说明而举出的一个具体示例,在该实施例的启发下,本领域技术人员可以想到其他的输出数量以及计算节点数量,都属于本申请覆盖的范围。并且,图3-3所示为4个计算节点针对第5个输出协同全连接层运算,本领域技术人员能够理解,4个计算节点也能够针对其他输出协同全连接层运算。
其次,对于输入数据来说,将输入数据分为多个组。如图3-3所示,将输入数据分为12个组,其中,将第1、5、9组分配给计算节点1,将第2、6、10组分配给计算节点2,将第3、7、11组分配给计算节点3,并将第4、8、12组分配给计算节点4。
需要注意的是,图3-3所示的输入数据的分组以及将输入数据分配给各个计算节点的方式为了便于说明而举出的一个具体示例,本申请并不限制输入数据的分组以及将输入数据分配给各个计算节点的方式。在图3-3所示实施例的启发下,本领域技术人员可以想到其他的输入数据的分组以及将输入数据分配给各个计算节点的方式,都属于本申请覆盖的范围。例如,可以将将输入数据分为20个组,将连续5个输入分组分配给一个计算节点。又如,不是将输入数据分组平均分配给多个计算节点,即每个计算节点所分配到的输入分组的数量可以是不相同的,等等。
各个计算节点在获取到输入分组之后,就可以进行计算。根据图3-3所示的输入数据的分组以及将输入数据分配给各个计算节点的方式,针对5个输出,计算节点1对第1、5、9组输入数据,计算节点2对第2、6、10组输入数据,计算节点3对第3、7、11组输入数据,计算节点4对第4、8、12组输入数据进行全连接层计算,所得到的计算结果是针对第5输出的部分和。然后,4个计算节点启动归并加和传输过程,各个计算节点将自身所得的部分和数据与接收到的部分和数据进行加和,再将加和结果发送至下一个计算节点。
图3-4提供了根据本申请一个实施例的计算节点之间的拓扑结构的示意图。如图3-4所示,计算节点1、计算节点2、计算节点3和计算节点4构成一个环状拓扑。例如,针对第5个输出的计算,指定计算节点4为获取最终加和结果的计算节点。计算节点1将其全连接层计算的第1结果传输至计算节点2,计算节点2在收到来自计算节点1的第1结果后,将该第1结果与计算节点2全连接层运算后的第2结果进行加和,得到第1加和结果,将第1加和结果发送至计算节点3,计算节点3将第1加和结果与计算节点3全连接 层运算后的第3结果进行加和,得到第2加和结果,将第2加和结果发送至计算节点4,计算节点4将第2加和结果与计算节点4全连接层运算后的第4结果进行加和,得到第3加和结果,最后,第4结果进行加和存储该第3加和结果,作为针对第5个输出的最终计算结果。
需要提及的是,对于计算节点2来说,其在收到来自计算节点1的第1结果的过程中,将该第1结果与第2结果进行加和,一边接收第1结果,一边执行第1结果与第2结果的加和运算,即收到第1结果的一部分数据,就执行加和运算,一边接收一边执行加和运算。另外,计算节点2将第1结果与第2结果进行加和得到第1加和结果的过程中,将第1加和结果发送至计算节点3,一边执行第1结果与第2结果的加和运算,一边发送第1加和结果,即加和运算得到第1加和结果的一部分数据,就开始发送第1加和结果,一边执行加和运算一边发送。上述边收边运算以及边运算边发送的过程同样适用于其他计算节点,即计算节点1、计算节点3和计算节点4。
这样,接收一部分数据就处理一部分数据,计算获得一部分计算结果就发送一部分计算结果,不是待接收完成后再计算,也不是待计算完成后再发送计算结果,从而大大减少通信时间。
在图3-4所示的实施例中,指定计算节点4为获取最终加和结果的计算节点,本领域技术人员可以理解的是,也可以指定其他任一计算节点为获取最终加和结果的计算节点。而且,这对不同的输出,获取最终加和结果的计算节点可以是不同的。例如,针对第5个输出,将计算节点4指定为获取最终加和结果的计算节点,而针对第6个输出,可以指定计算节点3。
图3-4展示了计算节点1、计算节点2、计算节点3和计算节点4形成一种环状拓扑,本领域技术人员可以理解的是这只是为了便于说明而举出一种拓扑实现方式的示例。根据实际需要和具体的应用场景,多个计算节点形成的拓扑结构包括环状、网状、树状,或者其他包括环状的结构,等等。另外,需要注意的是,计算节点1、计算节点2、计算节点3和计算节点4包括神经网络芯片,和/或神经网络芯片中的计算核。
根据图3-4所示的实施例,计算节点1、计算节点2、计算节点3和计算节点4在各自完成针对第5个输出的计算或者针对第5个输出的加和运算之后,就可执行后续的计算。例如,根据图3-4所示的实施例,可以执行针对第6个输出的运算。可以理解的是,计算节点在完成针对当前输出的计算或者针对当前输出的加和运算之后,所执行的下一个输出的运算与当前输出可以不是同一个全连接层的。另外,计算节点在完成针对当前输出的计算或者针对当前输出的加和运算之后,也可以执行其他神经网络层的运算,例如卷积层、池化层等。
这样,每个计算节点在计算完自己的针对当前输出的全连接层运算后,即可执行后续的针对其他输出的全连接层运算或其他神经网络层的运算,而无须等待计算最慢的计算节点计算完成,从而提高了运算效率。
基于上述实施例,本申请提出一种全连接层运算方法。如图3-5A至图3-5H所示,所述全连接层运算方法包括:步骤S501,基于针对第一输出的输入计算数据进行运算,得到第一结果。
如图3-3和图3-4所示,对于第5输出,需要12组输入数据得出该输出结果。计算节点2针对12组输入数据中的第2、6和10组进行运算,得到一个运算结果,称为第2 结果。
步骤S502,在确定存在来自第二计算节点针对所述第一输出发送的第二结果的情况下,接收所述第二计算节点发送的所述第二结果。
如图3-3和图3-4所示,计算节点1将其运算得到的第1结果发送给计算节点2,计算节点2收到来自计算节点1的第1结果。
步骤S503,在接收所述第二结果的过程中,将所述第一结果与所述第二结果进行加和运算得到第三结果。
如图3-3和图3-4所示,计算节点2在收到第1结果的过程中,将第2结果与第1结果进行加和运算,得到第1加和结果。
对于计算节点2来说,其在收到来自计算节点1的第1结果的过程中,将第1结果与第2结果进行加和,一边接收第1结果,一边执行第1结果与第2结果的加和运算,即收到第1结果的一部分数据,就执行加和运算,一边接收一边执行加和运算。这样,接收一部分数据就处理一部分数据,不是待接收完成后再计算,从而大大减少通信时间。
在进一步的实施例中,所述全连接层运算方法还包括如下步骤:步骤S504,在确定所述第三结果被第三计算节点使用的情况下,在将所述第一结果与所述第二结果进行加和运算得到第三结果的过程中,发送所述第三结果至所述第三计算节点。
如图3-3和图3-4所示,计算节点3需要来自计算节点2的第1加和结果进行后续的计算,那么,计算节点2将第1加和结果发送给计算节点3。在计算节点2将第1结果与第2结果进行加和得到第1加和结果的过程中,将第1加和结果发送给计算节点3。
对于计算节点2来说,其将第1结果与第2结果进行加和得到第1加和结果的过程中,将第1加和结果发送至计算节点3,一边执行第1结果与第2结果的加和运算,一边发送第1加和结果,即加和运算得到第1加和结果的一部分数据,就开始发送第1加和结果,一边执行加和运算一边发送。
这样,计算获得一部分计算结果就发送一部分计算结果,是待计算完成后再发送计算结果,从而大大减少通信时间。
在进一步的实施例中,所述全连接层运算方法还包括如下步骤:步骤S505,在确定所述第三结果不被第三计算节点使用的情况下,将所述第三结果作为所述第一输出的最终结果进行存储。
如图3-3和图3-4所示,对于计算节点4来说,其被指定为获取最终加和结果的计算节点,其进行加和运算获得的第3加和结果作为针对第5个输出的最终计算结果,存储在计算节点4中。
在进一步的实施例中,所述全连接层运算方法还包括如下步骤:步骤S506,在确定不存在来自第二计算节点针对所述第一输出发送的第二结果的情况下,发送所述第一结果。
如图3-3和图3-4所示,对于计算节点1来说,其没有来自其他计算节点的针对第5个输出的发送的运算结果,那么,计算节点1就将第1结果发送至计算节点2。
在进一步的实施例中,所述全连接层运算方法还包括如下步骤:步骤S507,接收针对所述第一输出的输入计算数据。
如图3-3所示,针对第5输出,输入数据具有12组。当然,输入数据也可以包括其他的组数,这都属于本申请覆盖的范围。
在进一步的实施例中,所述全连接层运算方法还包括如下步骤:步骤S508,对所接 收的针对所述第一输出的输入计算数据进行分组。
可将接收的12输入数据进行分组,如图3-3所示,将第1、5、9组分配给计算节点1,将第2、6、10组分配给计算节点2,将第3、7、11组分配给计算节点3,并将第4、8、12组分配给计算节点4。
需要注意的是,图3-3所示的输入数据的分组以及将输入数据分配给各个计算节点的方式为了便于说明而举出的一个具体示例,本申请并不限制输入数据的分组以及将输入数据分配给各个计算节点的方式。在图3-3所示实施例的启发下,本领域技术人员可以想到其他的输入数据的分组以及将输入数据分配给各个计算节点的方式,都属于本申请覆盖的范围。例如,可以将将输入数据分为20个组,将连续5个输入分组分配给一个计算节点。又如,不是将输入数据分组平均分配给多个计算节点,即每个计算节点所分配到的输入分组的数量可以是不相同的,等等。
所以,分组的方式,对于各个计算节点既可以相隔相同数据分组均匀分配(图3-3中,每个计算节点获得数据分组之间均相隔4个数据组),也可以相隔不同数据分组不均匀的分配;每个计算节点获得数据分组既可以是相互隔开的,也可以是连续的;每个计算节点获得数据分组的数量既可以是相同,也可以是不同的,等等。本领域技术人员根据实际需要和具体的应用场景,可以采用任何适合的分组方式,这些都属于本申请覆盖的范围。
特别地,在一个优选实施例中,计算节点可以在针对第一输出的所有输入数据拆分成的N个数据组中,每间隔a个数据组接收一组输入数据,形成所述针对第一输出的输入计算数据,其中,a表示计算节点的个数,N为a的整数倍。这样,能够将输入数据更平均地分配给每个计算节点,使得每个计算节点所负担的运算数据更接近。
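The interleaved assignment of this preferred embodiment (one data group every a groups, a being the number of computing nodes and N a multiple of a) amounts to a simple stride selection. The following sketch merely restates that rule; the 1-based numbering follows Figure 3-3 and the function name is invented for the illustration.

```python
def assigned_groups(node_index, num_nodes, num_groups):
    """1-based data-group numbers handled by one computing node.

    node_index : 1-based index of the computing node (num_nodes = a in total)
    num_groups : N, a multiple of the number of computing nodes
    """
    return list(range(node_index, num_groups + 1, num_nodes))

# With N = 12 groups and a = 4 nodes this reproduces the layout of Figure 3-3:
# node 1 -> [1, 5, 9], node 2 -> [2, 6, 10],
# node 3 -> [3, 7, 11], node 4 -> [4, 8, 12]
```

Spreading the groups at a fixed stride keeps the amount of data assigned to each computing node as close to equal as possible, which is the load-balancing intent stated above.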
如图3-3所示,将输入数据分为12个组,对于4个计算节点,将第1、5、9组分配给计算节点1,将第2、6、10组分配给计算节点2,将第3、7、11组分配给计算节点3,并将第4、8、12组分配给计算节点4。
在进一步的实施例中,所述全连接层运算方法还包括如下步骤:步骤S509,在完成将所述第一结果与所述第二结果进行加和运算得到第三结果后,执行后续针对第二输出的运算。
如图3-4所示,计算节点2在完成针对第5个输出的加和运算之后,就可执行后续的计算。例如,根据图3-4所示的实施例,计算节点2可以执行针对第6个输出的运算。可以理解的是,计算节点在完成针对当前输出的加和运算之后,所执行的下一个输出的运算与当前输出可以不是同一个全连接层的。另外,计算节点在完成针对当前输出的计算或者针对当前输出的加和运算之后,也可以执行其他神经网络层的运算,例如卷积层、池化层等。
在进一步的实施例中,所述全连接层运算方法还包括如下步骤:步骤S510,在完成基于针对第一输出的输入计算数据的运算后,执行后续针对第二输出的运算。
如图3-4所示,计算节点1在完成针对第5个输出的计算之后,就可执行后续的计算。例如,根据图3-4所示的实施例,计算节点1可以执行针对第6个输出的运算。可以理解的是,计算节点在完成针对当前输出的计算之后,所执行的下一个输出的运算与当前输出可以不是同一个全连接层的。另外,计算节点在完成针对当前输出的计算之后,也可以执行其他神经网络层的运算,例如卷积层、池化层等。
这样,每个计算节点在计算完自己的针对当前输出的全连接层运算后,即可执行后 续的针对其他输出的全连接层运算或其他神经网络层的运算,而无须等待计算最慢的计算节点计算完成,从而提高了运算效率。
另外,需要注意的是,计算节点1、计算节点2、计算节点3和计算节点4包括神经网络芯片,和/或神经网络芯片中的计算核。而且,计算节点1、计算节点2、计算节点3和计算节点4之间可以采用任意的拓扑结构,譬如环状、网状、树状,或者其他包括环状的结构。
根据上述全连接层运算方法,在全连接层运算的过程中,多个计算节点针对一个输出协同运算,每个计算节点能够在接收其他计算节点的运算结果的过程中进行加和,并且在加和获得结果的过程中发送加和的结果,即接收一部分数据就处理一部分数据,计算获得一部分计算结果就发送一部分计算结果,不是待接收完成后再计算,也不是待计算完成后再发送计算结果,从而大大减少通信时间。此外,每个计算节点在计算完自己的针对当前输出的全连接层运算后,即可执行后续的针对其他输出的全连接层运算或其他神经网络层的运算,而无须等待计算最慢的计算节点计算完成,从而提高了运算效率。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本披露并不受所描述的动作顺序的限制,因为依据本披露,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本披露所必须的。
进一步需要说明的是,虽然图3-5A至图3-5H的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图3-5A至图3-5H中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
根据另一个实施例,本发明还提供一种神经网络全连接层运算装置。如图3-6A至图3-6H所示,该神经网络全连接层运算装置包括:第一运算单元3-601,用于基于针对第一输出的输入计算数据进行运算,得到第一结果。
如图3-3和图3-4所示,对于第5输出,需要12组输入数据得出该输出结果。计算节点2针对12组输入数据中的第2、6和10组进行运算,得到一个运算结果,称为第2结果。
第一接收单元3-602,用于在确定存在来自第二计算节点针对所述第一输出发送的第二结果的情况下,接收所述第二计算节点发送的所述第二结果
如图3-3和图3-4所示,计算节点1将其运算得到的第1结果发送给计算节点2,计算节点2收到来自计算节点1的第1结果。
加和单元3-603,用于在接收所述第二结果的过程中,将所述第一结果与所述第二结果进行加和运算得到第三结果。
如图3-3和图3-4所示,计算节点2在收到第1结果的过程中,将第2结果与第1结果进行加和运算,得到第1加和结果。
对于计算节点2来说,其在收到来自计算节点1的第1结果的过程中,将第1结果 与第2结果进行加和,一边接收第1结果,一边执行第1结果与第2结果的加和运算,即收到第1结果的一部分数据,就执行加和运算,一边接收一边执行加和运算。这样,接收一部分数据就处理一部分数据,而不是待接收完成后再计算,从而大大减少通信时间。
在进一步的实施例中,所述全连接层运算装置还包括:第一发送单元3-604,用于在确定所述第三结果被第三计算节点使用的情况下,在将所述第一结果与所述第二结果进行加和运算得到第三结果的过程中,发送所述第三结果至所述第三计算节点。
如图3-3和图3-4所示,计算节点3需要来自计算节点2的第1加和结果进行后续的计算,那么,计算节点2将第1加和结果发送给计算节点3。在计算节点2将第1结果与第2结果进行加和得到第1加和结果的过程中,将第1加和结果发送给计算节点3。
对于计算节点2来说,其将第1结果与第2结果进行加和得到第1加和结果的过程中,将第1加和结果发送至计算节点3,一边执行第1结果与第2结果的加和运算,一边发送第1加和结果,即加和运算得到第1加和结果的一部分数据,就开始发送第1加和结果,一边执行加和运算一边发送。
这样,计算获得一部分计算结果就发送一部分计算结果,而不是待计算完成后再发送计算结果,从而大大减少通信时间。
在进一步的实施例中,所述全连接层运算装置还包括:存储单元3-605,用于在确定所述第三结果不被第三计算节点使用的情况下,将所述第三结果作为所述第一输出的最终结果进行存储。
如图3-3和图3-4所示,对于计算节点4来说,其被指定为获取最终加和结果的计算节点,其进行加和运算获得的第3加和结果作为针对第5个输出的最终计算结果,存储在计算节点4中。
在进一步的实施例中,所述全连接层运算装置还包括:第二发送单元3-606,用于在确定不存在来自第二计算节点针对所述第一输出发送的第二结果的情况下,发送所述第一结果。
如图3-3和图3-4所示,对于计算节点1来说,其没有来自其他计算节点的针对第5个输出的发送的运算结果,那么,计算节点1就将第1结果发送至计算节点2。
在进一步的实施例中,所述全连接层运算装置还包括:第二接收单元3-607,用于接收针对所述第一输出的输入计算数据。
如图3-3所示,针对第5输出,输入数据具有12组。当然,输入数据也可以包括其他的组数,这都属于本申请覆盖的范围。
在进一步的实施例中,所述全连接层运算装置还包括:拆分单元3-608,用于对所接收的针对所述第一输出的输入计算数据进行分组。
可将接收的12输入数据进行分组,如图3-3所示,将第1、5、9组分配给计算节点1,将第2、6、10组分配给计算节点2,将第3、7、11组分配给计算节点3,并将第4、8、12组分配给计算节点4。
需要注意的是,图3-3所示的输入数据的分组以及将输入数据分配给各个计算节点的方式为了便于说明而举出的一个具体示例,本申请并不限制输入数据的分组以及将输入数据分配给各个计算节点的方式。在图3-3所示实施例的启发下,本领域技术人员可以想到其他的输入数据的分组以及将输入数据分配给各个计算节点的方式,都属于本申请覆盖的范围。例如,可以将将输入数据分为20个组,将连续5个输入分组分配给一个计算节 点。又如,不是将输入数据分组平均分配给多个计算节点,即每个计算节点所分配到的输入分组的数量可以是不相同的,等等。
所以,分组的方式,对于各个计算节点既可以相隔相同数据分组均匀分配(图3-3中,每个计算节点获得数据分组之间均相隔4个数据组),也可以相隔不同数据分组不均匀的分配;每个计算节点获得数据分组既可以是相互隔开的,也可以是连续的;每个计算节点获得数据分组的数量既可以是相同,也可以是不同的,等等。本领域技术人员根据实际需要和具体的应用场景,可以采用任何适合的分组方式,这些都属于本申请覆盖的范围。
特别地,在一个优选实施例中,计算节点可以在针对第一输出的所有输入数据拆分成的N个数据组中,每间隔a个数据组接收一组输入数据,形成所述针对第一输出的输入计算数据,其中,a表示计算节点的个数,N为a的整数倍。这样,能够将输入数据更平均地分配给每个计算节点,使得每个计算节点所负担的运算数据更接近。
如图3-3所示,将输入数据分为12个组,对于4个计算节点,将第1、5、9组分配给计算节点1,将第2、6、10组分配给计算节点2,将第3、7、11组分配给计算节点3,并将第4、8、12组分配给计算节点4。
在进一步的实施例中,所述全连接层运算装置还包括:第二运算单元3-609,用于在完成将所述第一结果与所述第二结果进行加和运算得到第三结果后,执行后续针对第二输出的运算。
如图3-4所示,计算节点2在完成针对第5个输出的加和运算之后,就可执行后续的计算。例如,根据图3-4所示的实施例,计算节点2可以执行针对第6个输出的运算。可以理解的是,计算节点在完成针对当前输出的加和运算之后,所执行的下一个输出的运算与当前输出可以不是同一个全连接层的。另外,计算节点在完成针对当前输出的计算或者针对当前输出的加和运算之后,也可以执行其他神经网络层的运算,例如卷积层、池化层等。
在进一步的实施例中,所述全连接层运算装置还包括:第三运算单元3-610,用于在完成基于针对第一输出的输入计算数据的运算后,执行后续针对第二输出的运算。
如图3-4所示,计算节点1在完成针对第5个输出的计算之后,就可执行后续的计算。例如,根据图3-4所示的实施例,计算节点1可以执行针对第6个输出的运算。可以理解的是,计算节点在完成针对当前输出的计算之后,所执行的下一个输出的运算与当前输出可以不是同一个全连接层的。另外,计算节点在完成针对当前输出的计算之后,也可以执行其他神经网络层的运算,例如卷积层、池化层等。
这样,每个计算节点在计算完自己的针对当前输出的全连接层运算后,即可执行后续的针对其他输出的全连接层运算或其他神经网络层的运算,而无须等待计算最慢的计算节点计算完成,从而提高了运算效率。
另外,需要注意的是,计算节点1、计算节点2、计算节点3和计算节点4包括神经网络芯片,和/或神经网络芯片中的计算核。而且,计算节点1、计算节点2、计算节点3和计算节点4之间可以采用任意的拓扑结构,譬如环状、网状、树状,或者其他包括环状的结构。
根据上述全连接层运算装置,在全连接层运算的过程中,多个计算节点针对一个输出协同运算,每个计算节点能够在接收其他计算节点的运算结果的过程中进行加和,并且在加和获得结果的过程中发送加和的结果,即接收一部分数据就处理一部分数据,计算获 得一部分计算结果就发送一部分计算结果,而不是待接收完成后再计算,也不是待计算完成后再发送计算结果,从而大大减少通信时间。此外,每个计算节点在计算完自己的针对当前输出的全连接层运算后,即可执行后续的针对其他输出的全连接层运算或其他神经网络层的运算,而无须等待计算最慢的计算节点计算完成,从而提高了运算效率。
参阅图3-7,图3-7提供一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如图3-5A至图3-5H所示的方法以及细化方案。
应该理解,上述的装置实施例仅是示意性的,本披露的装置还可通过其它的方式实现。例如,上述实施例中所述单元/模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。例如,多个单元、模块或组件可以结合,或者可以集成到另一个系统,或一些特征可以忽略或不执行。
另外,若无特别说明,在本披露各个实施例中的各功能单元/模块可以集成在一个单元/模块中,也可以是各个单元/模块单独物理存在,也可以两个或两个以上单元/模块集成在一起。上述集成的单元/模块既可以采用硬件的形式实现,也可以采用软件程序模块的形式实现。
所述集成的单元/模块如果以硬件的形式实现时,该硬件可以是数字电路,模拟电路等等。硬件结构的物理实现包括但不局限于晶体管,忆阻器等等。若无特别说明,所述处理器或芯片可以是任何适当的硬件处理器,比如CPU、GPU、FPGA、DSP和ASIC等等。若无特别说明,所述片上缓存、片外内存、存储器可以是任何适当的磁存储介质或者磁光存储介质,比如,阻变式存储器RRAM(Resistive Random Access Memory)、动态随机存取存储器DRAM(Dynamic Random Access Memory)、静态随机存取存储器SRAM(Static Random-Access Memory)、增强动态随机存取存储器EDRAM(Enhanced Dynamic Random Access Memory)、高带宽内存HBM(High-Bandwidth Memory)、混合存储立方HMC(Hybrid Memory Cube)等等。
所述集成的单元/模块如果以软件程序模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,本披露的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本披露各个实施例所述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
本申请实施例还提供一种计算机可读存储介质,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如图3-5A至图3-5H所示的方法以及细化方案。
本申请实施例还提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如图3-5A至图3-5H所示的方法以及细化方案。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。上述实施例的各技术特征可以进行任意的组合,为使描 述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
依据以下条款可更好地理解前述内容:
条款C1,一种全连接层运算方法,所述方法应用于包括多个计算节点的人工智能处理器,对于任一计算节点,所述方法包括:基于针对第一输出的输入计算数据进行运算,得到第一结果;在确定存在来自第二计算节点针对所述第一输出发送的第二结果的情况下,接收所述第二计算节点发送的所述第二结果;以及在接收所述第二结果的过程中,将所述第一结果与所述第二结果进行加和运算得到第三结果。
条款C2,如条款C1所述的方法,还包括:在确定所述第三结果被第三计算节点使用的情况下,在将所述第一结果与所述第二结果进行加和运算得到第三结果的过程中,发送所述第三结果至所述第三计算节点。
条款C3,如条款C1所述的方法,还包括:在确定所述第三结果不被第三计算节点使用的情况下,将所述第三结果作为所述第一输出的最终结果进行存储。
条款C4,如条款C1所述的方法,还包括:在确定不存在来自第二计算节点针对所述第一输出发送的第二结果的情况下,发送所述第一结果。
条款C5,如条款C1至C4任意一者所述的方法,还包括:接收针对所述第一输出的输入计算数据。
条款C6,如条款C5所述的方法,还包括对所接收的针对所述第一输出的输入计算数据进行分组。
条款C7，如条款C6所述的方法，其中，所述接收针对第一输出的输入计算数据包括：在针对第一输出的所有输入数据拆分成的N个数据组中，每间隔a个数据组接收一组输入数据，形成所述针对第一输出的输入计算数据，其中，a表示计算节点的个数，N为a的整数倍。
条款C8,如条款C1或C2所述的方法,还包括:在完成将所述第一结果与所述第二结果进行加和运算得到第三结果后,执行后续针对第二输出的运算。
条款C9,如条款C4所述的方法,还包括:在完成基于针对第一输出的输入计算数据的运算后,执行后续针对第二输出的运算。
条款C10,如条款C1至C9任意一者所述的方法,其中,所述多个计算节点形成的拓扑结构包括环状、网状、树状,或者其他包括环状的结构。
条款C11,如条款C1至C10任意一者所述的方法,其中,所述计算节点包括神经网络芯片或者所述神经网络芯片中的计算核。
条款C12,一种全连接层运算装置,所述装置应用于包括多个计算节点的人工智能处理器,对于任一计算节点,所述装置包括:第一运算单元,用于基于针对第一输出的输入计算数据进行运算,得到第一结果;第一接收单元,用于在确定存在来自第二计算节点针对所述第一输出发送的第二结果的情况下,接收所述第二计算节点发送的所述第二结果;以及加和单元,用于在接收所述第二结果的过程中,将所述第一结果与所述第二结果进行加和运算得到第三结果。
条款C13,如条款C12所述的装置,还包括:第一发送单元,用于在确定所述第三结果被第三计算节点使用的情况下,在将所述第一结果与所述第二结果进行加和运算得到第三结果的过程中,发送所述第三结果至所述第三计算节点。
条款C14,如条款C12所述的装置,还包括:存储单元,用于在确定所述第三结果不被第三计算节点使用的情况下,将所述第三结果作为所述第一输出的最终结果进行存储。
条款C15,如条款C12所述的装置,还包括:第二发送单元,用于在确定不存在来自第二计算节点针对所述第一输出发送的第二结果的情况下,发送所述第一结果。
条款C16,如条款C12至C15任意一者所述的装置,还包括:第二接收单元,用于接收针对所述第一输出的输入计算数据。
条款C17,如条款C16所述的装置,还包括:拆分单元,用于对所接收的针对所述第一输出的输入计算数据进行分组。
条款C18，如条款C17所述的装置，其中，所述第二接收单元用于：在针对第一输出的所有输入数据拆分成的N个数据组中，每间隔a个数据组接收一组输入数据，形成所述针对第一输出的输入计算数据，其中，a表示计算节点的个数，N为a的整数倍。
条款C19,如条款C12或C13所述的装置,还包括:第二运算单元,用于在完成将所述第一结果与所述第二结果进行加和运算得到第三结果后,执行后续针对第二输出的运算。
条款C20,如条款C15所述的装置,还包括:第三运算单元,用于在完成基于针对第一输出的输入计算数据的运算后,执行后续针对第二输出的运算。
条款C21,如条款C12至C20任意一者所述的装置,其中,所述多个计算节点形成的拓扑结构包括环状、网状、树状,或者其他包括环状的结构。
条款C22,如条款C12至C21任意一者所述的装置,其中,所述计算节点包括神经网络芯片或者所述神经网络芯片中的计算核。
条款C23,一种电子设备,其特征在于,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如条款C1-C11任一所述的方法。
条款C24,一种计算机可读存储介质,其特征在于,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如条款C1-C11任一项所述的方法。
条款C25,一种计算机程序产品,其特征在于,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如条款C1-C11任一项所述的方法。
以上对本披露实施例进行了详细介绍，本文中应用了具体个例对本披露的原理及实施方式进行了阐述，以上实施例的说明仅用于帮助理解本披露的方法及其核心思想。同时，本领域技术人员依据本披露的思想，基于本披露的具体实施方式及应用范围上做出的改变或变形之处，都属于本披露保护的范围。综上所述，本说明书内容不应理解为对本披露的限制。
本申请涉及信息处理技术领域,具体涉及一种神经网络协同训练方法、装置以及相关产品。
目前,人工神经网络是所有智能方法中最常见的计算模型之一。在进行神经网络各个网络层的运算过程中以及神经网络训练的过程中,存在数据通信的通信时间以及处理数据的计算时间。
然而,现有技术中还没有有效减少通信时间,使得数据通信的时间被数据计算的时间覆盖的方案。为了改进性能,有必要采用各种手段来改进神经网络中的网络层运算以及协同训练的过程。
为了解决上述的问题,我们提出如下方案。下面将结合本披露实施例中的附图,对本披露实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本披露一部分实施例,而不是全部的实施例。基于本披露中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本披露保护的范围。
应当理解,本披露的权利要求、说明书及附图中的术语“第一”、“第二”、“第三”和“第四”等是用于区别不同对象,而不是用于描述特定顺序。本披露的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。
参阅图4-1,图4-1提供了一种神经网络构架示意图,如图4-1所示,神经网络构架可以包括多层结构,该多层结构如图4-1所示,可以包括:输入层、卷积层1、batchnorm层、卷积层2、中间层(依据不同功能的神经网络构架具有不同的中间层,该中间层可以为至少一层)、卷积层n、全连接层、激活(例如激活函数:softmax)层。对于神经网络构架,计算量较大的层可以称为计算层,例如卷积层、全连接层等等,当然在实际应用中,上述计算层还可以包含其他类型的层,另外,本申请提供的图4-1中的神经网络构架仅仅是为了举例说明,本申请中的神经网络并不局限如图4-1所示的构架。
图4-2提供了根据本申请一个实施例的多核系统的示意图。如图4-2所示，该多核系统可以为一个神经网络芯片。该多核系统包括16个核（CORE），包含4个存储节点，16个核通过一个环状的NOC与4个存储节点DRAM相连。需要注意的是，该多核系统的核可以为神经网络芯片中的计算核，存储节点的类型可以是任意类型的存储器，例如，动态随机存取存储器（Dynamic Random Access Memory，DRAM）、静态随机存取存储器（Static Random Access Memory，SRAM）等。
根据图4-2所示的示例，多核系统包括16个核以及4个存储节点。然而，可以理解的是，多核系统可以包括任意数量的核以及任意数量的存储节点，这些都属于本申请覆盖的范围。
图4-3提供了根据本申请一个实施例的协同训练系统的拓扑结构的示意图。如图4-3所示，协同训练系统包括控制节点和多个计算节点，控制节点和多个计算节点之间可以传递数据。在图4-3所示实施例中，控制节点和计算节点的数量分别为1个和8个，然而，本领域技术人员能够理解的是，根据实际需要和具体的应用，控制节点和计算节点的数量可以是任意的。
虽然图4-3显示了控制节点和计算节点采用环状拓扑结构，然而，这只是为便于说明本申请方案所举的一个具体实现方式。需要注意的是，根据实际需要和具体的应用，控制节点和计算节点之间可以采用任意的拓扑结构，譬如环状、网状、树状，或者其他包括环状的结构。另外，控制节点包括参数服务节点。控制节点和计算节点包括神经网络芯片或者所述神经网络芯片中的计算核。
图4-4提供了根据本申请一个实施例的协同训练的示意图。控制节点向所有计算节点发送获取梯度更新数据信号。
在一个实施例中,该获取梯度更新数据信号可以包括需要相关计算节点的权值梯度数据的计算节点标识。比如,控制节点希望获取计算节点1、计算节点2、计算节点4和计算节点5的权值梯度数据。那么,每个计算节点在收到获取梯度更新数据信号后,确认自己是否满足获取梯度更新数据信号的条件。
在另一个实施例中，该获取梯度更新数据信号还可以包括更新的权值梯度数据的代数的标识。计算节点将更新的权值梯度数据的代数与本地的权值梯度数据所标识的代数进行比较，如果二者的差值符合预期，计算节点将本地的权值梯度数据合并到此次训练传输中。比如，更新的权值梯度数据的代数标识为8，而代数差值预定为3，本地的权值梯度数据所标识的代数为5，代数差值符合预期，那么，计算节点就将本地的权值梯度数据合并到此次训练传输中。
在又一个实施例中,该获取梯度更新数据信号可以包括需要相关计算节点的权值梯度数据的计算节点标识和更新的权值梯度数据的代数的标识。这样,计算节点在同时满足需要相关计算节点的权值梯度数据的计算节点标识和更新的权值梯度数据的代数的标识的情况下,才需要将本地的权值梯度数据合并到此次训练传输中。
以上仅仅举出了获取梯度更新数据信号的一些具体实现方式,本申请并不限于上述实现方式,本领域技术人员在上述实施例的启示下想到的其他实现方式都属于本申请覆盖的范围。
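为便于理解上述判断条件，下面给出一个简化的示意性代码（Python）。其中信号的字段名（如node_ids、generation）以及预定差值的取值均为本文为说明而假设的：

```python
def should_merge(signal, node_id, local_generation, expected_gap=3):
    """判断一个计算节点是否需要把本地权值梯度数据合并到此次训练传输中。

    signal为获取梯度更新数据信号，可以包含：
      - "node_ids"：需要其权值梯度数据的计算节点标识集合（可选）；
      - "generation"：更新的权值梯度数据的代数标识（可选）。
    两个条件若同时存在，则须同时满足。
    """
    if "node_ids" in signal and node_id not in signal["node_ids"]:
        return False
    if "generation" in signal:
        # 代数差值符合预期（此处假设预定差值为3）才参与本次传输
        if signal["generation"] - local_generation != expected_gap:
            return False
    return True


# 对应正文中的例子：更新的代数标识为8，本地代数为5，预定差值为3
signal = {"node_ids": {1, 2, 4, 5}, "generation": 8}
print(should_merge(signal, node_id=2, local_generation=5))  # True
print(should_merge(signal, node_id=3, local_generation=5))  # False
```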
控制节点发送获取梯度更新数据信号后，各个计算节点在判断自身是否符合该获取梯度更新数据信号的要求的过程中自动形成了分组，从而在多个计算节点算力不匹配的时候，可以只同步部分计算节点，减少了不同计算节点之间的等待开销，提高运算效率。
如图4-4所示,假设计算节点1、计算节点2、计算节点4和计算节点5(用阴影表示)满足获取梯度更新数据信号的条件,需要将本地的权值梯度数据合并到此次训练传输中。对于计算节点1来说,其只需将所获取的权值梯度数据1发送至计算节点2。对于计算节点2来说,将来自计算节点1的权值梯度数据1与本地获取的权值梯度数据2进行加和,并将加和结果发送至计算节点3。对于计算节点3来说,由于其不满足获取梯度更新数据信号的条件,无需将本地的权值梯度数据合并到此次训练传输中,那么计算节点3只需将所接收的来自计算节点2的权值梯度数据发送出去(直传)。
在一个实施例中，计算节点2将来自计算节点1的权值梯度数据1与本地获取的权值梯度数据2进行加和并将加和结果发送至计算节点3的过程，是一边进行加和处理一边发送加和结果，即计算得到一部分加和结果就发送一部分，不是待计算完成后再发送计算结果。在另一个实施例中，计算节点3将所接收的来自计算节点2的权值梯度数据发送出去的过程，是一边接收数据一边发送数据，即接收一部分数据就发送一部分，而不是待接收完成后再发送。所以，上述边计算边发送以及边接收边发送的方式，能够有效减少通信时间。
计算节点4和计算节点5采用类似于计算节点2处理和发送数据的方式,计算节点6、计算节点7和计算节点8采用类似于计算节点3处理和发送数据的方式。
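下面用一段示意性代码（Python）区分上述两类计算节点的处理方式：满足条件的计算节点边接收、边加和、边转发，不满足条件的计算节点边接收、边直传。分块方式与回调接口均为便于说明的假设：

```python
def add_and_forward(recv_iter, local_chunks, send):
    """满足合并条件的计算节点（如计算节点2）：
    每收到一块上游数据，就与本地权值梯度数据的对应块加和并立即转发。"""
    for upstream, local in zip(recv_iter, local_chunks):
        send([u + l for u, l in zip(upstream, local)])


def passthrough(recv_iter, send):
    """不满足合并条件的计算节点（如计算节点3）：
    每收到一块数据就立即转发（直传），而不是收完再发。"""
    for chunk in recv_iter:
        send(chunk)


# 简单演示：计算节点2加和并转发，计算节点3直传
forwarded = []
add_and_forward(iter([[1, 1], [1, 1]]), [[2, 2], [2, 2]], forwarded.append)
print(forwarded)              # [[3, 3], [3, 3]]
relayed = []
passthrough(iter(forwarded), relayed.append)
print(relayed)                # [[3, 3], [3, 3]]
```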
如图4-4所示,当控制节点收到传输回来的归并的权值梯度数据后,更新权值数据,并将更新的权值数据广播给所有计算节点,同时在信息中标记标签,表示该更新的权值数据的代数。如图4-4所示,每个计算节点在收到更新的权值数据后将其保存,更新本地的权值数据,在下次训练的时候,使用该更新的权值数据进行训练,同时训练得到的权值梯度数据使用更新的权值数据附带的标签标记。
每个计算节点在接收到更新的权值数据的过程中，在存在接收权值数据的下一个计算节点的情况下，将该权值数据发送给下一个计算节点。如图4-4所示，计算节点1将权值数据发送给计算节点2，计算节点2将权值数据发送给计算节点3，……计算节点7将权值数据发送给计算节点8。计算节点在接收和发送权值数据时，可以采用接收完成再发送的方式。并且，在一个优选的实施例中，计算节点在接收和发送权值数据时，也可以采用边接收边发送的方式，即接收一部分数据就发送一部分，而不是待接收完成后再发送。
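作为帮助理解的草图，下面用Python示意控制节点附带代数标签广播更新的权值数据、计算节点保存该权值数据及标签的过程。其中的类名、字段名以及学习率等均为本文为说明而假设的简化：

```python
class ControlNode:
    """控制节点：收到归并的权值梯度数据后更新权值，并附带代数标签广播。"""

    def __init__(self, weights, lr=0.01):
        self.weights = list(weights)
        self.generation = 0
        self.lr = lr

    def update_and_broadcast(self, merged_gradients, compute_nodes):
        # 更新权值数据，并将代数标签加1后广播给所有计算节点
        self.weights = [w - self.lr * g
                        for w, g in zip(self.weights, merged_gradients)]
        self.generation += 1
        for node in compute_nodes:
            node.receive_weights(self.weights, self.generation)


class ComputeNode:
    """计算节点：保存更新的权值数据及其代数标签，供下次训练使用；
    下次训练得到的权值梯度数据也用该标签标记。"""

    def __init__(self):
        self.weights, self.generation = None, None

    def receive_weights(self, weights, generation):
        self.weights, self.generation = list(weights), generation


ctrl = ControlNode(weights=[1.0, 1.0])
nodes = [ComputeNode() for _ in range(3)]
ctrl.update_and_broadcast([0.5, -0.5], nodes)
print(nodes[0].weights, nodes[0].generation)   # [0.995, 1.005] 1
```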
在一个可选的实施例中，为了自动化地优化传输分组，需要动态地对计算节点的分组进行优化。首先，计算节点在传递本地权值梯度数据的时候，将生成该权值梯度数据的时间附带在数据内，传递回控制节点。然后，控制节点在每一个分组传输结束之后，将每个分组各个计算节点反馈回来的时间戳进行比较。最后，对各个分组有重合的部分进行交换，直到各个分组内各个计算节点返回的时间戳与另一个分组完全分离。
图4-5提供了根据本申请一个实施例的动态调整计算节点分组的示意图。如图4-5所示,原来的分组方式为:计算节点1、计算节点2、计算节点4和计算节点5为一个分组,而计算节点3、计算节点6、计算节点7和计算节点8为另一个分组。控制节点将计算节点1至计算节点8返回的时间戳进行相互比较后,为了使得两个分组的时间戳没有时间上的重叠交错,需要将计算节点3和计算节点5的位置交换,那么,控制节点就将计算节点3和计算节点5的位置交换,实现计算节点分组的动态调整,调整后的分组方式为:计算节点1、计算节点2、计算节点3和计算节点4为一个分组,而计算节点5、计算节点6、计算节点7和计算节点8为另一个分组。
为了便于对方案的理解,图4-5给出了8个计算节点和两个分组的实施例。本领域技术人员可以理解的是,根据实际需要和具体应用,上述动态分组的方式可以适用其他任意数量的计算节点和其他任意数量的分组。并且,本领域技术人员在上述实施例的启发下想到的其他动态分组方式都属于本申请覆盖的范围。
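下面给出依据时间戳动态调整分组的一个简化示意（Python）。其中的交换策略（每次交换一个分组中时间戳最晚的节点与另一个分组中时间戳最早的节点）是为便于说明而假设的一种实现，实际策略可以依具体应用而定：

```python
def regroup_by_timestamp(group_a, group_b):
    """根据各计算节点反馈回来的时间戳动态调整两个分组：
    反复交换两组中时间上交错的节点，直到两组的时间戳区间完全分离。

    group_a / group_b：{节点编号: 时间戳}；返回调整后的两个节点编号集合。
    """
    a, b = dict(group_a), dict(group_b)
    while max(a.values()) > min(b.values()):
        late_a = max(a, key=a.get)     # a组中时间戳最晚的节点
        early_b = min(b, key=b.get)    # b组中时间戳最早的节点
        a[early_b], b[late_a] = b.pop(early_b), a.pop(late_a)
    return set(a), set(b)


# 对应图4-5的例子：计算节点5的时间戳与另一分组交错，与计算节点3互换后两组完全分离
group_a = {1: 10, 2: 11, 4: 12, 5: 25}
group_b = {3: 13, 6: 26, 7: 27, 8: 28}
print(regroup_by_timestamp(group_a, group_b))
# 输出两个分组：{1, 2, 3, 4} 与 {5, 6, 7, 8}（集合的打印顺序可能不同）
```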
基于上述实施例，本申请提出一种协同训练方法。如图4-6A至图4-6I所示，所述协同训练方法包括：
步骤4-S601,获取第一权值梯度数据。
在图4-4所示的实施例中,每个计算节点在训练后获得本地获取的权值梯度数据。
步骤4-S602，在存在来自所述多个计算节点中的第二计算节点的第二权值梯度数据的情况下，在将来自所述第二计算节点的所述第二权值梯度数据与所述第一权值梯度数据进行加和运算得到更新的权值梯度数据的过程中，发送所述更新的权值梯度数据。
在图4-4所示的实施例中,对计算节点2来说,存在来自计算节点1的权值梯度数据1,将来自计算节点1的权值梯度数据1与本地获取的权值梯度数据2进行加和得到更新的权值梯度数据,并将该更新的权值梯度数据发送至计算节点3。
计算节点2将来自计算节点1的权值梯度数据1与本地获取的权值梯度数据2进行加和并将加和结果发送至计算节点3的过程,是一边进行加和处理一边发送加和结果,即计算得到一部分加和结果就发送一部分,不是待计算完成后再发送计算结果,从而能够有效减少通信时间。
在进一步的实施例中,所述协同训练方法还包括:步骤4-S603,在不存在来自所述第二计算节点的权值梯度数据的情况下,发送所述第一权值梯度数据。
在图4-4所示的实施例中,对计算节点1来说,由于不存在来自其他计算节点的权值梯度数据,其只需将所获取的权值梯度数据1发送至计算节点2。
在进一步的实施例中,所述协同训练方法还包括:步骤4-S604,接收获取梯度更新数据信号。
在图4-4所示的实施例中,控制节点向所有计算节点发送获取梯度更新数据信号。
在一个实施例中，该获取梯度更新数据信号可以包括需要相关计算节点的权值梯度数据的计算节点标识。比如，控制节点希望获取计算节点1、计算节点2、计算节点4和计算节点5的权值梯度数据。那么，每个计算节点在收到获取梯度更新数据信号后，确认自己是否满足获取梯度更新数据信号的条件。
在另一个实施例中，该获取梯度更新数据信号还可以包括更新的权值梯度数据的代数的标识。计算节点将更新的权值梯度数据的代数与本地的权值梯度数据所标识的代数进行比较，如果二者的差值符合预期，计算节点将本地的权值梯度数据合并到此次训练传输中。比如，更新的权值梯度数据的代数标识为8，而代数差值预定为3，本地的权值梯度数据所标识的代数为5，代数差值符合预期，那么，计算节点就将本地的权值梯度数据合并到此次训练传输中。
在又一个实施例中,该获取梯度更新数据信号可以包括需要相关计算节点的权值梯度数据的计算节点标识和更新的权值梯度数据的代数的标识。这样,计算节点在同时满足需要相关计算节点的权值梯度数据的计算节点标识和更新的权值梯度数据的代数的标识的情况下,才需要将本地的权值梯度数据合并到此次训练传输中。
控制节点发送获取梯度更新数据信号后，各个计算节点在判断自身是否符合该获取梯度更新数据信号的要求的过程中自动形成了分组，从而在多个计算节点算力不匹配的时候，可以只同步部分计算节点，减少了不同计算节点之间的等待开销，提高运算效率。
在进一步的实施例中,所述协同训练方法还包括:步骤4-S605,在符合所述获取梯度更新数据信号的要求的情况下,执行步骤4-S602或步骤4-S603。
在图4-4所示的实施例中，计算节点1、计算节点2、计算节点4和计算节点5符合所述获取梯度更新数据信号的要求，需要将本地的权值梯度数据合并到此次训练传输中，而将本地的权值梯度数据合并到此次训练传输中的方式通过步骤4-S602或步骤4-S603来实现。
在进一步的实施例中,所述协同训练方法还包括:步骤4-S606,在不符合所述获取梯度更新数据信号的要求且存在来自所述第二计算节点的所述第二权值梯度数据的情况下,在接收所述第二权值梯度数据的过程中,发送所述第二权值梯度数据。
在图4-4所示的实施例中,计算节点3不符合所述获取梯度更新数据信号的要求并存在来自计算节点2的权值梯度数据,那么计算节点3只需将所接收的来自计算节点2的权值梯度数据发送出去(直传)。
在一个实施例中，计算节点3将所接收的来自计算节点2的权值梯度数据发送出去的过程，是一边接收数据一边发送数据，即接收一部分数据就发送一部分，而不是待接收完成后再发送。所以，能够有效减少通信时间。
在进一步的实施例中,所述协同训练方法还包括:步骤4-S607,接收所述控制节点广播的权值数据。
在图4-4所示的实施例中,当控制节点收到传输回来的归并的权值梯度数据后,更新权值数据,并将更新的权值数据广播给所有计算节点,同时在信息中标记标签,表示该更新的权值数据的代数。
在进一步的实施例中,所述协同训练方法还包括:步骤4-S608,保存所述权值数据。
如图4-4所示,每个计算节点在收到更新的权值数据后将其保存,更新本地的权值数据,在下次训练的时候,使用该更新的权值数据进行训练,同时训练得到的权值梯度数据使用更新的权值数据附带的标签标记。
在进一步的实施例中,所述协同训练方法还包括:步骤4-S609,在存在接收所述权值数据的第三计算节点的情况下,在接收所述权值数据的过程中,将所述权值数据发送至所述第三计算节点。
如图4-4所示，每个计算节点在接收到更新的权值数据的过程中，在存在接收权值数据的下一个计算节点的情况下，将该权值数据发送给下一个计算节点。如图4-4所示，计算节点1将权值数据发送给计算节点2，计算节点2将权值数据发送给计算节点3，……计算节点7将权值数据发送给计算节点8。计算节点在接收和发送权值数据时，可以采用接收完成再发送的方式。并且，在一个优选的实施例中，计算节点在接收和发送权值数据时，也可以采用边接收边发送的方式，即接收一部分数据就发送一部分，而不是待接收完成后再发送。
在进一步的实施例中,所述协同训练方法还包括:步骤4-S610,发送获取所述第一权值梯度数据的时间戳。
每个计算节点在传递本地权值梯度数据的时候，将生成该权值梯度数据的时间附带在数据内，传递回控制节点。控制节点依据各个计算节点传递回的时间戳，对计算节点的分组进行动态调整。例如，在图4-5所示的实施例中，控制节点将计算节点3和计算节点5的位置进行交换，将计算节点3和计算节点5的分组进行了调整。
根据上述协同训练的方法，符合获取梯度更新数据信号的要求的计算节点将本地的权值梯度数据与来自另外的计算节点的权值梯度数据进行加和，在加和的过程中，发送加和的结果，即一边计算一边发送计算结果，而不是待计算完成后再发送计算结果；不符合获取梯度更新数据信号的要求的计算节点在接收其他计算节点的权值梯度数据的过程中发送所接收的权值梯度数据，在接收过程中发送数据，即一边接收数据一边发送数据，而不是待接收完成后再发送；从而，边计算边发送以及边接收边发送，能够有效减少通信时间；并且，在训练的过程中，对多个计算节点进行分组，从而在多个计算节点算力不匹配的时候，可以只同步部分计算节点，从而减少了不同计算节点之间的等待开销，提高运算效率。
根据另一个实施例，本申请还提供一种协同训练的装置。如图4-7A至图4-7I所示，该协同训练的装置包括：获取单元4-701，用于获取第一权值梯度数据。
在图4-4所示的实施例中,每个计算节点在训练后获得本地获取的权值梯度数据。
第一发送单元4-702,用于在存在来自所述多个计算节点中的第二计算节点的第二权值梯度数据的情况下,在将来自所述第二计算节点的所述第二权值梯度数据与所述第一权值梯度数据进行加和运算得到更新的权值梯度数据的过程中,发送所述更新的权值梯度数据。
在图4-4所示的实施例中,对计算节点2来说,存在来自计算节点1的权值梯度数据1,将来自计算节点1的权值梯度数据1与本地获取的权值梯度数据2进行加和得到更新的权值梯度数据,并将该更新的权值梯度数据发送至计算节点3。
计算节点2将来自计算节点1的权值梯度数据1与本地获取的权值梯度数据2进行加和并将加和结果发送至计算节点3的过程,是一边进行加和处理一边发送加和结果,即计算得到一部分加和结果就发送一部分,不是待计算完成后再发送计算结果,从而能够有效减少通信时间。
在进一步的实施例中,所述协同训练装置还包括:
第二发送单元4-703,用于在不存在来自所述第二计算节点的权值梯度数据的情况下,发送所述第一权值梯度数据。
在图4-4所示的实施例中,对计算节点1来说,由于不存在来自其他计算节点的权值梯度数据,其只需将所获取的权值梯度数据1发送至计算节点2。
在进一步的实施例中,所述协同训练装置还包括:
第一接收单元4-704,用于接收获取梯度更新数据信号。
在图4-4所示的实施例中,控制节点向所有计算节点发送获取梯度更新数据信号。
在一个实施例中，该获取梯度更新数据信号可以包括需要相关计算节点的权值梯度数据的计算节点标识。比如，控制节点希望获取计算节点1、计算节点2、计算节点4和计算节点5的权值梯度数据。那么，每个计算节点在收到获取梯度更新数据信号后，确认自己是否满足获取梯度更新数据信号的条件。
在另一个实施例中，该获取梯度更新数据信号还可以包括更新的权值梯度数据的代数的标识。计算节点将更新的权值梯度数据的代数与本地的权值梯度数据所标识的代数进行比较，如果二者的差值符合预期，计算节点将本地的权值梯度数据合并到此次训练传输中。比如，更新的权值梯度数据的代数标识为8，而代数差值预定为3，本地的权值梯度数据所标识的代数为5，代数差值符合预期，那么，计算节点就将本地的权值梯度数据合并到此次训练传输中。
在又一个实施例中,该获取梯度更新数据信号可以包括需要相关计算节点的权值梯度数据的计算节点标识和更新的权值梯度数据的代数的标识。这样,计算节点在同时满足需要相关计算节点的权值梯度数据的计算节点标识和更新的权值梯度数据的代数的标识的情况下,才需要将本地的权值梯度数据合并到此次训练传输中。
控制节点发送获取梯度更新数据信号后，各个计算节点在判断自身是否符合该获取梯度更新数据信号的要求的过程中自动形成了分组，从而在多个计算节点算力不匹配的时候，可以只同步部分计算节点，减少了不同计算节点之间的等待开销，提高运算效率。
在进一步的实施例中,所述协同训练装置还包括:执行单元4-705,用于在符合所述获取梯度更新数据信号的要求的情况下,执行第一发送单元4-702或第二发送单元4-703的步骤。
在图4-4所示的实施例中，计算节点1、计算节点2、计算节点4和计算节点5符合所述获取梯度更新数据信号的要求，需要将本地的权值梯度数据合并到此次训练传输中，而将本地的权值梯度数据合并到此次训练传输中的方式通过第一发送单元4-702或第二发送单元4-703来实现。
在进一步的实施例中，所述协同训练装置还包括：第三发送单元4-706，用于在不符合所述获取梯度更新数据信号的要求且存在来自所述第二计算节点的所述第二权值梯度数据的情况下，在接收所述第二权值梯度数据的过程中，发送所述第二权值梯度数据。
在图4-4所示的实施例中,计算节点3不符合所述获取梯度更新数据信号的要求并存在来自计算节点2的权值梯度数据,那么计算节点3只需将所接收的来自计算节点2的权值梯度数据发送出去(直传)。
在一个实施例中,计算节点3将所接收的来自计算节点2的权值梯度数据发送出去的过程,是一边接收数据一边发送数据,即接收一部分数据就发送一部分,而不是待接收完成后再发送。所以,能够有效减少通信时间。
在进一步的实施例中,所述协同训练装置还包括:第二接收单元4-707,用于接收所述控制节点广播的权值数据。
在图4-4所示的实施例中,当控制节点收到传输回来的归并的权值梯度数据后,更新权值数据,并将更新的权值数据广播给所有计算节点,同时在信息中标记标签,表示该更新的权值数据的代数。
在进一步的实施例中,所述协同训练装置还包括:保存单元4-708,用于保存所述权值数据。
如图4-4所示,每个计算节点在收到更新的权值数据后将其保存,更新本地的权值数据,在下次训练的时候,使用该更新的权值数据进行训练,同时训练得到的权值梯度数据使用更新的权值数据附带的标签标记。
在进一步的实施例中,所述协同训练装置还包括:第四发送单元4-709,用于在存在接收所述权值数据的第三计算节点的情况下,在接收所述权值数据的过程中,将所述权值数据发送至所述第三计算节点。
如图4-4所示，每个计算节点在接收到更新的权值数据的过程中，在存在接收权值数据的下一个计算节点的情况下，将该权值数据发送给下一个计算节点。如图4-4所示，计算节点1将权值数据发送给计算节点2，计算节点2将权值数据发送给计算节点3，……计算节点7将权值数据发送给计算节点8。计算节点在接收和发送权值数据时，可以采用接收完成再发送的方式。并且，在一个优选的实施例中，计算节点在接收和发送权值数据时，也可以采用边接收边发送的方式，即接收一部分数据就发送一部分，而不是待接收完成后再发送。
在进一步的实施例中,所述协同训练装置还包括:第五发送单元4-710,用于发送获取所述第一权值梯度数据的时间戳。
每个计算节点在传递本地权值梯度数据的时候，将生成该权值梯度数据的时间附带在数据内，传递回控制节点。控制节点依据各个计算节点传递回的时间戳，对计算节点的分组进行动态调整。例如，在图4-5所示的实施例中，控制节点将计算节点3和计算节点5的位置进行交换，将计算节点3和计算节点5的分组进行了调整。
根据上述协同训练的装置，符合获取梯度更新数据信号的要求的计算节点将本地的权值梯度数据与来自另外的计算节点的权值梯度数据进行加和，在加和的过程中，发送加和的结果，即一边计算一边发送计算结果，而不是待计算完成后再发送计算结果；不符合获取梯度更新数据信号的要求的计算节点在接收其他计算节点的权值梯度数据的过程中发送所接收的权值梯度数据，在接收过程中发送数据，即一边接收数据一边发送数据，而不是待接收完成后再发送；从而，边计算边发送以及边接收边发送，能够有效减少通信时间；并且，在训练的过程中，对多个计算节点进行分组，从而在多个计算节点算力不匹配的时候，可以只同步部分计算节点，从而减少了不同计算节点之间的等待开销，提高运算效率。
参阅图4-8,图4-8提供一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如图4-6A至图4-6I所示的方法以及细化方案。
应该理解,上述的装置实施例仅是示意性的,本披露的装置还可通过其它的方式实现。例如,上述实施例中所述单元/模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。例如,多个单元、模块或组件可以结合,或者可以集成到另一个系统,或一些特征可以忽略或不执行。
另外,若无特别说明,在本披露各个实施例中的各功能单元/模块可以集成在一个单元/模块中,也可以是各个单元/模块单独物理存在,也可以两个或两个以上单元/模块集成在一起。上述集成的单元/模块既可以采用硬件的形式实现,也可以采用软件程序模块的形式实现。
所述集成的单元/模块如果以硬件的形式实现时,该硬件可以是数字电路,模拟电路等等。硬件结构的物理实现包括但不局限于晶体管,忆阻器等等。若无特别说明,所述处理器或芯片可以是任何适当的硬件处理器,比如CPU、GPU、FPGA、DSP和ASIC等等。若无特别说明,所述片上缓存、片外内存、存储器可以是任何适当的磁存储介质或者磁光存储介质,比如,阻变式存储器RRAM(Resistive Random Access Memory)、动态随机存取存储器DRAM(Dynamic Random Access Memory)、静态随机存取存储器SRAM(Static Random-Access Memory)、增强动态随机存取存储器EDRAM(Enhanced Dynamic Random Access Memory)、高带宽内存HBM(High-Bandwidth Memory)、混合存储立方HMC(Hybrid Memory Cube)等等。
所述集成的单元/模块如果以软件程序模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,本披露的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本披露各个实施例所述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
本申请实施例还提供一种计算机可读存储介质,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如图4-6A至图4-6I所示的方法以及细化方案。
本申请实施例还提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如图4-6A至图4-6I所示的方法以及细化方案。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。上述实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
依据以下条款可更好地理解前述内容:
条款D1,一种协同训练的方法,所述方法应用于包括多个节点的人工智能处理器,所述多个节点包括控制节点以及多个计算节点,对于所述多个计算节点中的任一计算节点,所述方法包括如下步骤:获取第一权值梯度数据;在存在来自所述多个计算节点中的第二计算节点的第二权值梯度数据的情况下,在将来自所述第二计算节点的所述第二权值梯度数据与所述第一权值梯度数据进行加和运算得到更新的权值梯度数据的过程中,发送所述更新的权值梯度数据。
条款D2,如条款D1所述的方法,还包括:在不存在来自所述第二计算节点的权值梯度数据的情况下,发送所述第一权值梯度数据。
条款D3,如条款D2所述的方法,还包括:接收获取梯度更新数据信号;在符合所述获取梯度更新数据信号的要求的情况下,执行如下步骤中的一者:在存在来自所述多个计算节点中的第二计算节点的第二权值梯度数据的情况下,在将来自所述第二计算节点的所述第二权值梯度数据与所述第一权值梯度数据进行加和运算得到更新的权值梯度数据的过程中,发送所述更新的权值梯度数据;或者在不存在来自所述第二计算节点的权值梯度数据的情况下,发送所述第一权值梯度数据。
条款D4,如条款D3所述的方法,还包括:在不符合所述获取梯度更新数据信号的要求且存在来自所述第二计算节点的所述第二权值梯度数据的情况下,在接收所述第二权值梯度数据的过程中,发送所述第二权值梯度数据。
条款D5,如条款D3或D4所述的方法,其中,所述获取梯度更新数据信号包括需要相关计算节点的权值梯度数据的计算节点标识以及/或者更新的权值梯度数据的代数的标识。
条款D6,如条款D5所述的方法,其中,所述获取梯度更新数据信号的要求包括:属于所述计算节点标识指示的计算节点;以及/或者所述第一权值梯度数据的代数与所述更新的权值梯度数据的代数之间的差值满足预设值。
条款D7,如条款D1至D6任意一者所述的方法,还包括:接收所述控制节点广播的权值数据;保存所述权值数据,其中,所述权值数据用于训练;在存在接收所述权值数据的第三计算节点的情况下,在接收所述权值数据的过程中,将所述权值数据发送至所述第三计算节点。
条款D8,如条款D1至D7任意一者所述的方法,还包括:发送获取所述第一权值梯度数据的时间戳,其中,所述时间戳用于将所述多个计算节点进行动态分组。
条款D9,如条款D1至D8任意一者所述的方法,其中,所述控制节点包括参数服务节点。
条款D10,如条款D1至D9任意一者所述的方法,其中,所述多个节点形成的拓扑结构包括环状、网状、树状,或者其他包括环状的结构。
条款D11,如条款D1至D10任意一者所述的方法,其中,所述节点包括神经网络芯片或者所述神经网络芯片中的计算核。
条款D12,一种协同训练的装置,所述装置应用于包括多个节点的人工智能处理器,所述多个节点包括控制节点以及多个计算节点,对于所述多个计算节点中的任一计算节点,所述装置包括:获取单元,用于获取第一权值梯度数据;第一发送单元,用于在存在来自所述多个计算节点中的第二计算节点的第二权值梯度数据的情况下,在将来自所述第二计算节点的所述第二权值梯度数据与所述第一权值梯度数据进行加和运算得到更新的权值梯度数据的过程中,发送所述更新的权值梯度数据。
条款D13,如条款D12所述的装置,还包括:第二发送单元,用于在不存在来自所述第二计算节点的权值梯度数据的情况下,发送所述第一权值梯度数据。
条款D14,如条款D13所述的装置,还包括:第一接收单元,用于接收获取梯度更新数据信号;执行单元,用于在符合所述获取梯度更新数据信号的要求的情况下,执行如下中的一者:在存在来自所述多个计算节点中的第二计算节点的第二权值梯度数据的情况下,在将来自所述第二计算节点的所述第二权值梯度数据与所述第一权值梯度数据进行加和运算得到更新的权值梯度数据的过程中,发送所述更新的权值梯度数据;或者在不存在来自所述第二计算节点的权值梯度数据的情况下,发送所述第一权值梯度数据。
条款D15,如条款D14所述的装置,还包括:第三发送单元,用于在不符合所述获取梯度更新数据信号的要求且存在来自所述第二计算节点的所述第二权值梯度数据的情况下,在接收所述第二权值梯度数据的过程中,发送所述第二权值梯度数据。
条款D16,如条款D14或D15所述的装置,其中,所述获取梯度更新数据信号包括需要相关计算节点的权值梯度数据的计算节点标识以及/或者更新的权值梯度数据的代数的标识。
条款D17,如条款D16所述的装置,其中,所述获取梯度更新数据信号的要求包括:属于所述计算节点标识指示的计算节点;以及/或者所述第一权值梯度数据的代数与所述更新的权值梯度数据的代数之间的差值满足预设值。
条款D18,如条款D12至D17任意一者所述的装置,还包括:第二接收单元,用于接收所述控制节点广播的权值数据;保存单元,用于保存所述权值数据,其中,所述权值数据用于训练;第四发送单元,用于在存在接收所述权值数据的第三计算节点的情况下,在接收所述权值数据的过程中,将所述权值数据发送至所述第三计算节点。
条款D19,如条款D12至D18任意一者所述的装置,还包括:第五发送单元,用于发送获取所述第一权值梯度数据的时间戳,其中,所述时间戳用于将所述多个计算节点进行动态分组。
条款D20,如条款D12至D19任意一者所述的装置,其中,所述控制节点包括参数服务节点。
条款D21,如条款D12至D20任意一者所述的装置,其中,所述多个节点形成的拓扑结构包括环状、网状、树状,或者其他包括环状的结构。
条款D22,如条款D12至D21任意一者所述的装置,其中,所述节点包括神经网络芯片或者所述神经网络芯片中的计算核。
条款D23,一种电子设备,其特征在于,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如条款D1-D11任一所述的方法。
条款D24,一种计算机可读存储介质,其特征在于,其存储用于电子数据交换的计算机程序,其中,所述计算机程序使得计算机执行如条款D1-D11任一项所述的方法。
条款D25,一种计算机程序产品,其特征在于,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序可操作来使计算机执行如条款D1-D11任一项所述的方法。
以上对本披露实施例进行了详细介绍，本文中应用了具体个例对本披露的原理及实施方式进行了阐述，以上实施例的说明仅用于帮助理解本披露的方法及其核心思想。同时，本领域技术人员依据本披露的思想，基于本披露的具体实施方式及应用范围上做出的改变或变形之处，都属于本披露保护的范围。综上所述，本说明书内容不应理解为对本披露的限制。

Claims (24)

  1. 一种芯片,包括数据总线以及与所述数据总线连接的存储器、数据接收器、运算处理单元、数据发送器,其中,
    所述数据接收器配置为接收来自外部的第一数据和头信息,将所述第一数据通过所述数据总线写入到所述存储器的对应区域,以及根据所述头信息配置对应的运算处理单元和/或数据发送器;
    所述运算处理单元配置为接收第一任务信息,根据所述第一任务信息执行运算处理并对所述数据发送器执行配置操作;
    所述数据发送器配置为获取第二任务信息以及第二数据,并基于至少部分所述第二数据向外输出第三数据。
  2. 如权利要求1所述的芯片,还包括:
    配置总线,所述运算处理单元、所述数据接收器、所述数据发送器与所述配置总线连接从而通过所述配置总线相互传输配置信息。
  3. 如权利要求1所述的芯片,其中,所述数据接收器还配置为根据所述头信息对所述第一数据进行拆解。
  4. 如权利要求1所述的芯片,其中,所述数据接收器包括:
    第一串行接口;
    数据缓冲器,用于缓存来自所述第一串行接口的所述第一数据;
    解码器,用于从所述头信息解析所述第一数据的格式和存放地址,根据所述第一数据的格式切分所述第一数据,以及根据所述头信息配置所述运算处理单元和所述数据发送器的对应位;
    DMA单元,用于接收来自所述解码器的所述第一数据和所述存放地址,从而将所述第一数据通过所述数据总线写入到所述存储器的对应区域。
  5. 如权利要求4所述的芯片，其中，所述数据接收器还包括：
    解压单元,用于对来自所述解码器的所述第一数据进行解压,并将解压后的第一数据发送给所述DMA单元。
  6. 如权利要求1所述的芯片,所述数据发送器包括发送解码器、数据重排序缓冲器、发送缓冲器和第二串行接口,其中,
    所述发送解码器配置为：将所述第二任务信息打包为第二头信息并将所述第二头信息发送至所述发送缓冲器，以及根据所述第二任务信息向所述数据重排序缓冲器发送数据读取请求信息；
    所述数据重排序缓冲器配置为根据所述数据读取请求信息通过所述数据总线获取并发送所述第二数据,所述第二数据包括至少部分所述第一数据和/或所述运算处理结果;
    所述发送缓冲器配置为对接收的数据进行缓存,并按照所述第二串行接口的格式发送缓存的数据。
  7. 如权利要求6所述的芯片,其中,
    所述发送缓冲器配置为接收所述第二头信息以及接收并缓存所述第二数据,以及按照所述第二串行接口的格式发送所述第三数据,所述第三数据包括所述第二数据;
    第二串行接口配置为接收并发送所述第三数据。
  8. 如权利要求6所述的芯片,其中所述数据发送器还包括算术逻辑单元,
    其中,所述算术逻辑单元配置为对至少部分所述第二数据进行运算,并将所得到的运算结果和/或所述第二数据的部分或全部作为第四数据发送给所述发送缓冲器;
    其中,所述发送缓冲器配置为接收所述第二头信息以及接收并缓存来自所述算术逻辑单元的所述第四数据,以及按照所述第二串行接口的格式发送所述第三数据,所述第三数据包括所述第四数据;
    第二串行接口配置为接收并发送所述第三数据。
  9. 如权利要求6所述的芯片,其中所述数据发送器还包括压缩单元,
    其中,所述压缩单元配置为将所述第二数据压缩为第四数据并发送给所述发送缓冲器;
    其中,所述发送缓冲器配置为接收所述第二头信息以及接收并缓存来自所述压缩单元的第四数据,按照所述第二串行接口的格式发送所述第三数据,所述第三数据包括所述第四数据;
    其中,所述第二串行接口配置为接收并发送所述第三数据。
  10. 如权利要求1所述的芯片,其中还包括设置在所述数据总线与所述运算处理单元或所述数据发送器之间的归并模块,所述归并模块包括归并模式单元、任务预取单元和任务发送单元,
    其中,所述归并模式单元接收并存储其他运算处理单元和/或数据发送器的执行信息;
    其中,所述任务预取单元配置为根据软件配置的寄存器信息从所述存储器获取所述第一任务信息,根据所述第一任务信息对所述执行信息进行处理并根据处理结果确定并发送配置信息和/或所述第二任务信息;
    其中,所述任务发送单元配置为从所述任务预取单元接收所述第二任务信息并发送给其他运算处理单元和/或者数据发送器。
  11. 如权利要求10所述的芯片,其中所述任务预取单元还配置为根据所述第一任务信息将相应任务拆解为多个传输子任务,并根据所述执行信息发送多个传输子任务的所述第二任务信息给所述任务发送单元。
  12. 如权利要求10所述的芯片,其中所述任务发送单元还配置为监听所述运算处理单元或所述数据发送器的状态,并根据所述运算处理单元或所述数据发送器的执行结束状态向其他运算处理单元和/或数据发送器发送配置信息。
  13. 如权利要求1所述的芯片,其中,所述数据总线包括NOC。
  14. 如权利要求1所述的芯片,其中,所述芯片为人工智能芯片,所述运算处理单元为人工智能处理单元或机器学习处理单元。
  15. 如权利要求1所述的芯片,其中,所述数据接收器、所述数据发送器和所述运算处理单元通过所述数据总线相互传输数据以及访问所述存储器。
  16. 如权利要求2所述的芯片,其中,
    所述数据接收器、所述数据发送器和所述运算处理单元通过所述数据总线相互传输数据以及访问所述存储器;
    所述运算处理单元、所述数据接收器、所述数据发送器通过所述配置总线相互传输配置信息。
  17. 一种多芯片系统,包括多个如权利要求1-16中任一项所述的芯片。
  18. 如权利要求17所述的多芯片系统,其中所述多个芯片配置为包括环状、网状、树状结构中的至少一种的布局结构。
  19. 如权利要求18所述的多芯片系统,其中所述多个芯片构建为环形连接结构。
  20. 一种电子设备,包括如权利要求1-16中任一项所述的芯片或如权利要求17-19中任一项所述的多芯片系统。
  21. 一种用于计算节点传输数据的方法,包括:
    开始接收第一数据;
    在接收到所述第一数据的一部分之后,在继续接收所述第一数据的同时,转发所述第一数据的所述一部分;和/或
    在接收到所述第一数据的一部分之后,在继续接收所述第一数据的同时,对所述第一数据的所述一部分进行处理并转发处理结果。
  22. 一种数据传输方法,包括:利用如权利要求1-16中任一项所述的芯片执行如权利要求21所述的用于计算节点传输数据的方法。
  23. 一种数据传输方法,用于包括多个计算节点的系统,其中所述多个计算节点中的至少部分节点执行如权利要求21或22所述的方法。
  24. 如权利要求23所述的数据传输方法,其中所述多个计算节点构建为环形连接结构。
PCT/CN2020/095205 2019-08-31 2020-06-09 数据传输方法及相关设备 WO2021036404A1 (zh)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN201910819947.8 2019-08-31
CN201910819940.6 2019-08-31
CN201910819939.3 2019-08-31
CN201910819939.3A CN112446463B (zh) 2019-08-31 2019-08-31 一种神经网络全连接层运算方法、装置以及相关产品
CN201910819947.8A CN112446485B (zh) 2019-08-31 2019-08-31 一种神经网络协同训练方法、装置以及相关产品
CN201910819946.3 2019-08-31
CN201910819946.3A CN112446474B (zh) 2019-08-31 2019-08-31 芯片和多芯片系统及电子设备和数据传输方法
CN201910819940.6A CN112446464B (zh) 2019-08-31 2019-08-31 一种神经网络卷积运算方法、装置以及相关产品

Publications (1)

Publication Number Publication Date
WO2021036404A1 true WO2021036404A1 (zh) 2021-03-04

Family

ID=74684074

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/095205 WO2021036404A1 (zh) 2019-08-31 2020-06-09 数据传输方法及相关设备

Country Status (1)

Country Link
WO (1) WO2021036404A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1881932A (zh) * 2005-06-15 2006-12-20 华为技术有限公司 Spⅰ4ⅱ接口远距离传输的实现方法及装置
CN102279763A (zh) * 2011-08-30 2011-12-14 福州瑞芯微电子有限公司 一种bootrom的优化方法
CN102799561A (zh) * 2012-06-18 2012-11-28 龙芯中科技术有限公司 嵌入式可重构数据处理方法、装置及系统
CN108617009A (zh) * 2016-12-13 2018-10-02 中国移动通信有限公司研究院 一种数据传输方法、装置、系统及分组数据网网关
CN110072257A (zh) * 2019-03-07 2019-07-30 武汉星耀科技有限公司 一种mec下用户互通的方法

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI774295B (zh) * 2021-03-29 2022-08-11 瑞昱半導體股份有限公司 用於跨場域可編程邏輯閘陣列之資料傳輸控制的方法及相關設備

Similar Documents

Publication Publication Date Title
US9294097B1 (en) Device array topology configuration and source code partitioning for device arrays
US8930593B2 (en) Method for setting parameters and determining latency in a chained device system
WO2020078470A1 (zh) 片上网络数据处理方法及装置
US8065503B2 (en) Iteratively processing data segments by concurrently transmitting to, processing by, and receiving from partnered process
CN106503791A (zh) 用于有效神经网络部署的系统和方法
CN104821887A (zh) 通过使用具有不同延迟的存储器来进行分组处理的设备和方法
CN110308984B (zh) 一种用于处理地理分布式数据的跨集群计算系统
US11789733B2 (en) Instruction processing apparatus, acceleration unit, and server
US10601723B2 (en) Bandwidth matched scheduler
US20230132724A1 (en) Broadcast adapters in a network-on-chip
WO2021036404A1 (zh) 数据传输方法及相关设备
US10534737B2 (en) Accelerating distributed stream processing
CN114399035A (zh) 搬运数据的方法、直接存储器访问装置以及计算机系统
US8589584B2 (en) Pipelining protocols in misaligned buffer cases
Sun et al. Multi-node acceleration for large-scale GCNs
WO2021213075A1 (zh) 一种基于多处理节点来进行节点间通信的方法和设备
WO2021213076A1 (zh) 基于多处理节点来构建通信拓扑结构的方法和设备
CN112995245B (zh) 一种基于fpga的可配置负载均衡系统与方法
WO2021037261A1 (zh) 芯片和多芯片系统及电子设备和数据传输方法
WO2023151216A1 (zh) 图数据处理的方法和芯片
WO2024077999A1 (zh) 集合通信方法及计算集群
CN114844757B (zh) 一种面向分布式并行运算类算法的片上网络设计方法
WO2023093065A1 (zh) 数据传输方法、计算设备及计算系统
CN114489496B (zh) 基于fpga人工智能加速器的数据存储和传输方法
US20230259486A1 (en) Neural processing unit synchronization systems and methods

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20857676

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20857676

Country of ref document: EP

Kind code of ref document: A1