WO2020042182A1 - Data processing system and data processing method - Google Patents

Data processing system and data processing method

Info

Publication number
WO2020042182A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
aggregation
computing node
processing system
data processing
Prior art date
Application number
PCT/CN2018/103669
Other languages
English (en)
French (fr)
Inventor
戴明扬
林嘉树
程传宁
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP18931858.7A priority Critical patent/EP3819788A4/en
Priority to CN201880091518.7A priority patent/CN111886593A/zh
Priority to PCT/CN2018/103669 priority patent/WO2020042182A1/zh
Publication of WO2020042182A1 publication Critical patent/WO2020042182A1/zh
Priority to US17/173,691 priority patent/US20210166156A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/20Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17331Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present application relates to the field of artificial intelligence computing, and in particular, to a data processing system and a data processing method.
  • Artificial intelligence (AI) is a method of simulating human intelligence by means of a computer, and it has broad application prospects in fields such as speech recognition, image processing, and complex games.
  • In the process of training a deep neural network, an aggregation operation needs to be performed on multiple pieces of data; for example, an addition operation is performed on the data generated by two AI computing nodes.
  • When performing the aggregation operation, one AI computing node (for example, AI computing node 1) needs to read data 0 from another AI computing node (for example, AI computing node 0) and write data 0 into a buffer of AI computing node 1. AI computing node 1 then reads data 1 from its own memory, sends data 1 to the AI processor, and sends data 0 from the buffer to the AI processor. After the aggregation operation on data 0 and data 1 is completed, the aggregation operation result is written into the memory of AI computing node 1.
  • In addition, AI calculation and the aggregation operation are performed on the same processor in a time-sharing manner, so the computation efficiency is low. How to improve the efficiency of the aggregation operation therefore becomes a problem.
  • the present application provides a data processing system and a data processing method, which can improve the aggregation operation efficiency.
  • a data processing system includes a first computing node.
  • the first computing node includes an AI processor and an aggregation operator.
  • the AI processor is configured to perform AI operations to generate first data of the first computing node.
  • The aggregation operator is configured to perform an aggregation operation on the second data from the second computing node and the first data to generate an aggregation operation result.
  • the data processing system provided by the present application can improve the efficiency of the aggregation operation.
  • the aggregation operator includes: an aggregation operation engine, configured to perform an aggregation operation on the first data and the second data to generate an aggregation operation result.
  • The aggregation operator further includes a memory access engine, configured to: obtain the second data from the second memory module of the second computing node; obtain the first data from the first memory module of the first computing node; send the first data and the second data to the aggregation operation engine; and write the aggregation operation result into the first memory module.
  • This solution has the following beneficial effects: the number of reads and writes to the memory module of the first computing node during the aggregation operation is reduced, the number of scheduling operations is reduced, and the impact of the aggregation operation on the cache of the AI processor is avoided, so that the aggregation operation and the AI operation can be performed in parallel, thereby improving the training efficiency of deep neural networks.
  • The memory access engine is specifically configured to: receive an aggregation operation instruction; and, according to the aggregation operation instruction, obtain the first data from the first memory module, obtain the second data from the second memory module, and send the first data and the second data to the aggregation operation engine.
  • In this solution, the memory access engine can be controlled by software-level instructions.
  • In addition, the above scheme prevents data that does not need to be aggregated from being sent to the aggregation operation engine, which improves the efficiency of data movement.
  • The memory access engine is further configured to: generate an atomic command including at least one of a read command or a write command, where the read command is used to instruct the memory controller to read the first data from the first memory module and send the first data to the aggregation operation engine, and the write command is used to instruct the memory controller to write the aggregation operation result into the first memory module; and send the atomic command to the memory controller of the second memory module.
  • the operation corresponding to an atomic command is an atomic operation.
  • An atomic operation refers to an operation that will not be interrupted by the thread scheduling mechanism. Once this operation starts, it will run to the end. During the operation, it will not be interrupted by the operation of other threads. In this way, even if the write operation and the read operation conflict with other memory update operations during the aggregation operation, the above-mentioned optional embodiment can ensure that the result of the aggregation operation is not destroyed.
  • In addition, in the above optional embodiment, the commands for the write operation and the read operation do not need to be transmitted on the bus, which reduces the occupation of bus resources by the aggregation operation.
  • the memory access engine is a direct memory access (DMA) engine or a remote direct memory access (RDMA) engine.
  • DMA direct memory access
  • RDMA remote direct memory access
  • the aggregation operator further includes a converter for performing a data format conversion process on the result of the aggregation operation. Since the data type conversion processing does not need to be performed in the AI processor, the above scheme can enable the AI processor to focus on AI computing and improve the training efficiency of deep neural networks.
  • the first computing node further includes a first memory module, and the first memory module is configured to store the first data.
  • the data processing system further includes a second computing node.
  • the first computing node and the second computing node are located in different devices.
  • the aggregation operator includes at least two operation channels, and the at least two operation channels are used to perform the aggregation operation in parallel. Therefore, each channel processes a complete aggregation operation pipeline, and multiple aggregation operation pipelines run concurrently, thereby improving the training performance of the entire deep neural network.
  • The present application further provides a data processing method, which includes: performing, by an AI processor in a first computing node in the data processing system, an AI operation to generate first data of the first computing node; and performing, by an aggregation operator in the first computing node, an aggregation operation on the first data and second data from a second computing node in the data processing system to generate an aggregation operation result.
  • the method further includes: using a memory access engine in the aggregation operator to obtain the second data from the second memory module in the second computing node.
  • the method further includes: using a converter in the aggregation operator to perform a data format conversion process on the aggregation operation result. Since the data type conversion processing does not need to be performed in the AI processor, the above scheme can enable the AI processor to focus on AI computing and improve the training efficiency of deep neural networks.
  • Performing, by the aggregation operator in the first computing node, the aggregation operation on the first data and the second data from the second computing node in the data processing system to generate the aggregation operation result includes: performing, by at least two operation channels in the aggregation operator, a multi-channel parallel aggregation operation on the first data and the second data. Since the aggregation operator can simultaneously process data generated by at least two rings, the above scheme can improve the training efficiency of deep neural networks.
  • the above method has the following beneficial effects: reducing the number of reads and writes to the memory module of the first computing node in the aggregation operation, reducing the number of scheduling, avoiding the effect of the aggregation operation on the cache of the AI processor, and enabling the aggregation operation and the AI operation to be performed in parallel. This improves the training efficiency of deep neural networks.
  • A computer-readable storage medium is further provided, where the computer-readable storage medium stores computer program code. When the computer program code is executed by a processing unit or a processor, the method described in the second aspect can be implemented.
  • the present application provides a computer program product.
  • the computer program product includes: computer program code that implements the method described in the second aspect when the computer program code is run by a processing unit or processor.
  • the computer program product may be installed in the data processing system described in the first aspect, so that the data processing system implements the functions described in the first aspect.
  • FIG. 1 is a schematic diagram of a ring applicable to the present application
  • FIG. 2 is a schematic diagram of an initial state of each ring computing node performing a ring aggregation algorithm
  • FIG. 3 is a schematic diagram of a step of the ring aggregation algorithm
  • FIG. 4 is a schematic diagram of another step of the ring aggregation algorithm
  • FIG. 5 is a schematic diagram of the end state of each ring computing node executing the ring aggregation algorithm
  • FIG. 6 is a schematic diagram of a data processing system provided by the present application.
  • FIG. 7 is a schematic diagram of another data processing system provided by the present application.
  • FIG. 8 is a schematic diagram of an aggregation operation performed by an aggregation operation engine provided in the present application.
  • FIG. 9 is a schematic diagram of a data movement operation performed by a memory access engine provided in this application.
  • FIG. 10 is another schematic diagram of a data movement operation performed by a memory access engine provided in the present application.
  • FIG. 11 is a schematic diagram of a data format conversion operation performed by a converter provided in the present application.
  • FIG. 12 is a schematic diagram of another data processing system provided by the present application.
  • FIG. 13 is a schematic diagram of a data processing method provided by the present application.
  • To improve the training efficiency of deep neural networks, one method is to train using a parallel distributed training algorithm.
  • The procedure of the parallel distributed training algorithm is as follows (a minimal per-node sketch of this loop is given after the steps):
  • 1. Each computing node in the cluster independently completes the computation of its own mini-batch of training data to obtain the gradient;
  • 2. All computing nodes in the cluster aggregate the computed gradients to form the aggregated gradient;
  • 3. The aggregated gradient is distributed to each computing node in the cluster;
  • 4. Each computing node computes new parameter values based on the aggregated gradient, combined with hyper-parameters such as the learning rate;
  • 5. All computing nodes can start the next round of iterative computation only after obtaining the new parameters.
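  • The following is a minimal, illustrative per-node sketch of the training loop above; it is not taken from the patent, and compute_gradients, all_reduce, and the node object are hypothetical placeholders.

```python
# Illustrative sketch (assumptions: `node` exposes hypothetical compute_gradients /
# all_reduce helpers and a cluster_size attribute). One iteration of the parallel
# distributed training procedure described above, as seen by one computing node.
def training_iteration(node, params, mini_batch, learning_rate):
    # Step 1: each node independently computes gradients on its own mini-batch.
    local_grads = node.compute_gradients(params, mini_batch)

    # Steps 2-3: all nodes aggregate their gradients, and every node receives
    # the aggregated result (for example via the ring all-reduce described below).
    aggregated_grads = node.all_reduce(local_grads, op="sum")

    # Step 4: compute new parameter values from the aggregated gradient and
    # hyper-parameters such as the learning rate.
    new_params = [p - learning_rate * g / node.cluster_size
                  for p, g in zip(params, aggregated_grads)]

    # Step 5: the next iteration can start only after the new parameters exist.
    return new_params
```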
  • To perform gradient aggregation efficiently, the ring aggregation (ring all-reduce) algorithm is currently commonly used in academia and industry.
  • The logical structure of the ring is shown in FIG. 1.
  • the ring includes 5 AI computing nodes, and each AI computing node is, for example, an AI chip.
  • Each AI computing node has a predecessor node and a successor node, and the position of each AI computing node in the ring is determined by the creator of the ring (for example, user software).
  • For example, the predecessor node of AI computing node 0 is AI computing node 4, and the successor node of AI computing node 0 is AI computing node 1.
  • Each AI computing node can receive data from its predecessor node and can also send its own data to its successor node.
  • Multiple compute nodes are located within a system.
  • the system is a cluster of one or more devices.
  • Each computing node may be an apparatus or a device, or multiple computing nodes may be located in one apparatus or device.
  • the device or equipment may be various types of electronic equipment, including but not limited to servers, mainframes, minicomputers, portable computers, or terminals.
  • Each node can be a computing element in a device or device, such as a chip, chipset, or a circuit board that carries the chip or chipset.
  • In the preparation phase of the ring aggregation algorithm, the creator of the ring sends control information to each AI computing node and slices the data; the gradient data calculated by each AI computing node is divided equally into 5 blocks.
  • the gradient data calculated by the five AI computing nodes shown in Figure 1 are a, b, c, d, and e, respectively.
  • Each AI computing node holds the complete data that it has calculated itself.
  • the initial state is shown in Figure 2.
  • Subsequently, each AI computing node enters the scatter-reduce phase: each AI computing node sends one of its data blocks to its successor node, and performs aggregation processing on the data received from its predecessor node together with the data it stores itself.
  • FIG. 3 shows one step of the scatter-reduce phase.
  • In this step, AI computing node 0 sends data block a0 to AI computing node 1. After receiving data block a0, AI computing node 1 performs an aggregation operation on a0 and the data block a1 that it stores itself. At the same time, AI computing node 1 sends data block b1 to AI computing node 2. After receiving data block b1, AI computing node 2 performs an aggregation operation on b1 and the data block b2 that it stores itself. The operations of the other AI computing nodes are similar.
  • FIG. 4 shows another step of the scatter-reduce phase.
  • In this step, taking AI computing node 0 as an example, AI computing node 0 receives the data b4+b3+b2+b1 from its predecessor node (AI computing node 4) and performs an aggregation operation on that data and the data b0 stored by itself; the obtained aggregation operation result is b0+b1+b2+b3+b4. While receiving the data b4+b3+b2+b1, AI computing node 0 sends its stored data c0+c4+c3+c2 to its successor node (AI computing node 1) so that the successor node can perform its gradient aggregation operation.
  • After the scatter-reduce phase is completed, the ring aggregation algorithm proceeds to the next phase, the all-gather phase.
  • In the all-gather phase, the ring shown in FIG. 1 sends the final result obtained by each AI computing node to the other AI computing nodes through four passes.
  • For example, the final result obtained by AI computing node 0 from the aggregation operation on data b is b0+b1+b2+b3+b4; AI computing node 0 passes this result to AI computing node 1, AI computing node 1 passes it to AI computing node 2, and so on.
  • After four passes, every AI computing node has obtained the final result of the aggregation operation on data b.
  • Similarly, for the other four data blocks (a, c, d, and e), after four passes each AI computing node also obtains the final aggregation result for each data block, as shown in FIG. 5. A compact sketch of the whole procedure follows.
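  • As an illustration only (not part of the patent text), the sketch below reproduces the scatter-reduce and all-gather phases of FIGs. 2-5 for N nodes. The send_to_successor and recv_from_predecessor callables are hypothetical communication primitives, each block is assumed to support element-wise "+" (for example a NumPy array), and a real implementation would overlap the send, receive, and aggregation steps rather than run them sequentially.

```python
# Illustrative ring all-reduce sketch for one node of an N-node ring.
def ring_all_reduce(node_id, n_nodes, blocks, send_to_successor, recv_from_predecessor):
    # Scatter-reduce phase: after n_nodes - 1 steps, the block with index
    # (node_id + 1) % n_nodes on this node holds its fully aggregated value.
    for step in range(n_nodes - 1):
        send_idx = (node_id - step) % n_nodes           # e.g. node 0 sends a0 first (FIG. 3)
        recv_idx = (node_id - step - 1) % n_nodes
        send_to_successor(blocks[send_idx])
        received = recv_from_predecessor()
        blocks[recv_idx] = blocks[recv_idx] + received  # the aggregation operation

    # All-gather phase: each fully aggregated block is passed around the ring
    # n_nodes - 1 times so that every node ends up holding every final block.
    for step in range(n_nodes - 1):
        send_idx = (node_id - step + 1) % n_nodes
        recv_idx = (node_id - step) % n_nodes
        send_to_successor(blocks[send_idx])
        blocks[recv_idx] = recv_from_predecessor()

    return blocks
```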
  • FIG. 6 shows a data processing system provided by the present application, which can reduce the number of memory reads and writes in an aggregation operation, thereby improving the training efficiency of a deep neural network.
  • the data processing system 600 includes a first computing node, and the first computing node includes an AI processor 610 and an aggregation operator 620.
  • the AI processor 610 is configured to perform AI operations to generate first data of a first computing node.
  • the aggregation operator 620 is configured to perform an aggregation operation on the second data from the second computing node and the first data to generate an aggregation operation result.
  • the AI processor 610 is, for example, a neural network processor, such as a matrix operation array.
  • the aggregation operator 620 is, for example, an addition operator, a multiplication operator, a maximum operator, or a minimum operator, and may also be another type of device or logic circuit for performing an aggregation operation.
  • the AI processor 610 is a unit dedicated to artificial intelligence computing, also called a neural-network processor (NPU).
  • For example, it can be a convolutional neural network (CNN) calculator, a recurrent neural network (RNN) calculator, or another neural processing unit with a similar function.
  • The aggregation operator 620 may be a general-purpose processor, a digital signal processor (DSP), or a hardware accelerator, for example, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
  • In this application, an aggregation operation refers to an operation performed on at least two pieces of data according to a preset rule, and may be one or more of an addition operation, a subtraction operation, a multiplication operation, a division operation, a maximum-value operation, and a minimum-value operation, or may be another type of operation.
  • the aggregation operator 620 may perform an addition operation on the first data and the second data, and the result obtained is the sum of the two data.
  • the aggregation operator 620 may perform a maximum value operation on the first data and the second data, and the obtained result is the data with a larger value among the two data.
  • the aggregation operator 620 may perform a subtraction operation on the first data and the second data, and then multiply the result of the subtraction operation by the first data or the second data.
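  • The element-wise sketch below (illustrative only; names are not from the patent) shows the kinds of aggregation rules just described applied to two data blocks.

```python
# Illustrative aggregation rules applied element-wise to two data blocks.
def aggregate(first_data, second_data, op="add"):
    if op == "add":                        # sum of the two pieces of data
        return [a + b for a, b in zip(first_data, second_data)]
    if op == "max":                        # the larger value of the two
        return [max(a, b) for a, b in zip(first_data, second_data)]
    if op == "sub_then_mul":               # subtract first, then multiply by the first data
        return [(a - b) * a for a, b in zip(first_data, second_data)]
    raise ValueError(f"unsupported aggregation operation: {op}")

# Example: aggregate([1.0, 5.0], [3.0, 4.0], op="add") returns [4.0, 9.0].
```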
  • the AI processor 610 and the aggregation operator 620 may be two devices that are physically separated, for example, located on two motherboards, respectively.
  • the AI processor 610 and the aggregation operator 620 may be two physically inseparable devices, for example, the two devices are located on a system-on-chip (SOC).
  • AI processor 610 and the aggregation operator 620 are merely examples, and should not be construed as limiting the data processing system provided by the present application.
  • The first data is, for example, the data c1 in the memory module of AI computing node 1 shown in FIG. 4, and the second data is, for example, the data c0+c4+c3+c2 stored in the memory module of AI computing node 0 shown in FIG. 4.
  • The AI processor 610 is, for example, the processor in AI computing node 1 in FIG. 3.
  • When the controller of the data processing system 600, for example a central processing unit (CPU), needs to schedule c0+c4+c3+c2 and c1 to complete an aggregation operation, the aggregation operator 620 can read c0+c4+c3+c2 from the memory module of AI computing node 0 and read c1 from the memory module of AI computing node 1, and then perform the aggregation operation (for example, an addition operation) on them to obtain the aggregation operation result c0+c1+c2+c3+c4.
  • The aggregation operator 620 writes the aggregation operation result into the memory module of AI computing node 1, thereby completing one gradient aggregation operation of the deep neural network.
  • the AI computing node 1 has only experienced one read operation and one write operation.
  • Compared with the aggregation operation method described in the background, the aggregation operation method provided by the above example reduces the consumption of the memory bandwidth resources of AI computing node 1 by the aggregation operation, and the saved memory bandwidth resources can be used for other AI calculations, thereby improving the training efficiency of deep neural networks.
  • Second, the aggregation operator 620 is able to read data both from the memory module of AI computing node 0 and from the memory module of AI computing node 1.
  • Therefore, the data processing system 600 needs only one scheduling operation (that is, inline reduce) to complete one aggregation operation; compared with the aggregation operation apparatus in the prior art, this saves the time required for one copy scheduling and likewise improves the training efficiency of deep neural networks. The two data flows are contrasted in the sketch below.
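  • The sketch below (hypothetical object names; not the patent's implementation) contrasts the memory traffic of the background scheme with the inline-reduce scheme on AI computing node 1.

```python
# Background scheme: extra buffer write, and the reduce competes with AI work.
def background_aggregation(node0, node1, ai_processor):
    data0 = node0.memory.read("data0")            # read from the remote node
    node1.buffer.write(data0)                     # extra write into node 1's buffer
    data1 = node1.memory.read("data1")            # read node 1's own data
    result = ai_processor.reduce(node1.buffer.read(), data1)  # time-shared with AI tasks
    node1.memory.write(result)                    # write the result back

# Inline reduce (data processing system 600): one read per node plus one write,
# performed by the aggregation operator while the AI processor keeps computing.
def inline_aggregation(node0, node1, aggregation_operator):
    data0 = node0.memory.read("data0")            # one read from node 0
    data1 = node1.memory.read("data1")            # one read from node 1
    result = aggregation_operator.reduce(data0, data1)
    node1.memory.write(result)                    # one write to node 1
```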
  • Third, AI calculation (for example, deep neural network training) uses a single instruction multiple thread (SIMT) algorithm architecture; that is, the processor can process only one or more data streams based on one instruction at any given moment, whereas AI calculation and the aggregation operation correspond to two different instruction stream sequences, which means that AI calculation and the aggregation operation in the prior art have to be performed serially.
  • In the data processing system 600 provided by this application, AI calculations and aggregation operations are performed in different modules; therefore, the data processing system 600 can process AI calculation tasks and aggregation operation tasks in parallel, which improves the training efficiency of deep neural networks.
  • Furthermore, in the prior art, AI calculation and the aggregation operation are performed in the same processor.
  • When performing AI calculation, the processor needs to read AI-calculation-related data from memory and write that data into the cache.
  • When performing the aggregation operation, the processor needs to read aggregation-operation-related data from memory and write that data into the cache. If the processor executes AI calculation and the aggregation operation serially, the aggregation-operation-related data stored in the cache pollutes the AI-calculation-related data, so that after completing the aggregation operation the processor has to read the AI-calculation-related data from memory again and write it into the cache; this affects the cache hit rate of the AI calculation and increases the pressure on the cache system, which negatively affects the efficiency of the AI calculation.
  • In the data processing system 600, because the aggregation operation is not executed in the AI processor 610, the data related to the aggregation operation does not enter the AI processor 610, which avoids polluting the AI-calculation-related data in the cache; that is, the cache hit rate of the AI calculation is not affected, the pressure on the cache system is reduced, and the training efficiency of deep neural networks is improved.
  • It should be understood that the above example merely uses a deep neural network as an example to describe the data processing system provided in this application.
  • The data processing system provided in this application is applicable not only to deep neural networks but also to other scenarios in which data aggregation operations need to be performed among multiple computing nodes, for example, in the field of supercomputing.
  • the aggregation operation unit 620 may include an aggregation operation engine (reduce engine) 621. As shown in FIG. 7, the aggregation operation engine 621 is configured to perform an aggregation operation on the first data and the second data to generate an aggregation operation result.
  • The CPU in FIG. 7 is configured to schedule the first computing node and the second computing node to execute tasks, for example, to execute an AI computing task or an aggregation operation task.
  • the CPU is merely an example, and the data processing system 600 may further include other types of controllers or schedulers.
  • FIG. 8 shows a schematic flowchart of an aggregation operation engine 621 provided by the present application for performing aggregation operations.
  • The aggregation operation engine 621 may receive data input from the memory access engine 622 described below, and may also receive data input from rank 1; it performs an aggregation operation on the received data and then writes the aggregation operation result into the HBM.
  • the aggregation operation type supported by the aggregation operation engine 621 is, for example, one or more operations among the above-mentioned addition operation, subtraction operation, multiplication operation, division operation, maximum value operation and minimum value operation.
  • The aggregation operator 620 may further include a memory access engine 622, and the memory access engine 622 is configured to: obtain the first data from the first memory module; obtain the second data from the second memory module; send the first data and the second data to the aggregation operation engine 621; and write the aggregation operation result into the first memory module.
  • the first memory module is, for example, high bandwidth memory (HBM) of the first computing node
  • the second memory module is, for example, HBM of the second computing node.
  • One or more data chunks are stored in the HBM of the first computing node, and the one or more data chunks form rank 1.
  • Similarly, one or more data chunks are stored in the HBM of the second computing node, and the one or more data chunks form rank 0.
  • As shown in FIG. 7, the memory access engine 622 reads data block #0 from rank 0 (that is, the second data, for example c0+c4+c3+c2), reads data block #0 from rank 1 (that is, the first data, for example c1), and sends the two data blocks #0 to the aggregation operation engine 621; after the aggregation operation engine 621 completes the aggregation operation, the aggregation operation result is written into rank 1.
  • The memory access engine 622 moves data in a manner completed entirely by hardware, without the participation of a central processing unit (CPU). Through a mechanism independent of the CPU, data is moved between main memory and a buffer, between main memory and main memory, or between main memory and a peripheral. For example, the memory access engine 622 accepts a move task from software through a descriptor, controls the hardware (for example, a chip circuit) to complete the move operation, and then notifies the software of the move-completion status through a descriptor or an interrupt. Because this scheme does not require CPU participation, it frees up the processing capability of the CPU and achieves high-bandwidth, low-latency data movement.
  • the memory access engine 622 also has a single data stream processing logic. That is, the memory access engine 622 determines whether an aggregation operation needs to be performed on the current data stream according to the instruction type.
  • the instructions come from software, for example the software running by the CPU can generate the instructions.
  • As shown in FIG. 9, when the memory access engine 622 receives an aggregation operation instruction, where the aggregation operation instruction is used to instruct that the aggregation operation be performed on the first data and the second data, the memory access engine 622 sends the first data to the aggregation operation engine 621. When the memory access engine 622 does not receive the aggregation operation instruction, or when the memory access engine 622 receives a move instruction, the memory access engine 622 sends the first data to the HBM of the first computing node. This routing decision is illustrated in the sketch below.
  • the above solution can prevent data that does not need to be aggregated from being sent to the aggregate calculation engine 621, and improves the efficiency of data movement.
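  • A minimal sketch of this routing decision is given below; it is illustrative only, and the instruction and engine objects are hypothetical stand-ins for the hardware described above.

```python
# Illustrative model of the memory access engine's single-data-stream logic.
class MemoryAccessEngineModel:
    def __init__(self, reduce_engine, hbm):
        self.reduce_engine = reduce_engine   # stands in for aggregation operation engine 621
        self.hbm = hbm                       # stands in for the first computing node's HBM

    def handle(self, instruction, data_block):
        if instruction.kind == "aggregate":
            # An aggregation operation instruction was received:
            # the block participates in a reduce.
            self.reduce_engine.feed(data_block)
        else:
            # A plain move instruction (or no aggregation instruction):
            # the block is written straight into the destination HBM.
            self.hbm.write(instruction.dest_addr, data_block)
```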
  • As an optional embodiment, the memory access engine 622 is further configured to generate an atomic command, where the atomic command includes at least one of a read command or a write command, the read command is used to instruct the memory controller to read the first data from the first memory module and send the first data to the aggregation operation engine, and the write command is used to instruct the memory controller to write the aggregation operation result into the first memory module.
  • FIG. 10 shows a schematic flowchart of the memory access engine 622 moving data.
  • When the memory access engine 622 needs to read the first data, the memory access engine 622 generates an atomic command that contains two operands indicating the source address of the first data (that is, the address at which the first data is stored in rank 1) and the destination address (that is, the address of the aggregation operation engine 621). The atomic command also includes a read command and a write command. After receiving the atomic command, the memory controller corresponding to rank 1 sends the first data from rank 1 to the aggregation operation engine 621, thereby completing the memory read operation.
  • When the memory access engine 622 needs to write the aggregation operation result into rank 1, the memory controller of rank 1, based on the received atomic command, sends the aggregation operation result from the aggregation operation engine 621 to rank 1, thereby completing the memory write operation.
  • For example, the above operands may also be immediate values; this is not elaborated in this embodiment.
  • Atomic operations (for example, the write operation and the read operation shown in FIG. 10) are operations that are not interrupted by the thread scheduling mechanism: once such an operation starts, it runs to completion without being interrupted by the operations of other threads. In this way, even if the write operation and the read operation conflict with other memory update operations during the aggregation operation, the above optional embodiment can still ensure that the aggregation operation result is not destroyed.
  • the commands of the write operation and the read operation need not be transmitted on the bus, thereby reducing the occupation of bus resources by the aggregation operation.
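  • The sketch below is an assumption-laden illustration of such an atomic command (the field names and the execute_atomically call are hypothetical, not the patent's command format).

```python
from dataclasses import dataclass

@dataclass
class AtomicCommand:
    src_addr: int        # where the first data sits in rank 1 (source operand)
    dst_addr: int        # address of the aggregation operation engine (destination operand)
    read: bool = True    # read the first data and send it to the engine
    write: bool = True   # write the aggregation result back into rank 1

def issue_inline_reduce(memory_controller, first_data_addr, engine_addr):
    cmd = AtomicCommand(src_addr=first_data_addr, dst_addr=engine_addr)
    # The whole command is handed to the memory controller in one shot; because it
    # is executed atomically, a conflicting memory update cannot corrupt the result,
    # and the individual read/write commands never travel over the bus.
    memory_controller.execute_atomically(cmd)   # hypothetical controller interface
```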
  • the aggregation operator 620 further includes:
  • a converter 623 is configured to perform a data format (also referred to as a "data type”) conversion process on an aggregate operation result.
  • The data type of the aggregation operation result generated by the aggregation operation engine 621 may be one or more of the following data types: 32-bit floating point (float32), 16-bit floating point (float16), integer (int), unsigned integer (uint), character (char), 64-bit floating point (float64), int64, and uint64. If the data type of the aggregation operation result is not the type required by the HBM, the converter 623 may convert the aggregation operation result into the data type required by the HBM, and the converter 623 then sends the converted aggregation operation result to the HBM.
  • FIG. 11 shows a schematic flowchart of a data conversion provided by the present application.
  • the data type of the aggregate operation result generated by the aggregate operation engine 621 is float32, and the data type supported by HBM is float16.
  • the converter can convert the aggregate operation result of float32 to the aggregate operation result of float16.
  • the above embodiments are merely examples, and the aggregation operation engine 621 provided in this application can support conversion of more types of data.
  • the above scheme can enable the AI processor to focus on AI computing and improve the training efficiency of deep neural networks.
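  • As a software illustration of the narrowing the converter 623 performs (the hardware converter itself is of course not NumPy), the following shows a float32 aggregation result being converted to the float16 type expected by the HBM.

```python
import numpy as np

def convert_result(aggregation_result: np.ndarray, hbm_dtype=np.float16) -> np.ndarray:
    if aggregation_result.dtype == hbm_dtype:
        return aggregation_result             # already the type the HBM requires
    return aggregation_result.astype(hbm_dtype)

# Example: a float32 aggregation result is narrowed to float16 before the HBM write.
result_f32 = np.array([1.5, 2.25, 3.125], dtype=np.float32)
result_f16 = convert_result(result_f32)       # dtype is now float16
```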
  • the aggregation operator 620 provided in the present application may support at least two operation channels, and the at least two operation channels are used to perform the aggregation operation processing in parallel.
  • As shown in FIG. 12, the current deep neural network has three rings, and the data generated by each ring forms an aggregation operation pipeline (reduce pipeline).
  • the aggregation operator 620 includes three channels, each of which is independent of each other. Each channel processes a complete aggregation operation pipeline, and multiple aggregation operation pipelines run concurrently, thereby improving the training performance of the entire deep neural network.
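  • A thread-based stand-in for the independent hardware channels is sketched below (illustrative only): each ring feeds its own aggregation pipeline, and the pipelines run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(channel_id, blocks):
    # One complete aggregation pipeline: reduce all blocks arriving on this channel.
    result = blocks[0]
    for block in blocks[1:]:
        result = [a + b for a, b in zip(result, block)]
    return channel_id, result

def run_channels(per_ring_blocks):
    # per_ring_blocks: one list of data blocks per ring (three rings in FIG. 12).
    with ThreadPoolExecutor(max_workers=len(per_ring_blocks)) as pool:
        futures = [pool.submit(run_pipeline, cid, blocks)
                   for cid, blocks in enumerate(per_ring_blocks)]
        return {cid: result for cid, result in (f.result() for f in futures)}
```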
  • Optionally, the data processing system 600 further includes the first memory module and the second memory module; that is, the first memory module, the second memory module, the aggregation operator 620, and the AI processor 610 perform data processing tasks as a whole.
  • For example, a user can purchase a data processing system 600 that includes the first memory module and the second memory module to complete deep neural network training, without separately purchasing the first memory module and the second memory module or leasing them from another vendor.
  • the first memory module and the second memory module are, for example, the HBM described above, or may be other types of memory.
  • the specific product forms of the first memory module and the second memory module are not limited in this application.
  • the data processing system 600 may further include more memory modules and / or other devices.
  • the data processing system 600 further includes a first memory module and a second memory module, which does not mean that the first memory module and the second memory module must be in the same physical entity (for example, a server).
  • the first memory module and the second memory module are located in the same server.
  • the memory access engine 622 may be a direct memory access (DMA) engine.
  • DMA direct memory access
  • As another example, the first memory module and the second memory module are located in different servers.
  • the memory access engine 622 may be a remote direct memory access (RDMA) engine.
  • The present application also provides a data processing method, which can be executed by the data processing system 600. As shown in FIG. 13, the method 1300 includes:
  • S1310: Perform an AI operation by using an AI processor in a first computing node in the data processing system to generate first data of the first computing node.
  • S1320: Perform an aggregation operation on the first data and the second data from the second computing node in the data processing system by using an aggregation operator in the first computing node to generate an aggregation operation result.
  • The method 1300 has the following beneficial effects: the number of reads and writes to the memory module of the first computing node during the aggregation operation is reduced, the number of scheduling operations is reduced, and the impact of the aggregation operation on the cache of the AI processor is avoided, so that the aggregation operation and the AI operation can be performed in parallel, thereby improving the training efficiency of deep neural networks.
  • the method 1300 further includes: using a memory access engine in the aggregation operator to obtain second data from a second memory module in the second computing node.
  • the method 1300 further includes: performing a data format conversion process on the aggregation operation result by using a converter in the aggregation operator.
  • the above scheme can enable the AI processor to focus on AI computing and improve the training efficiency of deep neural networks.
  • S1320 includes: using at least two operation channels in the aggregation operator to perform a multi-channel parallel aggregation operation on the first data and the second data.
  • the above scheme can improve the training efficiency of deep neural networks.
  • the size of the sequence number of each process does not mean the order of execution.
  • the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of this application.
  • At least one of the AI processor and the aggregation operator may be a processor including a large number of logic circuits or circuit elements, which may perform a corresponding function through a logic algorithm.
  • at least one of the AI processor and the aggregation operator may run software and complete the above-mentioned calculations by running the software.
  • The software may be composed of corresponding software modules, and the software modules may be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art.
  • the storage medium is coupled to any one of the AI processor and the aggregation operator mentioned above, so that it can read information from and write information to the storage medium.
  • the storage medium may also be an integral part of the processor. Therefore, the method flow of this embodiment can be understood as being completed by software driving hardware.
  • When the software is executed by a processor, for example by the AI processor and the aggregation operator, it can drive the AI processor and the aggregation operator to execute the method flow shown in FIG. 13; this is not limited in this embodiment.

Abstract

A data processing system (600) and a data processing method are provided. The data processing system (600) includes a first computing node, and the first computing node includes an AI processor (610) and an aggregation operator (620). The AI processor (610) is configured to perform an AI operation to generate first data of the first computing node. The aggregation operator (620) is configured to perform an aggregation operation on second data from a second computing node and the first data to generate an aggregation operation result. Because the AI processor (610) and the aggregation operator (620) can run in parallel, the number of reads and writes to the memory module of the first computing node during the aggregation operation can be reduced, the number of scheduling operations can be reduced, and the impact of the aggregation operation on the cache of the AI processor (610) can be avoided, so that the aggregation operation and the AI operation can be performed in parallel, thereby improving the training efficiency of deep neural networks.

Description

Data processing system and data processing method
Technical Field
This application relates to the field of artificial intelligence computing, and in particular, to a data processing system and a data processing method.
Background
Artificial intelligence (AI) is a method of simulating human intelligence by means of a computer, and it has broad application prospects in fields such as speech recognition, image processing, and complex games. Using deep neural networks to extract features from massive amounts of raw data and learn from them is an important reason why AI can be widely applied in the above fields. As the performance of deep neural networks improves, the network depth, the number of network parameters, the computational intensity of the algorithms, and the training data sets all increase, the computational complexity grows greatly, and as a result the training time increases substantially.
For example, taking the ResNet-50 network as an example, training on the ImageNet training data set with a commonly used high-performance server composed of eight K80 cards takes 44 hours to complete 90 epochs of training. For some new deep neural networks, multiple sets of hyper-parameters often need to be tried, and the deep neural network must be adjusted and optimized repeatedly to obtain an ideal result; existing deep neural network training methods therefore take even more time, which adversely affects the application of AI.
In the process of training a deep neural network, an aggregation operation needs to be performed on multiple pieces of data, for example, an addition operation is performed on data generated by two AI computing nodes. When performing the aggregation operation, one AI computing node (for example, AI computing node 1) needs to read data 0 from another AI computing node (for example, AI computing node 0) and write data 0 into a buffer of AI computing node 1; subsequently, AI computing node 1 reads data 1 from its own memory, sends data 1 to the AI processor, and sends data 0 from the buffer to the AI processor; after the aggregation operation on data 0 and data 1 is completed, the aggregation operation result is written into the memory of AI computing node 1. In addition, AI calculation and the aggregation operation are performed on the same processor in a time-sharing manner, and the computation efficiency is low. How to improve the efficiency of the aggregation operation therefore becomes a problem.
Summary
This application provides a data processing system and a data processing method, which can improve the efficiency of the aggregation operation.
According to a first aspect, a data processing system is provided. The system includes a first computing node, the first computing node includes an AI processor and an aggregation operator, the AI processor is configured to perform an AI operation to generate first data of the first computing node, and the aggregation operator is configured to perform an aggregation operation on second data from a second computing node and the first data to generate an aggregation operation result.
Because the above AI processor and aggregation operator can run in parallel, the data processing system provided by this application can improve the efficiency of the aggregation operation.
Optionally, the aggregation operator includes an aggregation operation engine configured to perform the aggregation operation on the first data and the second data to generate the aggregation operation result.
Optionally, the aggregation operator further includes a memory access engine configured to: obtain the second data from a second memory module of the second computing node; obtain the first data from a first memory module of the first computing node; send the first data and the second data to the aggregation operation engine; and write the aggregation operation result into the first memory module. This solution has the following beneficial effects: the number of reads and writes to the memory module of the first computing node during the aggregation operation is reduced, the number of scheduling operations is reduced, and the impact of the aggregation operation on the cache of the AI processor is avoided, so that the aggregation operation and the AI operation can be performed in parallel, thereby improving the training efficiency of deep neural networks. For the relationship between the beneficial effects and the technical features, refer to the description in the detailed embodiments.
Optionally, the memory access engine is specifically configured to: receive an aggregation operation instruction; and, according to the aggregation operation instruction, obtain the first data from the first memory module, obtain the second data from the second memory module, and send the first data and the second data to the aggregation operation engine. In this solution, the memory access engine can be controlled by software-level instructions. In addition, this solution prevents data that does not need to be aggregated from being sent to the aggregation operation engine, which improves the efficiency of data movement.
Optionally, the memory access engine is further configured to: generate an atomic command, where the atomic command includes at least one of a read command or a write command, the read command is used to instruct a memory controller to read the first data from the first memory module and send the first data to the aggregation operation engine, and the write command is used to instruct the memory controller to write the aggregation operation result into the first memory module; and send the atomic command to a memory controller of the second memory module.
The operation corresponding to the atomic command is an atomic operation. An atomic operation is an operation that will not be interrupted by the thread scheduling mechanism: once started, it runs to completion without being interrupted by the operations of other threads. In this way, even if the write operation and the read operation conflict with other memory update operations during the aggregation operation, the above optional embodiment can ensure that the aggregation operation result is not destroyed. In addition, in the above optional embodiment, the commands for the write operation and the read operation do not need to be transmitted on the bus, which reduces the occupation of bus resources by the aggregation operation.
Optionally, the memory access engine is a direct memory access (DMA) engine or a remote direct memory access (RDMA) engine.
Optionally, the aggregation operator further includes a converter configured to perform data format conversion processing on the aggregation operation result. Because the data type conversion processing does not need to be performed in the AI processor, this solution enables the AI processor to focus on AI computing and improves the training efficiency of deep neural networks.
Optionally, the first computing node further includes a first memory module, and the first memory module is configured to store the first data.
Optionally, the data processing system further includes the second computing node.
Optionally, the first computing node and the second computing node are located in different apparatuses.
Optionally, the aggregation operator includes at least two operation channels, and the at least two operation channels are used to perform aggregation operations in parallel. Each channel therefore processes one complete aggregation operation pipeline, and multiple aggregation operation pipelines run concurrently, which improves the training performance of the entire deep neural network.
According to a second aspect, this application further provides a data processing method, including: performing, by an AI processor in a first computing node in a data processing system, an AI operation to generate first data of the first computing node; and performing, by an aggregation operator in the first computing node, an aggregation operation on the first data and second data from a second computing node in the data processing system to generate an aggregation operation result.
Optionally, the method further includes: obtaining, by a memory access engine in the aggregation operator, the second data from a second memory module in the second computing node.
Optionally, the method further includes: performing, by a converter in the aggregation operator, data format conversion processing on the aggregation operation result. Because the data type conversion processing does not need to be performed in the AI processor, this solution enables the AI processor to focus on AI computing and improves the training efficiency of deep neural networks.
Optionally, performing, by the aggregation operator in the first computing node, the aggregation operation on the first data and the second data from the second computing node in the data processing system to generate the aggregation operation result includes: performing, by at least two operation channels in the aggregation operator, a multi-channel parallel aggregation operation on the first data and the second data. Because the aggregation operator can simultaneously process data generated by at least two rings, this solution can improve the training efficiency of deep neural networks.
The above method has the following beneficial effects: the number of reads and writes to the memory module of the first computing node during the aggregation operation is reduced, the number of scheduling operations is reduced, and the impact of the aggregation operation on the cache of the AI processor is avoided, so that the aggregation operation and the AI operation can be performed in parallel, thereby improving the training efficiency of deep neural networks. For the relationship between the beneficial effects and the technical features, refer to the description in the detailed embodiments.
According to a third aspect, this application further provides a computer-readable storage medium. The computer-readable storage medium stores computer program code, and when the computer program code is executed by a processing unit or a processor, the method described in the second aspect can be implemented.
According to a fourth aspect, this application provides a computer program product. The computer program product includes computer program code, and when the computer program code is run by a processing unit or a processor, the method described in the second aspect is implemented. In addition, the computer program product may be installed in the data processing system described in the first aspect, so that the data processing system implements the functions described in the first aspect.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a ring applicable to this application;
FIG. 2 is a schematic diagram of the initial state of each computing node of the ring when the ring aggregation algorithm is executed;
FIG. 3 is a schematic diagram of one step of the ring aggregation algorithm;
FIG. 4 is a schematic diagram of another step of the ring aggregation algorithm;
FIG. 5 is a schematic diagram of the end state of each computing node of the ring after the ring aggregation algorithm is executed;
FIG. 6 is a schematic diagram of a data processing system provided by this application;
FIG. 7 is a schematic diagram of another data processing system provided by this application;
FIG. 8 is a schematic diagram of an aggregation operation performed by an aggregation operation engine provided by this application;
FIG. 9 is a schematic diagram of a data movement operation performed by a memory access engine provided by this application;
FIG. 10 is another schematic diagram of a data movement operation performed by a memory access engine provided by this application;
FIG. 11 is a schematic diagram of a data format conversion operation performed by a converter provided by this application;
FIG. 12 is a schematic diagram of still another data processing system provided by this application;
FIG. 13 is a schematic diagram of a data processing method provided by this application.
Detailed Description
To improve the training efficiency of deep neural networks, one method is to train using a parallel distributed training algorithm. The procedure of the parallel distributed training algorithm is as follows:
1. Each computing node in the cluster independently completes the computation of its own mini-batch of training data to obtain the gradient;
2. All computing nodes in the cluster aggregate the computed gradients to form the aggregated gradient;
3. The aggregated gradient is distributed to each computing node in the cluster;
4. Each computing node computes new parameter values based on the aggregated gradient, combined with hyper-parameters such as the learning rate;
5. All computing nodes can start the next round of iterative computation only after obtaining the new parameters.
It can be seen from the above training algorithm that inter-node gradient aggregation is not only on the critical path but also very frequent. Therefore, in a parallel distributed training solution, gradient aggregation between computing nodes is the key factor affecting training efficiency.
To perform gradient aggregation efficiently, the ring aggregation (ring all-reduce) algorithm is currently commonly used in academia and industry, and the logical structure of the ring is shown in FIG. 1.
In FIG. 1, the ring includes five AI computing nodes, and each AI computing node is, for example, an AI chip. Each AI computing node has a predecessor node and a successor node, and the position of each AI computing node in the ring is determined by the creator of the ring (for example, user software). For example, the predecessor node of AI computing node 0 is AI computing node 4, and the successor node of AI computing node 0 is AI computing node 1. Each AI computing node can receive data from its predecessor node and can also send its own data to its successor node. Multiple computing nodes are located within one system. The system is a cluster formed by one device or multiple devices. Each computing node may be an apparatus or a device, or multiple computing nodes may be located in one apparatus or device. The apparatus or device may be any of various types of electronic equipment, including but not limited to a server, a mainframe, a minicomputer, a portable computer, or a terminal. Each node may be a computing element in an apparatus or device, for example, a chip, a chipset, or a circuit board carrying a chip or chipset.
Taking the ring shown in FIG. 1 as an example, in the preparation phase of the ring aggregation algorithm, the creator of the ring (for example, user software) sends control information to each AI computing node and slices the data, and the gradient data calculated by each AI computing node is divided equally into five blocks. For example, the gradient data calculated by the five AI computing nodes shown in FIG. 1 are a, b, c, d, and e respectively, each AI computing node holds the complete data calculated by itself, and the initial state of the five AI computing nodes is shown in FIG. 2.
Subsequently, the five AI computing nodes enter the scatter-reduce phase. Each AI computing node sends one of its data blocks to its successor node, and performs aggregation processing on the data received from its predecessor node and the data stored by itself.
FIG. 3 shows one step of the scatter-reduce phase. In this step, AI computing node 0 sends data block a0 to AI computing node 1, and after receiving data block a0, AI computing node 1 performs an aggregation operation on a0 and the data block a1 stored by itself. At the same time, AI computing node 1 sends data block b1 to AI computing node 2, and after receiving data block b1, AI computing node 2 performs an aggregation operation on b1 and the data block b2 stored by itself. The operations of the other AI computing nodes are similar.
FIG. 4 shows another step of the scatter-reduce phase. In this step, taking AI computing node 0 as an example, AI computing node 0 receives the data b4+b3+b2+b1 from its predecessor node (AI computing node 4) and performs an aggregation operation on this data and the data b0 stored by itself, and the obtained aggregation operation result is b0+b1+b2+b3+b4. While receiving the data b4+b3+b2+b1, AI computing node 0 sends its stored data c0+c4+c3+c2 to its successor node (AI computing node 1) so that the successor node can perform its gradient aggregation operation.
After the scatter-reduce phase is completed, the ring aggregation algorithm proceeds to the next phase, namely the all-gather phase. In the all-gather phase, the ring shown in FIG. 1 sends the final result obtained by each AI computing node to the other AI computing nodes through four passes. For example, the final result obtained by AI computing node 0 from the aggregation operation on data b is b0+b1+b2+b3+b4; AI computing node 0 passes this result to AI computing node 1, AI computing node 1 passes it to AI computing node 2, and so on. After four passes, every AI computing node has obtained the final result of the aggregation operation on data b. Similarly, for the other four pieces of data (a, c, d, and e), after four passes each AI computing node also obtains the final result of the aggregation operation on each piece of data, as shown in FIG. 5.
FIG. 6 shows a data processing system provided by this application, which can reduce the number of memory reads and writes in the aggregation operation, thereby improving the training efficiency of deep neural networks.
As shown in FIG. 6, the data processing system 600 includes a first computing node, and the first computing node includes an AI processor 610 and an aggregation operator 620, where:
the AI processor 610 is configured to perform an AI operation to generate first data of the first computing node; and
the aggregation operator 620 is configured to perform an aggregation operation on second data from a second computing node and the first data to generate an aggregation operation result.
The AI processor 610 is, for example, a neural network processor, such as a matrix operation array.
The aggregation operator 620 is, for example, an addition operator, a multiplication operator, a maximum-value operator, or a minimum-value operator, and may also be another type of device or logic circuit for performing an aggregation operation.
The AI processor 610 is a unit dedicated to artificial intelligence computing, also called a neural-network processing unit (NPU), and may be, for example, a convolutional neural network (CNN) calculator, a recurrent neural network (RNN) calculator, or another neural processing unit with a similar function.
The aggregation operator 620 may be a general-purpose processor, a digital signal processor (DSP), or a hardware accelerator, for example, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
In this application, an aggregation operation refers to an operation performed on at least two pieces of data according to a preset rule, and may be one or more of an addition operation, a subtraction operation, a multiplication operation, a division operation, a maximum-value operation, and a minimum-value operation, or may be another type of operation.
For example, the aggregation operator 620 may perform an addition operation on the first data and the second data, and the obtained result is the sum of the two pieces of data. Alternatively, the aggregation operator 620 may perform a maximum-value operation on the first data and the second data, and the obtained result is the piece of data with the larger value. Alternatively, the aggregation operator 620 may first perform a subtraction operation on the first data and the second data and then multiply the result of the subtraction by the first data or the second data.
The AI processor 610 and the aggregation operator 620 may be two physically separate devices, for example, located on two motherboards respectively. The AI processor 610 and the aggregation operator 620 may also be two physically inseparable devices, for example, two devices located on one system-on-chip (SOC).
The above descriptions of the AI processor 610 and the aggregation operator 620 are merely examples and should not be construed as limiting the data processing system provided by this application.
The first data is, for example, the data c1 in the memory module of AI computing node 1 shown in FIG. 4, the second data is, for example, the data c0+c4+c3+c2 stored in the memory module of AI computing node 0 shown in FIG. 4, and the AI processor 610 is, for example, the processor in AI computing node 1 in FIG. 3. When the controller of the data processing system 600, for example a central processing unit (CPU), needs to schedule c0+c4+c3+c2 and c1 to complete an aggregation operation, the aggregation operator 620 can read c0+c4+c3+c2 from the memory module of AI computing node 0, read c1 from the memory module of AI computing node 1, and then perform an aggregation operation (for example, an addition operation) on c0+c4+c3+c2 and c1 to obtain the aggregation operation result c0+c1+c2+c3+c4. The aggregation operator 620 writes the aggregation operation result into the memory module of AI computing node 1, thereby completing one gradient aggregation operation of the deep neural network.
In the above example, AI computing node 1 undergoes only one read operation and one write operation. Compared with the aggregation operation method in the background, the aggregation operation method provided by the above example reduces the consumption of the memory bandwidth resources of AI computing node 1 by the aggregation operation, and the saved memory bandwidth resources can be used for other AI calculations, thereby improving the training efficiency of deep neural networks.
Second, the aggregation operator 620 is capable of reading data from the memory module of AI computing node 0 and from the memory module of AI computing node 1, so the data processing system 600 needs only one scheduling operation (that is, inline reduce) to complete one aggregation operation. Compared with the aggregation operation apparatus in the prior art, this saves the time required for one copy scheduling and likewise improves the training efficiency of deep neural networks.
Third, AI calculation (for example, deep neural network training) uses a single instruction multiple thread (SIMT) algorithm architecture, that is, the processor can process only one or more data streams based on one instruction at any given moment, whereas AI calculation and the aggregation operation correspond to two different instruction stream sequences, which means that AI calculation and the aggregation operation in the prior art have to be performed serially. In the data processing system 600 provided by this application, AI calculation and the aggregation operation are performed in different modules; therefore, the data processing system 600 can process AI calculation tasks and aggregation operation tasks in parallel, which improves the training efficiency of deep neural networks.
Furthermore, in the prior art, AI calculation and the aggregation operation are performed in the same processor. When performing AI calculation, the processor needs to read AI-calculation-related data from memory and write the data into the cache. When performing the aggregation operation, the processor needs to read aggregation-operation-related data from memory and write the data into the cache. If the processor executes AI calculation and the aggregation operation serially, the aggregation-operation-related data stored in the cache pollutes the AI-calculation-related data, so that after completing the aggregation operation the processor needs to read the AI-calculation-related data from memory again and write it into the cache; this affects the cache hit rate of the AI calculation and increases the pressure on the cache system, which negatively affects the efficiency of the AI calculation.
In the data processing system 600, because the aggregation operation is not executed in the AI processor 610, the data related to the aggregation operation does not enter the AI processor 610, which avoids polluting the AI-calculation-related data in the cache; that is, the cache hit rate of the AI calculation is not affected and the pressure on the cache system is reduced, thereby improving the training efficiency of deep neural networks.
It should be understood that the above examples merely use a deep neural network as an example to describe the data processing system provided by this application. The data processing system provided by this application is applicable not only to deep neural networks but also to other scenarios in which data aggregation operations need to be performed among multiple computing nodes, for example, in the field of supercomputing.
In the data processing system 600, the aggregation operator 620 may include an aggregation operation engine (reduce engine) 621, as shown in FIG. 7. The aggregation operation engine 621 is configured to perform an aggregation operation on the first data and the second data to generate an aggregation operation result. The CPU in FIG. 7 is configured to schedule the first computing node and the second computing node to execute tasks, for example, to execute an AI computing task or an aggregation operation task. The CPU is merely an example, and the data processing system 600 may also include another type of controller or scheduler.
FIG. 8 shows a schematic flowchart of an aggregation operation performed by an aggregation operation engine 621 provided by this application. The aggregation operation engine 621 may receive data input from the memory access engine 622 described below, and may also receive data input from rank 1; it performs an aggregation operation on the received data and then writes the aggregation operation result into the HBM.
The aggregation operation types supported by the aggregation operation engine 621 are, for example, one or more of the addition operation, subtraction operation, multiplication operation, division operation, maximum-value operation, and minimum-value operation described above.
The aggregation operator 620 may further include a memory access engine 622, and the memory access engine 622 is configured to:
obtain the first data from the first memory module;
obtain the second data from the second memory module;
send the first data and the second data to the aggregation operation engine 621; and
write the aggregation operation result into the first memory module.
The first memory module is, for example, a high bandwidth memory (HBM) of the first computing node, and the second memory module is, for example, an HBM of the second computing node. One or more data chunks are stored in the HBM of the first computing node, and the one or more data chunks form rank 1. Similarly, one or more data chunks are stored in the HBM of the second computing node, and the one or more data chunks form rank 0.
As shown in FIG. 7, the memory access engine 622 reads data block #0 from rank 0 (that is, the second data, for example c0+c4+c3+c2), reads data block #0 from rank 1 (that is, the first data, for example c1), and sends the two data blocks #0 to the aggregation operation engine 621; after the aggregation operation engine 621 completes the aggregation operation, the aggregation operation result is written into rank 1.
The memory access engine 622 moves data in a manner completed entirely by hardware, without the participation of a central processing unit (CPU). Through a mechanism independent of the CPU, data is moved between main memory and a buffer, between main memory and main memory, or between main memory and a peripheral. For example, the memory access engine 622 accepts a move task from software through a descriptor, controls hardware (for example, a chip circuit) to complete the move operation, and then notifies the software of the move-completion status through a descriptor or an interrupt. Because this scheme does not require CPU participation, it frees up the processing capability of the CPU and achieves high-bandwidth, low-latency data movement.
In addition, the memory access engine 622 also has single-data-stream processing logic; that is, the memory access engine 622 determines, according to the instruction type, whether an aggregation operation needs to be performed on the current data stream. The instruction comes from software; for example, software run by the CPU may generate the instruction.
As shown in FIG. 9, when the memory access engine 622 receives an aggregation operation instruction, where the aggregation operation instruction is used to instruct that the aggregation operation be performed on the first data and the second data, the memory access engine 622 sends the first data to the aggregation operation engine 621. When the memory access engine 622 does not receive the aggregation operation instruction, or when the memory access engine 622 receives a move instruction, the memory access engine 622 sends the first data to the HBM of the first computing node.
The above solution prevents data that does not need to be aggregated from being sent to the aggregation operation engine 621, which improves the efficiency of data movement.
As an optional embodiment, the memory access engine 622 is further configured to:
generate an atomic command, where the atomic command includes at least one of a read command or a write command, the read command is used to instruct a memory controller to read the first data from the first memory module and send the first data to the aggregation operation engine, and the write command is used to instruct the memory controller to write the aggregation operation result into the first memory module; and
send the above atomic command to the memory controller of the first memory module.
FIG. 10 shows a schematic flowchart of data movement by the memory access engine 622.
When the memory access engine 622 needs to read the first data, the memory access engine 622 generates an atomic command. The atomic command includes two operands indicating the source address of the first data (that is, the address at which the first data is stored in rank 1) and the destination address (that is, the address of the aggregation operation engine 621), and the atomic command further includes a read command and a write command. After receiving the atomic command, the memory controller corresponding to rank 1 sends the first data from rank 1 to the aggregation operation engine 621, thereby completing the memory read operation.
When the memory access engine 622 needs to write the aggregation operation result into rank 1, the memory controller of rank 1, based on the received atomic command, sends the aggregation operation result from the aggregation operation engine 621 to rank 1, thereby completing the memory write operation. For example, the above operands may also be immediate values; this is not elaborated in this embodiment.
The operations corresponding to the atomic command are atomic operations (for example, the write operation and the read operation shown in FIG. 10). An atomic operation is an operation that will not be interrupted by the thread scheduling mechanism: once started, it runs to completion without being interrupted by the operations of other threads. In this way, even if the write operation and the read operation conflict with other memory update operations during the aggregation operation, the above optional embodiment can ensure that the aggregation operation result is not destroyed.
In addition, in the above optional embodiment, the commands for the write operation and the read operation do not need to be transmitted on the bus, which reduces the occupation of bus resources by the aggregation operation.
As another optional embodiment, the aggregation operator 620 further includes:
a converter 623, configured to perform data format (also referred to as "data type") conversion processing on the aggregation operation result.
The data type of the aggregation operation result generated by the aggregation operation engine 621 may be one or more of the following data types: 32-bit floating point (float32), 16-bit floating point (float16), integer (int), unsigned integer (uint), character (char), 64-bit floating point (float64), int64, and uint64. If the data type of the aggregation operation result is not the type required by the HBM, the converter 623 may convert the aggregation operation result into the data type required by the HBM, and the converter 623 then sends the converted aggregation operation result to the HBM.
FIG. 11 shows a schematic flowchart of data conversion provided by this application.
If the data type of the aggregation operation result generated by the aggregation operation engine 621 is float32 and the data type supported by the HBM is float16, the converter can convert the float32 aggregation operation result into a float16 aggregation operation result.
The above embodiment is merely an example, and the aggregation operation engine 621 provided by this application can support conversion between more data types.
Because the data type conversion processing does not need to be performed in the AI processor, the above solution enables the AI processor to focus on AI computing and improves the training efficiency of deep neural networks.
During the training of a deep neural network, multiple rings usually run in parallel. The aggregation operator 620 provided by this application may support at least two operation channels, and the at least two operation channels are used to perform aggregation operation processing in parallel.
As shown in FIG. 12, the current deep neural network has three rings, and the data generated by each ring forms one aggregation operation pipeline (reduce pipeline). The aggregation operator 620 includes three channels that are independent of one another; each channel processes one complete aggregation operation pipeline, and multiple aggregation operation pipelines run concurrently, which improves the training performance of the entire deep neural network.
Optionally, the data processing system 600 further includes the first memory module and the second memory module; that is, the first memory module, the second memory module, the aggregation operator 620, and the AI processor 610 perform data processing tasks as a whole. For example, a user can purchase a data processing system 600 that includes the first memory module and the second memory module to complete deep neural network training, without separately purchasing the first memory module and the second memory module or leasing the first memory module and the second memory module from another vendor. The first memory module and the second memory module are, for example, the HBM described above, or may be other types of memory; the specific product forms of the first memory module and the second memory module are not limited in this application.
It can be understood that the data processing system 600 may also include more memory modules and/or other devices.
It should be noted that the data processing system 600 further including the first memory module and the second memory module does not mean that the first memory module and the second memory module must be in the same physical entity (for example, a server).
For example, if the first memory module and the second memory module are located in the same server, the memory access engine 622 may be a direct memory access (DMA) engine.
As another example, if the first memory module and the second memory module are located in different servers, the memory access engine 622 may be a remote direct memory access (RDMA) engine.
This application further provides a data processing method, which may be executed by the data processing system 600. As shown in FIG. 13, the method 1300 includes:
S1310: Perform, by an AI processor in a first computing node in a data processing system, an AI operation to generate first data of the first computing node.
S1320: Perform, by an aggregation operator in the first computing node, an aggregation operation on the first data and second data from a second computing node in the data processing system to generate an aggregation operation result.
A person skilled in the art can understand that, for the specific implementation of each step in the method 1300, reference may be made to the process in which the aggregation operator 620 in the data processing system 600 processes data; for brevity, details are not described here again.
Therefore, the method 1300 has the following beneficial effects: the number of reads and writes to the memory module of the first computing node during the aggregation operation is reduced, the number of scheduling operations is reduced, and the impact of the aggregation operation on the cache of the AI processor is avoided, so that the aggregation operation and the AI operation can be performed in parallel, thereby improving the training efficiency of deep neural networks.
Optionally, the method 1300 further includes: obtaining, by a memory access engine in the aggregation operator, the second data from a second memory module in the second computing node.
Optionally, the method 1300 further includes: performing, by a converter in the aggregation operator, data format conversion processing on the aggregation operation result.
Because the data type conversion processing does not need to be performed in the AI processor, the above solution enables the AI processor to focus on AI computing and improves the training efficiency of deep neural networks.
Optionally, S1320 includes: performing, by at least two operation channels in the aggregation operator, a multi-channel parallel aggregation operation on the first data and the second data.
Because the aggregation operator can simultaneously process data generated by at least two rings, the above solution can improve the training efficiency of deep neural networks.
In the embodiments of this application, the magnitude of the sequence number of each process does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of this application.
It can be understood that at least one of the AI processor and the aggregation operator may be a processor including a large number of logic circuits or circuit elements, which can perform the corresponding function through a logic algorithm. Alternatively, at least one of the AI processor and the aggregation operator may run software and complete the above computations by running the software. It can be understood that the software (or software instructions) may consist of corresponding software modules, and the software modules may be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. As an optional example, the storage medium is coupled to any one of the AI processor and the aggregation operator mentioned above so that it can read information from, and write information to, the storage medium. Of course, the storage medium may also be an integral part of the processor. Therefore, the method flow of this embodiment can be understood as being completed by software driving hardware; when the software is executed by a processor, for example by the AI processor and the aggregation operator, it can drive the AI processor and the aggregation operator to work so as to execute the method flow shown in FIG. 13, which is not limited in this embodiment.
In addition, the term "and/or" in this specification is merely an association relationship describing associated objects and indicates that three relationships may exist. For example, A and/or B may indicate the following three cases: only A exists, both A and B exist, and only B exists. In addition, the character "/" in this specification generally indicates an "or" relationship between the associated objects.
The specific embodiments described above further describe the objectives, technical solutions, and beneficial effects of this application in detail. It should be understood that the above descriptions are merely specific embodiments of this application and are not intended to limit the protection scope of this application; any modification, equivalent replacement, improvement, and the like made on the basis of the technical solutions of this application shall fall within the protection scope of this application.

Claims (12)

  1. A data processing system, comprising a first computing node, wherein the first computing node comprises an artificial intelligence (AI) processor and an aggregation operator, wherein
    the AI processor is configured to: perform an AI operation to generate first data of the first computing node; and
    the aggregation operator is configured to: perform an aggregation operation on second data from a second computing node and the first data to generate an aggregation operation result.
  2. The data processing system according to claim 1, wherein the aggregation operator comprises:
    an aggregation operation engine, configured to: perform the aggregation operation on the first data and the second data to generate the aggregation operation result.
  3. The data processing system according to claim 2, wherein the aggregation operator further comprises:
    a memory access engine, configured to: obtain the second data from a second memory module of the second computing node; obtain the first data from a first memory module of the first computing node; send the first data and the second data to the aggregation operation engine; and write the aggregation operation result into the first memory module.
  4. The data processing system according to claim 3, wherein the memory access engine is specifically configured to:
    receive an aggregation operation instruction; and
    according to the aggregation operation instruction: obtain the first data from the first memory module, obtain the second data from the second memory module, and send the first data and the second data to the aggregation operation engine.
  5. The data processing system according to claim 3 or 4, wherein the memory access engine is further configured to:
    generate an atomic command, wherein the atomic command comprises at least one of a read command or a write command, the read command is used to instruct a memory controller to read the first data from the first memory module and send the first data to the aggregation operation engine, and the write command is used to instruct the memory controller to write the aggregation operation result into the first memory module; and
    send the atomic command to a memory controller of the second memory module.
  6. The data processing system according to any one of claims 3 to 5, wherein the memory access engine is a direct memory access (DMA) engine or a remote direct memory access (RDMA) engine.
  7. The data processing system according to any one of claims 2 to 6, wherein the aggregation operator further comprises:
    a converter, configured to perform data format conversion processing on the aggregation operation result.
  8. The data processing system according to any one of claims 1 to 7, wherein the first computing node further comprises the first memory module, and the first memory module is configured to store the first data.
  9. The data processing system according to any one of claims 1 to 8, further comprising the second computing node.
  10. The data processing system according to any one of claims 1 to 9, wherein the first computing node and the second computing node are located in different apparatuses.
  11. The data processing system according to any one of claims 1 to 10, wherein the aggregation operator comprises at least two operation channels, and the at least two operation channels are configured to perform aggregation operations in parallel.
  12. The data processing system according to any one of claims 1 to 11, wherein the AI processor and the aggregation operator are capable of running in parallel.
PCT/CN2018/103669 2018-08-31 2018-08-31 数据处理系统和数据处理方法 WO2020042182A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP18931858.7A EP3819788A4 (en) 2018-08-31 2018-08-31 DATA PROCESSING SYSTEM AND DATA PROCESSING METHODS
CN201880091518.7A CN111886593A (zh) 2018-08-31 2018-08-31 数据处理系统和数据处理方法
PCT/CN2018/103669 WO2020042182A1 (zh) 2018-08-31 2018-08-31 数据处理系统和数据处理方法
US17/173,691 US20210166156A1 (en) 2018-08-31 2021-02-11 Data processing system and data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/103669 WO2020042182A1 (zh) 2018-08-31 2018-08-31 数据处理系统和数据处理方法

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/173,691 Continuation US20210166156A1 (en) 2018-08-31 2021-02-11 Data processing system and data processing method

Publications (1)

Publication Number Publication Date
WO2020042182A1 true WO2020042182A1 (zh) 2020-03-05

Family

ID=69643150

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/103669 WO2020042182A1 (zh) 2018-08-31 2018-08-31 数据处理系统和数据处理方法

Country Status (4)

Country Link
US (1) US20210166156A1 (zh)
EP (1) EP3819788A4 (zh)
CN (1) CN111886593A (zh)
WO (1) WO2020042182A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297111A (zh) * 2021-06-11 2021-08-24 上海壁仞智能科技有限公司 人工智能芯片及其操作方法

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11842260B2 (en) * 2020-09-25 2023-12-12 International Business Machines Corporation Incremental and decentralized model pruning in federated machine learning
CN115221091A (zh) * 2021-04-21 2022-10-21 华为技术有限公司 一种聚合通信的方法、系统和计算机设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103092886A (zh) * 2011-11-07 2013-05-08 中国移动通信集团公司 一种数据查询操作的实现方法、装置及系统
CN103559247A (zh) * 2013-10-29 2014-02-05 北京华胜天成科技股份有限公司 一种数据业务处理方法及装置
CN105760395A (zh) * 2014-12-18 2016-07-13 华为技术有限公司 一种数据处理的方法、装置及系统
CN107545005A (zh) * 2016-06-28 2018-01-05 华为软件技术有限公司 一种数据处理方法及装置

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7870309B2 (en) * 2008-12-23 2011-01-11 International Business Machines Corporation Multithreaded programmable direct memory access engine
US20180046903A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Deep processing unit (dpu) for implementing an artificial neural network (ann)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103092886A (zh) * 2011-11-07 2013-05-08 中国移动通信集团公司 一种数据查询操作的实现方法、装置及系统
CN103559247A (zh) * 2013-10-29 2014-02-05 北京华胜天成科技股份有限公司 一种数据业务处理方法及装置
CN105760395A (zh) * 2014-12-18 2016-07-13 华为技术有限公司 一种数据处理的方法、装置及系统
CN107545005A (zh) * 2016-06-28 2018-01-05 华为软件技术有限公司 一种数据处理方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3819788A4 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297111A (zh) * 2021-06-11 2021-08-24 上海壁仞智能科技有限公司 人工智能芯片及其操作方法

Also Published As

Publication number Publication date
CN111886593A (zh) 2020-11-03
EP3819788A4 (en) 2021-07-14
EP3819788A1 (en) 2021-05-12
US20210166156A1 (en) 2021-06-03

Similar Documents

Publication Publication Date Title
CN110689138B (zh) 运算方法、装置及相关产品
WO2017124644A1 (zh) 一种人工神经网络压缩编码装置和方法
US20210166156A1 (en) Data processing system and data processing method
US11403104B2 (en) Neural network processor, chip and electronic device
JP7256811B2 (ja) アドバンストインタコネクト技術を利用してaiトレーニングを加速するための方法及びシステム
Ghasemi et al. Accelerating apache spark big data analysis with fpgas
Gupta Introduction to hardware accelerator systems for artificial intelligence and machine learning
WO2021083101A1 (zh) 数据处理方法、装置及相关产品
Li et al. A system-level solution for low-power object detection
KR20210084220A (ko) 부분 판독/기입을 갖는 재구성 가능한 시스톨릭 어레이를 위한 시스템 및 방법
CN117271953A (zh) 一种用于优化快速傅里叶变换的存内计算加速电路及方法
Bira et al. Energy-efficient computation of l1 and l2 norms on a FPGA SIMD accelerator, with applications to visual search
TWI775151B (zh) 圖形處理器及其矩陣運算的加速方法
WO2021082746A1 (zh) 运算装置及相关产品
WO2019134084A1 (zh) 代码执行方法、装置、终端设备及计算机可读存储介质
US10769527B2 (en) Accelerating artificial neural network computations by skipping input values
US20210150311A1 (en) Data layout conscious processing in memory architecture for executing neural network model
Zhou et al. Accelerating distributed deep learning training with compression assisted allgather and reduce-scatter communication
WO2020194032A1 (en) Accelerating neuron computations in artificial neural networks by skipping bits
Zhao et al. Accelerating depthwise separable convolutions with vector processor
Wang et al. A new scheme for cache optimization based on cluster computing framework spark
US20230051344A1 (en) Optimization of memory use for efficient neural network execution
WO2021082747A1 (zh) 运算装置及相关产品
US20230043584A1 (en) Optimization of memory use for efficient neural network execution
Yong et al. Efficient parallel recursive Gaussian SIFT algorithm based on multi-core DSP

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18931858

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE