WO2022222578A1 - Method, system and computer device for aggregated communication - Google Patents

Method, system and computer device for aggregated communication

Info

Publication number
WO2022222578A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
computing chip
communication
computing
chip
Prior art date
Application number
PCT/CN2022/075620
Other languages
English (en)
French (fr)
Inventor
端启航
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP22790688.0A priority Critical patent/EP4310687A4/en
Publication of WO2022222578A1 publication Critical patent/WO2022222578A1/zh
Priority to US18/488,454 priority patent/US20240045828A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17337Direct connection machines, e.g. completely connected computers, point to point communication networks
    • G06F15/17343Direct connection machines, e.g. completely connected computers, point to point communication networks wherein the interconnection is dynamically configurable, e.g. having loosely coupled nearest neighbor architecture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1744Redundancy elimination performed by the file system using compression, e.g. sparse files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17318Parallel communications techniques, e.g. gather, scatter, reduce, roadcast, multicast, all to all
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/11File system administration, e.g. details of archiving or snapshots
    • G06F16/119Details of migration of file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/542Event management; Broadcasting; Multicasting; Notifications
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/545Interprogram communication where tasks reside in different layers, e.g. user- and kernel-space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Definitions

  • the present application relates to the field of computers, and in particular, to a method, system and computer device for aggregated communication.
  • Recommendation algorithms can use deep learning algorithms to combine user characteristics and product characteristics to find, from a large number of products, the products that a user is interested in when the user's purchase intention is unclear, improving the user's purchase efficiency and product experience; recommendation algorithms have thus become the profit core of many leading Internet companies.
  • Recommendation algorithms usually use multiple GPUs to complete computing tasks in parallel, and combine aggregation communication technology to complete data migration between multiple processors.
  • One vendor has launched two hardware systems, together with a GPU aggregation communication library matched to those hardware systems, to support aggregated communication between GPUs.
  • the present application provides a method, system and computer equipment for aggregated communication, thereby providing an efficient method for aggregated communication of sparse data and improving the ability of the aggregated communication system to process algorithms that need to complete data transmission between multiple computing chips.
  • a method for aggregated communication is provided, which can be applied to an aggregated communication system. The system includes at least a first computing chip and a second computing chip, wherein the first computing chip communicates with the second computing chip through at least one communication channel. The method includes: the first computing chip compresses first data and sends the compressed first data to the second computing chip through a communication channel, and then the second computing chip performs an operation according to the compressed first data.
  • the first computing chip can compress the original data and send it to the second chip, thereby reducing the amount of data transmission between the first chip and the second chip.
  • the first computing chip communicates with the second computing chip through one communication channel, wherein the second computing chip is the root node of the communication channel, and the method for the second computing chip to perform the operation according to the compressed first data may include: the second computing chip aggregates the compressed first data and second data, wherein the second data is the data to be communicated on the second computing chip; and the second computing chip sends the aggregation result to the first computing chip.
  • In this way, the first computing chip can send the compressed data to the second computing chip through one channel, and the second computing chip aggregates the data and sends it back to the first computing chip to obtain the result of the allreduce operation in aggregated communication, reducing the execution time of the allreduce operation.
  • the aggregated communication system further includes a processor, the first computing chip communicates with the second computing chip through a communication channel, wherein the second computing chip is the root node of the communication channel, and the method includes: the first computing chip compresses the first data; the second computing chip compresses the second data.
  • the processor obtains the sizes of the compressed first data and the compressed second data, and calls the allgather interface in the communication library to send the compressed first data to the second computing chip and the compressed second data to the first computing chip.
  • the first computing chip aggregates the compressed second data and the first data; the second computing chip aggregates the compressed first data and the second data.
  • In this way, the processor can call the interface in an existing communication library to complete the allreduce operation, which improves the efficiency of the allreduce operation without changing a large amount of code.
  • the first computing chip also communicates with the second computing chip through a communication channel, where the second computing chip is the root node of the communication channel, and the method for the second computing chip to perform the operation according to the compressed first data may include: the second computing chip combines the compressed first data with the second data, wherein the second data is the data to be communicated on the second computing chip; and the second computing chip sends the combined result to the first computing chip.
  • In this way, the first computing chip can send the compressed data to the second computing chip through one channel, and the second computing chip combines the data and sends it back to the first computing chip to obtain the result of the allgather operation in aggregated communication, reducing the execution time of the allgather operation.
  • the first computing chip communicates with the second computing chip through multiple communication channels, wherein the multiple communication channels include a first communication channel, and the method for the first computing chip to send the compressed first data to the second computing chip through the communication channels may include: the first computing chip sends a part of the compressed first data to the second computing chip through the first communication channel, wherein the second computing chip is the root node of the first communication channel.
  • the method for the second computing chip to perform the operation according to the compressed first data may include: the second computing chip aggregates the part of the compressed first data and a part of the second data, wherein the second data is the data to be communicated on the second computing chip.
  • the first computing chip can send the compressed data to the root node of each channel through each of the multiple channels.
  • When the root node of one of the channels is the second computing chip, the first computing chip sends the compressed data to the second computing chip through this channel, and the second computing chip aggregates the data to obtain the result of the reduce-scatter operation in aggregated communication, reducing the execution time of the reduce-scatter operation.
  • the aggregated communication system may be used to recommend products to users by using a recommendation model in combination with user characteristics and commodity characteristics.
  • the method includes: the first computing chip converts the features of the user and the features of the product into the first data according to the embedding table; the method further includes: the second computing chip inputs the operation result, obtained by the second computing chip according to the compressed first data, into the recommendation model to obtain the updated value of the embedding table and the updated value of the recommendation model; then the second computing chip updates the recommendation model according to the updated value of the recommendation model, and updates the embedding table according to the updated value of the embedding table.
  • the method for aggregated communication proposed in this application can be used, in combination with a recommendation model, to recommend products to users; in the operation of obtaining the input values of the recommendation model according to the embedding table, it improves the efficiency of data transmission between the first computing chip and the second computing chip and reduces the time it takes to recommend products to users.
  • the aggregated communication system may be used to recommend products to users by using a recommendation model in combination with user characteristics and product characteristics.
  • the method includes: the first computing chip converts the characteristics of the user and the characteristics of the commodity into fourth data according to the embedding table; then the second computing chip inputs the fourth data into the recommendation model to obtain the first data and the updated value of the recommendation model; the method further includes: the second computing chip updates the recommendation model according to the updated value of the recommendation model, and updates the embedding table according to the operation result obtained by the second computing chip performing the operation according to the compressed first data.
  • the method for aggregated communication proposed in the present application can be used, in combination with a recommendation model, to recommend products to users; in the operation of updating the embedding table, it improves the efficiency of data transmission between the first computing chip and the second computing chip and reduces the time it takes to recommend products to users.
  • the aggregated communication system may be used to recommend products to users by using a recommendation model in combination with user characteristics and commodity characteristics.
  • the method includes: the first computing chip converts the user's features and the product's features into a query vector according to the embedding table, compresses the query vector, and then obtains the first data according to the compressed query vector; the method further includes: the second computing chip obtains the embedding vector according to the operation result obtained by operating on the compressed first data, and inputs the embedding vector into the recommendation model to obtain the updated value of the embedding table and the updated value of the recommendation model; then the second computing chip updates the recommendation model according to the updated value of the recommendation model, and updates the embedding table according to the updated value of the embedding table and the compressed query vector.
  • the number of data with a value of 0 in the first data is greater than the number of data with a value other than 0.
  • the computing chip includes: one or more of a graphics processor, a tensor processor, a neural network processor, and a deep learning processor.
  • the present application provides a system for aggregated communication, including at least a first computing chip and a second computing chip, wherein the first computing chip communicates with the second computing chip through at least one communication channel, and the system for aggregated communication is used for The operation steps of the method performed by the corresponding subject in any of the above-mentioned first aspect and any possible implementation manner of the first aspect are implemented.
  • the present application provides a computer device, the computer device includes a processor, a memory, a first computing chip, and a second computing chip, where the memory is used to store computer execution instructions, and when the computer device runs,
  • the processor executes computer-executable instructions in the memory to use the first computing chip and the second computing chip to perform the operation steps of the method in the first aspect or any possible implementation manner of the first aspect.
  • the present application provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the computer-readable storage medium is run on a computer, the computer may execute the first aspect or any one of the first aspects. Operation steps of the method described in the implementation manner.
  • the present application provides a computer program product comprising instructions that, when run on a computer, cause the computer to perform the operation steps of the method described in the first aspect or any possible implementation manner of the first aspect.
  • On the basis of the implementations provided by the above aspects, the present application may further combine them to provide more implementations.
  • FIG. 1 is a schematic structural diagram of a system 100 for aggregated communication provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a data flow of a data migration operation provided by the present application.
  • FIG. 3 is a schematic diagram of a data flow of an aggregation operation operation provided by the present application.
  • FIG. 4 is a schematic flowchart of a method for aggregating communication provided by the present application.
  • FIG. 5 is a schematic flowchart of another method for aggregating communication provided by the present application.
  • FIG. 6 is a schematic flowchart of another method for aggregating communication provided by the present application.
  • FIG. 7 is a schematic flowchart of another method for aggregating communication provided by the present application.
  • FIG. 8 is a schematic flowchart of another method for aggregating communication provided by the present application.
  • FIG. 9 is a schematic flowchart of another method for aggregating communication provided by the present application.
  • FIG. 1 is a schematic structural diagram of a system 100 for aggregated communication provided by an embodiment of the present application.
  • the system 100 includes a device 110 .
  • the device 110 may be a device with computing capability (for example, a server) for independently completing computing tasks involved in deep learning.
  • Device 110 includes processor 111 , memory 112 , communication interface 114 , and at least two computing chips, eg, computing chip 1131 and computing chip 1132 .
  • the processor 111, the memory 112, the communication interface 114 and all the GPUs are connected by a bus, for example, Peripheral Component Interconnect Express (PCIe).
  • the bus can also be other types of buses that implement connections between devices within the device.
  • the bus may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus. The computing chips may also be connected to each other through other dedicated buses.
  • the processor 111 is configured to execute computer-executed instructions stored in the memory 112 to implement the functions of the device 110 .
  • the processor 111 may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • Memory 112 may include read only memory and random access memory, and provides instructions and data to processor 111 . Memory 112 may also include non-volatile random access memory.
  • Memory 112 may also be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically programmable Erase programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • Volatile memory may be random access memory (RAM), which acts as an external cache.
  • Many forms of RAM are available, for example, static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM).
  • a computing chip is a processor suitable for executing deep learning algorithms, such as graphics processing unit (GPU), tensor processing unit (TPU), neural network processing unit (NPU), Deep learning processing unit (DPU).
  • the device 110 may include at least two computing chips of the same or different types.
  • the device 110 may include two GPUs (graphics processor 1131 and graphics processor 1132 ) as shown in FIG. 1 , or may include two NPUs, and can also include a GPU and an NPU.
  • Multiple computing chips can execute deep learning-related algorithms in parallel, for example, neural network training and inference, and data transmission can be accomplished through the bus using aggregated communication technology between computing chips.
  • the system 100 may further include multiple devices, such as the device 110 and the device 120 , wherein the structures of other devices are similar to the device 110 .
  • Communication between different devices of the system 100 is carried out through a network.
  • the network includes wired or wireless transmission methods, wherein the wired transmission method includes data transmission in the form of ether, optical fiber, etc., and the wireless transmission method includes mobile hotspot (Wi-Fi), Bluetooth, infrared and other transmission methods.
  • one or more switches and/or routers may be used to implement communication processing between multiple nodes.
  • the device 110 can cooperate with other devices to complete the computing tasks involved in deep learning.
  • the device 110 may include a processor 111, a memory 112, a communication interface 114, and at least one computing chip.
  • the computing chips of different devices can execute deep learning related algorithms in parallel.
  • the computing chips in the same device use the aggregated communication technology to complete data transmission through the bus, and the computing chips of different devices use the aggregated communication technology to complete data transmission through the network.
  • FIG. 1 is only an example of the system architecture used by the method provided in this application, and does not constitute a limitation to the embodiments of the present application.
  • the synchronization operation is used to synchronize all processes in the communication domain.
  • the process that performs the synchronization operation must wait for all the processes to complete the synchronization operation before the process continues to execute.
  • FIG. 2 is a schematic diagram of the data flow of the data migration operations provided by the present application. As shown in the figure: the first on the left is broadcast, which sends data 1 of node 1 to node 2 and node 3 respectively; the second on the left is gather, which sends data 1 of node 1, data 2 of node 2 and data 3 of node 3 all to node 1; the third from the left is allgather, which sends data 1 of node 1 to node 2 and node 3, data 2 of node 2 to node 1 and node 3, and data 3 of node 3 to node 1 and node 2, so that finally each of node 1 to node 3 holds data 1 to data 3.
  • The last is scatter, which sends data 2 of node 1 to node 2 and data 3 of node 1 to node 3.
  • FIG. 3 is a schematic diagram of the data flow of the aggregation operations provided by this application. Taking summation as the arithmetic operation as an example, as shown in the figure: the first on the left is reduce, in which the data of node 1, node 2 and node 3 are added together and the result is placed on node 1.
  • the second from the left is allreduce, which can be regarded as reduce plus broadcast: the reduce result on node 1 is sent to all other nodes. The third from the left is reduce-scatter, which can be regarded as reduce plus scatter:
  • the data of node 1 is the sum of the first data of node 1 to node 3
  • the data of node 2 is the sum of the second data of node 1 to node 3
  • the data of node 3 is the sum of the third data of node 1 to node 3.
  • the most commonly used aggregated communication methods are reduce-scatter, allreduce and allgather. Therefore, this application provides a reduce-scatter method, an allreduce method and an allgather method applicable to sparse matrices, and combines these methods with recommendation algorithms. It should be noted that the aggregated communication methods provided in this application are also applicable to sparse-data aggregated communication scenarios in other algorithms.
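  • As an informal, single-process illustration of the aggregation operations above, the following NumPy sketch (a hypothetical example with three nodes, each holding a matrix, and summation as the aggregation operation) shows the results that reduce, allreduce and reduce-scatter would leave on each node:
```python
import numpy as np

# Hypothetical example: three nodes, each holding a 3x4 matrix; "sum" is the aggregation.
node_data = [np.full((3, 4), i + 1, dtype=np.int64) for i in range(3)]

# reduce: the root node (node 1) receives the element-wise sum of all nodes' data.
reduce_result = sum(node_data)                      # lives only on the root node

# allreduce: every node ends up with the same summed matrix (reduce plus broadcast).
allreduce_results = [reduce_result.copy() for _ in node_data]

# reduce-scatter: each node keeps only its own slice (here, one row group) of the sum.
slices = np.array_split(reduce_result, len(node_data), axis=0)
reduce_scatter_results = [slices[i] for i in range(len(node_data))]

print(reduce_result)
print(reduce_scatter_results[0])
```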
  • FIG. 4 is a schematic flowchart of a method for aggregated communication provided by this application, which can specifically complete the reduce-scatter operation. It can be performed by the device 110 shown in FIG. 1 alone, or by the device 110 shown in FIG. 1 together with other devices. The following describes the flow of the method by taking execution by the device 110 alone as an example. As shown in FIG. 4 , the specific method includes:
  • S401: The computing chip to be communicated compresses the matrix to be communicated.
  • the processor 111 on the device 110 issues an instruction, instructing the computing chip to be communicated to compress the matrix to be communicated on its own chip.
  • the compression method may adopt a matrix compression method known to those skilled in the art, such as any one of the following compression methods:
  • Mode 1: Row compression. Row compression compresses the original matrix to obtain a compressed matrix (row matrix) and a compressed vector (row offsets).
  • the compressed vector records the row numbers of the non-zero rows and the total number of rows of the matrix, and the compressed matrix records the data of the non-zero rows corresponding to those row numbers.
  • For example, for an original matrix with seven rows whose non-zero rows are rows 0, 3 and 5, the compressed vector after row compression is (0 3 5 7). Among them, 0 represents the row number corresponding to the first row of data (1 2 3) in the compressed matrix; similarly, 3 and 5 represent the row numbers corresponding to the second and third rows of data in the compressed matrix, respectively; and 7 represents the total number of rows of the original matrix.
  • the row compression method is lossless compression, and the compressed data can be stored continuously or separately.
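  • A minimal sketch of the row compression described above (the function names are illustrative, not from the patent):
```python
import numpy as np

def row_compress(matrix):
    """Row compression: keep only the non-zero rows plus a compressed vector of
    their row numbers, terminated by the total row count of the original matrix."""
    nonzero_rows = [i for i in range(matrix.shape[0]) if np.any(matrix[i])]
    row_matrix = matrix[nonzero_rows]                          # compressed matrix
    row_offsets = np.array(nonzero_rows + [matrix.shape[0]])   # compressed vector
    return row_matrix, row_offsets

def row_decompress(row_matrix, row_offsets, n_cols):
    """Inverse operation, demonstrating that row compression is lossless."""
    total_rows = int(row_offsets[-1])
    out = np.zeros((total_rows, n_cols), dtype=row_matrix.dtype)
    out[row_offsets[:-1]] = row_matrix
    return out

# Hypothetical 7-row matrix whose non-zero rows are 0, 3 and 5, as in the example above.
m = np.zeros((7, 3), dtype=np.int64)
m[0] = [1, 2, 3]
m[3] = [4, 5, 6]
m[5] = [7, 8, 9]
rm, ro = row_compress(m)
assert np.array_equal(row_decompress(rm, ro, 3), m)   # lossless round trip
print(ro)  # [0 3 5 7]
```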
  • Mode 2: Coordinate (COO) compression. The COO compression method represents each non-zero element in the original matrix as a (row number, column number, value) triple; these triples are stored as three vectors, namely a row number vector, a column number vector and a numerical vector.
  • the row number vector stores the row numbers of non-zero elements
  • the column number vector stores the column numbers of non-zero elements
  • the value vector stores the values of non-zero elements
  • the data positions in the three vectors correspond one-to-one.
  • For example, for a matrix C whose non-zero elements are 1 at (row 0, column 2), 2 at (row 1, column 0), 3 at (row 2, column 1) and 4 at (row 2, column 2), the row number vector is (0 1 2 2), the column number vector is (2 0 1 2), and the numerical vector is (1 2 3 4).
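  • A minimal COO compression sketch matching the three vectors above (illustrative code, not from the patent):
```python
import numpy as np

def coo_compress(matrix):
    """COO compression: each non-zero element becomes a (row, column, value) triple,
    stored as three parallel vectors."""
    rows, cols = np.nonzero(matrix)
    vals = matrix[rows, cols]
    return rows, cols, vals

# Hypothetical matrix C with non-zeros at (0,2)=1, (1,0)=2, (2,1)=3, (2,2)=4.
C = np.array([[0, 0, 1],
              [2, 0, 0],
              [0, 3, 4]])
r, c, v = coo_compress(C)
print(r)  # [0 1 2 2]
print(c)  # [2 0 1 2]
print(v)  # [1 2 3 4]
```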
  • Mode 3: Compressed sparse row (CSR) format compression.
  • the CSR compression method uses three types of data to represent the non-zero elements of the original matrix, namely a numerical vector, a column number vector and a row offset vector.
  • the numerical vector and the column number vector are consistent with the COO compression method, and the first data in the row offset vector represents the position of the first non-zero element of the first row among all non-zero elements.
  • For example, for matrix C, the non-zero data in the first row is 1, which is the first of the non-zero elements and whose position is 0, so the first data in the row offset vector is 0. The second data in the row offset vector represents the position of the first non-zero element of the second row among all non-zero elements; for matrix C, the non-zero data in the second row is 2, which is the second of the non-zero elements and whose position is 1, so the second data in the row offset vector is 1. And so on; the last data in the row offset vector is the total number of all non-zero elements.
  • For matrix C, the row offset vector is (0 1 2 4).
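  • A minimal CSR compression sketch for the same matrix C (illustrative; scipy.sparse provides an equivalent csr_matrix, but the loop below mirrors the description above):
```python
import numpy as np

def csr_compress(matrix):
    """CSR compression: numerical vector, column number vector and row offset vector."""
    vals, cols, offsets = [], [], [0]
    for row in matrix:
        nz = np.nonzero(row)[0]
        cols.extend(nz.tolist())
        vals.extend(row[nz].tolist())
        offsets.append(len(vals))   # position of the next row's first non-zero element
    return np.array(vals), np.array(cols), np.array(offsets)

# Same hypothetical matrix C as in the COO example above.
C = np.array([[0, 0, 1],
              [2, 0, 0],
              [0, 3, 4]])
v, c, o = csr_compress(C)
print(v)  # [1 2 3 4]
print(c)  # [2 0 1 2]
print(o)  # [0 1 2 4]
```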
  • Exemplarily, in Example 1, it is assumed that the device 110 includes 4 GPUs to be communicated, namely GPU1, GPU2, GPU3 and GPU4, and the matrix to be communicated on each GPU is:
  • the matrices to be communicated on GPU1 to GPU4 after row compression are:
  • the communication vectors for each GPU are:
  • S402: Communication channels are established between the computing chips to be communicated, so that the number of communication channels is equal to the number of computing chips to be communicated, and each communication channel transmits part of the data of the matrix to be communicated.
  • the data of the matrix to be communicated before compression can be evenly distributed to each communication channel according to the number of rows.
  • the number of rows of the matrix to be communicated to be transmitted by each communication channel may be dynamically planned according to the actual data transmission volume of the communication channel.
  • For example, the first communication channel transmits the data of rows 0 to 3 of the matrix to be communicated before compression, the second communication channel transmits the data of rows 4 to 7 of the matrix to be communicated before compression, and so on.
  • S403: The processor 111 determines the root node from the computing chips to be communicated.
  • each communication channel has a root node, which is used to receive and send data from other computing chips in the communication channel and perform arithmetic operations.
  • the root node may be designated by the user, or may be selected by the processor 111 according to the performance of each computing chip, such as the number of cores, core frequency, storage speed, video memory bit width, and capacity.
  • the application does not limit the selected method.
  • GPU1 can be selected as the root node of communication channel 1, GPU2 as the root node of communication channel 2, and so on.
  • the root node receives data of other computing chips to be communicated and obtains an aggregate operation result.
  • the computing chips to be communicated that are not the root node scan the compressed matrix, and send the data and row numbers in the matrix that belong to the communication channel to the root node of the communication channel.
  • communication channel 1 transmits the data of rows 0 to 3 of the matrix to be communicated before compression
  • GPU1 is the root node of communication channel 1
  • GPU2 scans the compressed vector and can see that, among the first four rows of data in the original matrix, the 1st, 2nd and 3rd rows have data to be communicated.
  • GPU2 will compress the data ⁇ 3,3,3,3 ⁇ , ⁇ 4,4,4,4 ⁇ , ⁇ 3,3,3,3 ⁇ and row number 1 corresponding to rows 1,2,3 in the compressed matrix, 2,3 are sent to GPU1.
  • After GPU3 and GPU4 scan their compressed vectors, it can be seen that there is no data to be communicated in the first four rows of data of the original matrix, so nothing needs to be sent.
  • the sending methods of other communication channels can be deduced by analogy.
  • the root node can create a new matrix area in the storage space to receive the data sent by the computing chip of the non-root node.
  • the number of rows of the new matrix is equal to the number of rows transmitted by the communication channel. Then the root node aggregates the data according to the row number and the data of the corresponding row of the original matrix before compression, and obtains the final aggregate communication reduce-scatter result.
  • the root node can directly calculate the received data according to the row number corresponding to the row data of the original matrix before compression, without creating a new storage area. Since the root node can receive and calculate data sent by multiple computing chips at the same time, it is necessary to lock the data to avoid performing multiple calculations on one data at the same time.
  • the root node can receive one data at a time, and can receive new data only after each data is calculated.
  • the root node may also receive a row of data at the same time, and receive new data only after each row of data is calculated.
  • the method for aggregating communication in the present application can compress the communication matrix before sending data, reduce the transmission of invalid 0s in each channel in the reduce-scatter operation, and improve the efficiency of the reduce-scatter operation.
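  • The following single-process sketch imitates the data flow of the reduce-scatter method of FIG. 4 for the case where the rows are evenly divided among the channels (function and variable names are illustrative; a real implementation would transmit the compressed rows over the bus or network instead of reading them from a shared list):
```python
import numpy as np

def reduce_scatter_compressed(chip_matrices, num_chips):
    """Sketch of the compressed reduce-scatter: channel i is rooted at chip i and
    owns an equal slice of the row range; only non-zero rows (row data plus row
    number) are 'sent' to the channel root, which aggregates them by row number."""
    total_rows, _ = chip_matrices[0].shape
    rows_per_channel = total_rows // num_chips
    results = []
    for root in range(num_chips):
        lo, hi = root * rows_per_channel, (root + 1) * rows_per_channel
        # The root starts from its own slice of the original (uncompressed) matrix.
        acc = chip_matrices[root][lo:hi].copy()
        for sender in range(num_chips):
            if sender == root:
                continue
            m = chip_matrices[sender]
            for r in range(lo, hi):
                if np.any(m[r]):          # a non-zero row belonging to this channel
                    acc[r - lo] += m[r]   # root aggregates by row number
        results.append(acc)
    return results

# Hypothetical data: 4 chips, each holding a sparse 16x4 matrix.
chips = [np.zeros((16, 4)) for _ in range(4)]
chips[1][1] = 3.0
chips[1][2] = 4.0
out = reduce_scatter_compressed(chips, 4)   # out[0] is the slice kept by chip 0
```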
  • FIG. 5 is a schematic flowchart of another aggregated communication method provided by the present application, which may be executed by the device 110 shown in FIG. 1 alone, or by the device 110 shown in FIG. 1 in conjunction with other devices.
  • the following describes the flow of the method by taking the single execution of the device 110 as an example, as shown in FIG. 5 , the specific method includes:
  • S501: The computing chip to be communicated compresses the matrix to be communicated, which is similar to S401.
  • S502: A communication channel is established between the computing chips to be communicated, which is similar to S402.
  • S503: The processor 111 determines the root node from the computing chips to be communicated.
  • all communication channels can specify the same root node for receiving and sending data from other computing chips and completing arithmetic operations.
  • the determination method of the root node is similar to the determination method of S403.
  • GPU1 can be selected as the root node.
  • the root node receives data of other computing chips to be communicated and obtains an aggregate operation result.
  • the root node receives data from other computing chips to be communicated and completes the calculation.
  • the root node of all communication channels is the same computing chip, so one computing chip finally receives and aggregates the data of all the computing chips to be communicated.
  • the calculation result of the root node GPU1 is as follows:
  • the root node sends data to other computing chips to be communicated.
  • the root node sends the matrix obtained by S504 to other computing chips to be communicated in the form of broadcast to complete the allreduce operation.
  • the final result of allreduce is:
  • the method for aggregating communication in the present application can compress the communication matrix before sending data, reduce the transmission of invalid 0s in each channel in the allreduce operation, and improve the efficiency of the allreduce operation.
  • FIG. 6 is a schematic flowchart of another aggregated communication method provided by the present application, which can be executed by the device 110 shown in FIG. 1 alone, or by the device 110 shown in FIG. 1 together with other devices.
  • the following describes the flow of the method by taking the single execution of the device 110 as an example, as shown in FIG. 6 , the specific method includes:
  • S601: The computing chip to be communicated compresses the matrix to be communicated, which is similar to S401.
  • S602: A communication channel is established between the computing chips to be communicated, which is similar to S402.
  • S603: The processor 111 determines the root node from the computing chips to be communicated.
  • all communication channels can specify the same root node for receiving and sending data from other computing chips.
  • the determination method of the root node is similar to the determination method of S403.
  • GPU1 can be selected as the root node.
  • the root node receives data of other computing chips to be communicated and obtains a combined result.
  • a new matrix is created in the memory space of the chip to store the final aggregated communication result.
  • the number of rows of the new matrix is equal to the product of the total number of rows of the compressed vector and the number of communication channels. Then the root node fills each non-zero row of the compressed matrix into the new matrix in turn according to the row number of the compressed vector. Rows with no padding data at the end are padded with 0s.
  • GPU1 first creates a new matrix with 64 rows and 4 columns in the memory space of the chip, and then fills the non-zero rows in S603 into the new matrix in turn according to the compressed vector. For example, if the first row {1,1,1,1} has row number 0 in the compressed vector, then {1,1,1,1} is filled into the 0th row of the new matrix; if the second row {1,1,1,1} has row number 5 in the compressed vector, then {1,1,1,1} is filled into the 5th row of the new matrix; and so on. Finally, the rows without filled data are filled with 0.
  • the result for GPU1 is:
  • the root node sends data to other computing chips to be communicated.
  • the root node sends the matrix obtained in S604 to other computing chips to be communicated in the form of broadcast to complete the allgather operation.
  • the method for aggregating communication in the present application can compress the communication matrix before sending data, reduce the transmission of invalid 0s in the allgather operation, and improve the efficiency of the allgather operation.
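  • A single-process sketch of the compressed allgather above, in which the result is the concatenation of every chip's decompressed matrix (illustrative code; the broadcast of the result back to the other chips is omitted):
```python
import numpy as np

def allgather_compressed(row_matrices, row_offsets_list, cols):
    """Sketch of the compressed allgather: the root rebuilds each chip's dense
    matrix from its compressed rows and row numbers and stacks them into one
    result whose row count is (rows per chip) x (number of chips)."""
    num_chips = len(row_matrices)
    total_rows = int(row_offsets_list[0][-1])   # e.g. 16 rows per chip in Example 1
    out = np.zeros((total_rows * num_chips, cols), dtype=row_matrices[0].dtype)
    for chip, (rm, offsets) in enumerate(zip(row_matrices, row_offsets_list)):
        base = chip * total_rows                 # block of the result owned by this chip
        for i, row_no in enumerate(offsets[:-1]):
            out[base + row_no] = rm[i]           # rows without data stay 0
    return out                                   # the root then broadcasts this matrix

# Hypothetical compressed inputs for two chips, 16 logical rows each.
rms = [np.array([[1, 1, 1, 1]]), np.array([[2, 2, 2, 2], [3, 3, 3, 3]])]
ros = [np.array([0, 16]), np.array([5, 9, 16])]
result = allgather_compressed(rms, ros, 4)   # 32 rows, mostly zero
```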
  • FIG. 7 is a schematic flowchart of another method for aggregating communication provided by this application.
  • Specifically, the allreduce operation may be completed; it may be performed by the device 110 shown in FIG. 1 alone, or by the device 110 shown in FIG. 1 and other devices working together.
  • the following describes the flow of the method by taking the single execution of the device 110 as an example, as shown in FIG. 7 , the specific method includes:
  • the computing chip to be communicated compresses the matrix to be communicated, which is similar to S401.
  • the processor 111 obtains the maximum size of the compression matrix.
  • the processor 111 can directly call the interface of the allreduce function in the communication library to obtain the number of rows of the compressed matrix on each computing chip to be communicated.
  • each computing chip to be communicated may also traverse the compression matrix and send the number of rows to the processor 111 .
  • each computing chip to be communicated can also directly read the number of rows from the compressed vector and send it to the processor 111 .
  • the processor 111 pads with 0 the compressed matrices of the computing chips whose number of rows does not reach the maximum value, so that their number of rows reaches the maximum value.
  • the number of rows from GPU1 to GPU4 are 4, 6, 5, and 1 respectively, and the number of rows in GPU2 is the largest.
  • the remaining GPUs pad their matrices with data 0 until the number of rows reaches 6. The result is as follows:
  • the processor 111 calls the allgather interface.
  • The processor 111 may call the allgather interface in the communication library to send the compressed matrix and compressed vector on each GPU to be communicated to the remaining computing chips, so that each computing chip can obtain all the compressed matrices. For Example 1, the following is obtained:
  • the computing chip to be communicated performs an operation according to the compressed vector.
  • a new matrix is created in the memory space of the chip to store the final aggregated communication result.
  • the number of rows of the new matrix is equal to the total number of rows of the compressed vector.
  • the computing chip sequentially fills each non-zero row of the compressed matrix into the new matrix according to the row numbers in the compressed vector. If two rows of data in the compressed matrices correspond to the same row number, the computing chip fills the result of operating on those two rows of data into the new matrix according to the operation method.
  • GPU1 first creates a new matrix with 16 rows and 4 columns in the memory space of the chip, and then sequentially fills the non-zero rows in S703 into the new matrix according to the compression vector. For example, if the first row ⁇ 1,1,1,1 ⁇ has row number 0 in the compressed vector, then fill ⁇ 1,1,1,1 ⁇ into the 0th row of the new matrix, and the second row ⁇ 1,1 ,1,1 ⁇ in the compressed vector, the row number is 5, then fill ⁇ 1,1,1,1 ⁇ into the 5th row of the new matrix, and so on, and finally get the allreduce results of all GPUs:
  • the method of aggregated communication in this application can directly call an existing communication library, so the operation is simple; and by compressing the matrix, the transmission of invalid 0s in each channel in the allreduce operation can be reduced, improving the efficiency of the allreduce operation.
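  • A single-process sketch of this padded-allgather approach (illustrative; the call to a real communication library's allgather interface is emulated by simply collecting every chip's padded matrix):
```python
import numpy as np

def allreduce_via_allgather(row_matrices, row_offsets_list, total_rows, cols):
    """Sketch of the allreduce of FIG. 7: pad the compressed matrices to the same
    number of rows, exchange them with an allgather, then each chip rebuilds the
    dense allreduce result by accumulating rows according to their row numbers."""
    # Pad every compressed matrix with zero rows up to the maximum row count.
    max_rows = max(rm.shape[0] for rm in row_matrices)
    padded = [np.vstack([rm, np.zeros((max_rows - rm.shape[0], cols), rm.dtype)])
              for rm in row_matrices]
    gathered = list(zip(padded, row_offsets_list))   # emulated allgather result

    # Each chip rebuilds the dense result from row numbers (zero padding adds nothing).
    out = np.zeros((total_rows, cols), dtype=row_matrices[0].dtype)
    for rm, offsets in gathered:
        for i, row_no in enumerate(offsets[:-1]):    # last entry is the total row count
            out[row_no] += rm[i]
    return out

# Hypothetical compressed inputs for two chips, 16 logical rows.
rms = [np.array([[1.0, 1, 1, 1]]), np.array([[2.0, 2, 2, 2], [3.0, 3, 3, 3]])]
ros = [np.array([0, 16]), np.array([0, 9, 16])]
print(allreduce_via_allgather(rms, ros, 16, 4)[0])   # [3. 3. 3. 3.]
```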
  • the embedding operation is mainly used to use a matrix to convert a sparse vector (when the number of elements with a value of 0 in a vector is more than the number of non-zero elements, it is called a sparse vector) into a dense vector.
  • the transformed vector is called the embedding vector, and the matrix used for the transformation is called the embedding table.
  • Equation 1 converts two 5-dimensional feature vectors into two 3-dimensional vectors:
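  • As a minimal numeric illustration of the embedding operation, assuming a hypothetical 5x3 embedding table and two one-hot 5-dimensional feature vectors (the concrete values in Equation 1 may differ):
```python
import numpy as np

# Hypothetical 5x3 embedding table (5 table rows, embedding dimension 3).
embedding_table = np.arange(15, dtype=float).reshape(5, 3)

# Two sparse (one-hot) 5-dimensional feature vectors selecting table rows 1 and 3.
features = np.array([[0, 1, 0, 0, 0],
                     [0, 0, 0, 1, 0]], dtype=float)

# The embedding operation is a matrix product: each sparse vector picks out one
# dense 3-dimensional row of the embedding table.
embedding_vectors = features @ embedding_table
print(embedding_vectors)   # rows 1 and 3 of the embedding table
```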
  • each iteration of the training of the recommendation model can be composed of two processes: forward propagation and back propagation.
  • the forward propagation is used to obtain the result of the input data after the transformation of the embedding table and the calculation of the recommendation model.
  • Backpropagation is used to get the recommended model and the update of the embedding table based on the difference between the calculated result and the actual value.
  • the size of the embedding table has reached the level of 100 GB to TB, and the 10 TB level is coming soon. Since the data volume of the complete embedding table is too large, it is generally stored jointly by multiple computing chips, and each computing chip stores a part of the data of the embedding table. Therefore, in the process of forward propagation, the embedding vector to be used needs to be obtained by querying and retrieving the data of the corresponding rows from the embedding tables stored on multiple computing chips according to the query vector (batch), and the reduce-scatter or allreduce operation in aggregated communication can be used to combine the embedding vectors in this iteration. Finally, the embedding vector is input into the recommendation model for training.
  • each computing chip can only get a part of the updated value of the embedded vector. Therefore, in the process of backpropagation, the updated value of the embedded vector can be aggregated and sent to all computing chips through the allgather operation in the aggregated communication to obtain the updated value of the final embedded table.
  • FIG. 8 is a schematic flowchart of another method for aggregating communication provided by the present application, which may be executed by the device 110 shown in FIG. 1 , or may be executed jointly by the device 110 shown in FIG. 1 and other devices.
  • the following describes the flow of the method by taking the single execution of the device 110 as an example, as shown in FIG. 8 , and the specific method is as follows:
  • the computing chip obtains a query vector according to user characteristics and commodity characteristics.
  • the query vector stores the correspondence between the rows of the embedded table and the rows of the embedded vector.
  • the number of data in the query vector is the number of rows of the embedding vector, the position of each data is the row of the embedding vector, and the value of each data is the row of the embedding table. For example, when the query vector is {1,2,3,4}, the embedding vector is a 4-row matrix: the first data is 1, indicating that the first row of the embedding vector is the data of the first row of the embedding table; the second data is 2, indicating that the second row of the embedding vector is the data of the second row of the embedding table; and so on.
  • the computing chip obtains the embedding vector from the embedding table according to the query vector.
  • this step can be divided into two steps, including:
  • S8021: The computing chips obtain the matrix to be communicated on each computing chip according to the query vector.
  • Each computing chip can first create a matrix to be communicated with all data 0, and the number of rows of the matrix to be communicated is equal to the total number of data in the query vector. Then, each computing chip fetches the data of the corresponding row from the locally stored embedded table according to the data in the query vector to form a matrix to be communicated.
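  • A sketch of how one chip could build its matrix to be communicated from the query vector and its locally stored part of the embedding table (the names and the 0-based indexing are illustrative assumptions):
```python
import numpy as np

def build_matrix_to_communicate(query_vector, local_table, embed_dim):
    """One chip builds its matrix to be communicated, as described above.

    query_vector : embedding-table row number for each row of the embedding vector.
    local_table  : dict mapping the table row numbers stored on this chip to their
                   row data (this chip's shard of the full embedding table).
    Table rows not stored locally are left as 0, which keeps the matrix sparse and
    makes the row compression described earlier effective.
    """
    to_communicate = np.zeros((len(query_vector), embed_dim))
    for vec_row, table_row in enumerate(query_vector):   # 0-based rows in this sketch
        if table_row in local_table:
            to_communicate[vec_row] = local_table[table_row]
    return to_communicate

# Hypothetical shard: this chip stores only rows 2 and 3 of the embedding table.
local_table = {2: np.full(4, 2.0), 3: np.full(4, 3.0)}
query = [2, 4, 5, 4, 7, 2, 5, 8, 6, 9, 2, 7, 2, 4, 10, 9]
m = build_matrix_to_communicate(query, local_table, 4)
```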
  • In Example 2, it is assumed that the device 110 includes 4 GPUs to be communicated, namely GPU1, GPU2, GPU3 and GPU4, and the embedding tables stored by each GPU are:
  • the number before each GPU matrix represents the row number of the embedded table.
  • For example, GPU1 stores the 1st, 2nd and 3rd rows of the complete embedding table, and the data of these rows are {0,0,0,0}, {1,1,1,1} and {2,2,2,2} respectively.
  • GPU1 to GPU4 each first create a matrix of 16 rows and 4 columns with all data 0.
  • the 2nd row of the embedding table corresponds to the 1st, 6th, 11th, and 13th rows of the embedding vector
  • GPU1 fetches the 2nd row of data from the locally stored embedding table and fills it into the 1st, 6th, 11th and 13th rows of the matrix respectively.
  • the 4th line of the embedding table corresponds to the 2nd, 4th, and 14th lines of the embedding vector
  • the 5th line of the embedding table corresponds to the 3rd and 7th lines of the embedding vector
  • the 6th line of the embedding table corresponds to the 6th line of the embedding vector.
  • GPU2 takes the 4th row of data from the locally stored embedding table, fills in the 2nd, 4th, and 14th row of the matrix, takes out the 5th row of data, fills in the 3rd and 7th row of the matrix, and takes out the 6th row of data , fill in row 6 of the matrix, and so on.
  • the communication matrix for each GPU is obtained as:
  • the computing chip obtains the embedded vector by using the reduce-scatter or allreduce operation provided in this application.
  • When the reduce-scatter operation provided in FIG. 4 of this application is used, each computing chip to be communicated obtains a part of the values of the embedding vector and inputs it into the recommendation model on that computing chip for the next step of calculation.
  • the allreduce operation provided in Figure 5 or Figure 7 of this application can be used to obtain a complete embedding vector and input the recommendation model for the next step of calculation.
  • the computing chip inputs the embedded vector into the recommendation model for calculation, and obtains a calculation result.
  • the computing chip obtains the updated value of the embedded table according to the updated value of the embedded vector.
  • this step can be divided into two steps, including:
  • the computing chip uses the allgather operation provided by this application to obtain the updated value of the complete embedded vector.
  • each computing chip can only get a part of the updated value of the embedding vector, and the allgather operation provided in FIG. 6 of this application can be used to obtain the updated value of the full embedding vector.
  • Continuing Example 2, it is assumed that the updated values of the embedding vectors obtained on the four GPUs are:
  • the computing chip obtains the update value of the embedded table according to the query vector.
  • the updated value of each row in the embedding table is obtained from the updated value of the embedding vector.
  • If a row of the embedding table corresponds to multiple rows of the embedding vector in the query vector, the data of all those rows of the embedding vector are added together as the updated value of that row of the embedding table.
  • the query vector is: ⁇ 2,4,5,4,7,2,5,8,6,9,2,7,2,4,10,9 ⁇
  • the second row of the embedding table corresponds to rows 1, 6, 11 and 13 of the embedding vector.
  • GPU1 takes out the data of rows 1, 6, 11 and 13 from the obtained updated value of the embedding vector, namely {0.1, 0.1, 0.1, 0.1}, {0.3, 0.3, 0.3, 0.3}, {0.8, 0.8, 0.8, 0.8} and {0.1, 0.1, 0.1, 0.1}, and adds them together as the updated value of row 2 of the embedding table, {1.3, 1.3, 1.3, 1.3}.
  • the updated value of the embedded table is finally obtained:
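  • A sketch of this accumulation step, in which embedding-vector update rows that map to the same embedding-table row (according to the query vector) are summed (illustrative names; positions are 0-based here, whereas the example above counts from 1):
```python
import numpy as np
from collections import defaultdict

def embedding_table_update(query_vector, embed_vector_update):
    """Fold embedding-vector row updates back into embedding-table row updates:
    rows of the embedding vector that refer to the same table row are summed."""
    table_update = defaultdict(lambda: np.zeros(embed_vector_update.shape[1]))
    for vec_row, table_row in enumerate(query_vector):
        table_update[table_row] += embed_vector_update[vec_row]
    return dict(table_update)

# Hypothetical numbers echoing the example above: table row 2 appears at the
# 0-based positions 0, 5, 10 and 12 of the query vector.
query = [2, 4, 5, 4, 7, 2, 5, 8, 6, 9, 2, 7, 2, 4, 10, 9]
upd = np.zeros((16, 4))
upd[[0, 5, 10, 12]] = [[0.1] * 4, [0.3] * 4, [0.8] * 4, [0.1] * 4]
print(embedding_table_update(query, upd)[2])   # [1.3 1.3 1.3 1.3]
```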
  • the computing chip updates the embedded table stored in the computing chip according to the updated value of the embedded table.
  • the method of aggregated communication in the present application can improve the efficiency of aggregated communication and reduce the time used in the training process of the recommendation model by compressing the matrix in the process of embedding vector combination and embedding table updating.
  • FIG. 9 is a schematic flowchart of another method for aggregating communication provided by the present application, which may be executed by the device 110 shown in FIG. 1 or jointly executed by the device 110 shown in FIG. 1 and other devices.
  • the flow of the method is described below by taking the single execution of the device 110 as an example.
  • steps S902 and S906 are different from those in the method shown in FIG. 8 , and the other steps are similar to those in FIG. 8 .
  • The specific method is as follows:
  • the computing chip obtains a query vector according to user characteristics and commodity characteristics.
  • the computing chip obtains the embedding vector from the embedding table according to the query vector.
  • this step can be divided into three steps, including:
  • the repeated elements in the query vector are removed, and the recovery vector of the query vector is used to record the position where each data in the compressed query vector appears in the query vector before compression.
  • the query vector is: ⁇ 2,4,5,4,7,2,5,8,6,9,2,7,2,4,10,9 ⁇
  • the compressed query vector is ⁇ 2,4,5,7,8,6,9,10 ⁇
  • the recovery vector of the query vector is {1,6,11,13}, {2,4,14}, {3,7}, {5,12}, {8}, {9}, {10,16}, {15}, indicating that 2 appears at positions 1, 6, 11 and 13 of the query vector before compression, 4 appears at positions 2, 4 and 14 of the query vector before compression, and so on.
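  • A sketch of the query-vector compression and recovery-vector construction described above (illustrative code; positions are 1-based to match the example):
```python
def compress_query_vector(query_vector):
    """Deduplicate the query vector and build its recovery vector: for each kept
    value, record every 1-based position where it appeared before compression."""
    compressed = []
    positions = {}
    for pos, value in enumerate(query_vector, start=1):
        if value not in positions:
            positions[value] = []
            compressed.append(value)
        positions[value].append(pos)
    recovery = [positions[v] for v in compressed]
    return compressed, recovery

query = [2, 4, 5, 4, 7, 2, 5, 8, 6, 9, 2, 7, 2, 4, 10, 9]
c, r = compress_query_vector(query)
print(c)   # [2, 4, 5, 7, 8, 6, 9, 10]
print(r)   # [[1, 6, 11, 13], [2, 4, 14], [3, 7], [5, 12], [8], [9], [10, 16], [15]]
```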
  • the computing chip obtains a matrix to be communicated for each computing chip according to the compressed query vector, which is similar to S8021.
  • the matrix to be communicated for each GPU can be obtained as:
  • the computing chip obtains the compressed embedding vector by using the allreduce operation provided by this application.
  • the compressed embedding vector can be obtained by using the allreduce operation provided in FIG. 5 or FIG. 7 of the present application.
  • the resulting compressed embedding vector is:
  • the computing chip restores the compressed embedding vector according to the restored vector of the query vector.
  • the computing chip may first create a new matrix for storing the final embedding vector, and the number of rows of the new matrix is equal to the total number of data of the query vector before compression. Then the computing chip sequentially determines the position of each row of data in the compressed embedded vector in the recovery vector of the query vector according to the row number, and further determines the position of the data of this row in the original query vector. Finally, the computing chip fills this row of data into the final embedding vector matrix according to the position in the original query vector.
  • GPU1 first creates a new embedding vector matrix with 16 rows and 4 columns in the memory space of the chip, and then, according to the recovery vector, sequentially determines the positions in the original query vector to which each row of data in the compressed embedding vector corresponds. For example, the row number of the first row {1,1,1,1} in the compressed embedding vector is 1, so it corresponds to the first data in the recovery vector of the query vector, whose content is {1,6,11,13}, indicating that this row of data corresponds to the 1st, 6th, 11th and 13th rows of the original query vector. The computing chip then fills {1,1,1,1} into rows 1, 6, 11 and 13 of the new matrix, and so on; finally, the allreduce results of all GPUs can be obtained:
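  • A sketch of this restoration step, expanding the compressed embedding vector back to the full embedding vector with the recovery vector (illustrative; the concrete values follow the example above):
```python
import numpy as np

def restore_embedding_vector(compressed_embedding, recovery_vector, embed_dim):
    """Expand the compressed embedding vector to the full embedding vector.

    compressed_embedding : one row per entry of the compressed query vector.
    recovery_vector      : for each such row, the 1-based positions it occupies
                           in the original (uncompressed) query vector.
    """
    total_rows = max(p for positions in recovery_vector for p in positions)
    full = np.zeros((total_rows, embed_dim), dtype=compressed_embedding.dtype)
    for row, positions in enumerate(recovery_vector):
        for p in positions:
            full[p - 1] = compressed_embedding[row]   # copy the row to every position
    return full

# Hypothetical values: row 0 of the compressed embedding vector is {1,1,1,1} and
# occupies positions 1, 6, 11 and 13 of the original query vector.
compressed = np.array([[1, 1, 1, 1]] + [[i, i, i, i] for i in range(2, 9)])
recovery = [[1, 6, 11, 13], [2, 4, 14], [3, 7], [5, 12], [8], [9], [10, 16], [15]]
full = restore_embedding_vector(compressed, recovery, 4)
```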
  • the computing chip inputs the embedded vector into the recommendation model for calculation, and obtains a calculation result.
  • the computing chip obtains the updated value of the embedded table according to the updated value of the embedded vector.
  • this step can be divided into two steps, including:
  • the computing chip removes duplicate data in the update value of the embedded vector according to the query vector.
  • each computing chip can only get a part of the updated value of the embedded vector.
  • the position of each data in the query vector is the row of the embedded vector
  • the size of each data is the row of the embedded table.
  • the updated values of the embedding-vector rows whose corresponding data values in the query vector are the same are added together, written into any one of those rows, and the other rows are set to 0.
  • the update value of the embedding vector is:
  • the query vector is ⁇ 2, 4, 5, 4, 7, 2, 5, 8, 6, 9, 2, 7, 2, 4, 10, 9 ⁇
  • the partial embedding vector obtained by GPU1 consists of rows 1-4 of the complete embedding vector
  • GPU1 traverses the 1st-4th elements {2,4,5,4} of the query vector; the values of the 2nd and 4th elements are both 4, so the data of rows 2 and 4 of the embedding vector on GPU1 can be added together, written into row 2, and row 4 assigned the value 0.
  • the update value of the embedding vector is obtained:
  • since elements with the same value in the query vector correspond to the same row of the embedding table, the multiple rows of the embedding vector corresponding to them can be converted into one row in advance, which increases the sparsity of the matrix to be communicated in the next step and improves the aggregated communication efficiency.
  • the more repeated elements the query vector contains, the greater the improvement in communication efficiency.
  • the computing chip uses the allgather operation provided by this application to obtain the complete update value of the embedding vector, which is similar to S8061.
  • the computing chip obtains the update value of the embedding table according to the query vector, which is similar to S8062.
  • the computing chip updates the embedding table stored on the computing chip according to the update value of the embedding table.
  • the method of aggregated communication in the present application can compress the embedding vector that needs to be communicated to the greatest extent, further reducing the amount of data transmitted in communication, improving the efficiency of aggregated communication, and reducing the time used in the training process of the recommendation model.
  • the present application also provides a system for aggregating communication, which may be the system 100 shown in FIG. 1 .
  • the system for aggregated communication includes at least a first computing chip and a second computing chip, wherein the first computing chip communicates with the second computing chip through at least one communication channel.
  • the first computing chip and the second computing chip may be the computing chip 1131 and the computing chip 1132 shown in FIG. 1 , respectively, or the computing chip 1131 and the computing chip 1231 shown in FIG. 1 , respectively.
  • the system for aggregated communication is used to implement the operation steps of the method performed by the corresponding subject in the above-mentioned aggregated communication method. Through the above system for aggregated communication, the amount of data transferred between computing chips can be reduced, and the efficiency of aggregated communication can be improved.
  • the system can also be used in the scenario of recommending commodities to users, reducing the computation time and improving the user experience.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware or any other combination.
  • the above-described embodiments may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, a data center, or the like that contains one or more sets of available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media.
  • the semiconductor medium may be a solid state drive (SSD).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Mathematical Physics (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • Economics (AREA)
  • Artificial Intelligence (AREA)
  • Marketing (AREA)
  • Molecular Biology (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Game Theory and Decision Science (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Multimedia (AREA)
  • Computer And Data Communications (AREA)

Abstract

A collective communication method and system, and a computer device. The method is applied to a collective communication system that includes at least a first computing chip and a second computing chip, the first computing chip communicating with the second computing chip through at least one communication channel. The method includes: the first computing chip compresses first data and sends the compressed first data to the second computing chip through the communication channel; and the second computing chip performs an operation according to the compressed first data, thereby improving the efficiency of collective communication.

Description

一种聚合通信的方法、系统和计算机设备
本申请要求于2021年04月21日提交中国知识产权局、申请号为202110431626.8、申请名称为“一种聚合通信的方法、系统和计算机设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机领域,尤其涉及一种聚合通信的方法、系统和计算机设备。
背景技术
随着互联网个性化时代的到来,推荐算法可以在用户购买意图不明确的情况下,利用深度学习算法结合用户特征和商品特征,从海量的商品中找到用户感兴趣的商品,提升用户购买效率和产品体验,从而成为了目前众多互联网头部企业的盈利核心。推荐算法通常使用多张GPU并行完成计算任务,并结合聚合通信技术在多个处理器之间完成数据的迁移。目前,
Figure PCTCN2022075620-appb-000001
推出了两套硬件系统
Figure PCTCN2022075620-appb-000002
Figure PCTCN2022075620-appb-000003
以及与硬件系统相匹配的GPU聚合通信库
Figure PCTCN2022075620-appb-000004
用于支持GPU之间的聚合通信。然而在实际应用中,由于用户特征和商品特征的数据中包含着大量数值为0的数据,稀疏度极高,现有的聚合通信技术对于稀疏数据存在大量无效的0的传输,降低了通信效率。因此如何提供一种提高稀疏数据的聚合通信效率的方法成为亟待解决的技术问题。
发明内容
本申请提供了一种聚合通信的方法、系统和计算机设备,以此提供一种高效的稀疏数据的聚合通信的方法,提高聚合通信系统处理需要在多个计算芯片之间完成数据传输的算法的能力。
第一方面,提供一种聚合通信的方法,可以应用于聚合通信系统,该系统至少包括第一计算芯片和第二计算芯片,其中,第一计算芯片通过至少一个通信通道与第二计算芯片通信,所述方法包括:第一计算芯片压缩第一数据并且通过通信通道将压缩后的第一数据发送给第二计算芯片,然后由所述第二计算芯片根据所述压缩后的第一数据进行运算。通过上述方法,第一计算芯片可以将原始数据压缩后发送给第二芯片,减少第一芯片与第二芯片之间的数据的传输量。
在一种可能的实现方式中,第一计算芯片通过一个通信通道与第二计算芯片通信,其中,第二计算芯片为通信通道的根节点,则第二计算芯片根据压缩后的第一数据进行运算的方法可以包括:第二计算芯片将压缩后的第一数据与第二数据聚合,其中,第二数据为所述第二计算芯片上待通信的数据;第二计算芯片将聚合结果发送至第一计算芯片。通过上述方法,第一芯片可以通过一个通道将压缩后的数据发送给第二计算芯片,由第二芯片聚合数据后发送回给第一芯片,得到聚合通信中的allreduce操作的结果,并且提高了 allreduce操作的执行时间。
在另一种可能的实现方式中,聚合通信系统还包括处理器,第一计算芯片通过一个通信通道与第二计算芯片通信,其中,第二计算芯片为通信通道的根节点,则该方法包括:第一计算芯片压缩第一数据;第二计算芯片压缩第二数据。处理器获取所述压缩的第一数据和压缩的第二数据的大小,并且调用
Figure PCTCN2022075620-appb-000005
通信库中的allgather接口将压缩后的第一数据发送给第二计算芯片,压缩后的第二数据发送给第一计算芯片。最后,第一计算芯片将压缩后的第二数据与第一数据聚合;第二计算芯片将压缩后的第一数据与第二数据聚合。通过上述方法,处理器可以调用已有的通信库中的接口完成allreduce操作,提高了allreduce操作的效率而无需更改大量的代码。
在另一种可能的实现方式中,第一计算芯片也通过一个通信通道与第二计算芯片通信,其中,第二计算芯片为通信通道的根节点,则第二计算芯片根据压缩后的第一数据进行运算的方法可以包括:第二计算芯片将压缩后的第一数据与第二数据合并,其中,第二数据为所述第二计算芯片上待通信的数据;第二计算芯片将聚合结果发送至第一计算芯片。通过上述方法,第一芯片可以通过一个通道将压缩后的数据发送给第二计算芯片,由第二芯片合并数据后发送回给第一芯片,得到聚合通信中的allgather操作的结果,并且提高了allgather操作的执行时间。
在另一种可能的实现方式中,第一计算芯片通过多个通信通道与第二计算芯片通信,其中,多个通信通道包括第一通信通道,则第一计算芯片通过通信通道将压缩后的第一数据发送给第二计算芯片的方法可以包括:第一计算芯片通过第一通信通道将第一数据的第一部分数据发送给第二计算芯片,其中,第二计算芯片为第一通信通道的根节点。则第二计算芯片根据压缩后的第一数据进行运算的方法可以包括:第二计算芯片将压缩后的第一数据的部分数据与第二数据的部分数据聚合,其中,第二数据为第二计算芯片上待通信的数据。通过上述方法,第一计算芯片可以通过多个通道中的每一个通道将压缩后的数据发送给每个通道的根节点,当一个通道的根节点为第二计算芯片时,第一计算芯片通过这个通道将压缩后的数据发送给第二计算芯片,由第二计算芯片聚合数据,得到聚合通信中的reduce-scatter操作的结果,并且提高了reduce-scatter操作的执行时间。
在另一种可能的实现方式中,聚合通信系统可以用于使用推荐模型结合用户的特征和商品的特征为用户推荐商品,在第一计算芯片压缩第一数据之前,该方法包括第一处理芯片根据嵌入表将用户的特征和所述商品的特征转化为第一数据;则该方法还包括:第二计算芯片将第二计算芯片根据压缩后的第一数据进行运算得到的运算结果输入推荐模型得到嵌入表的更新值和推荐模型的更新值;然后第二计算芯片根据推荐模型的更新值更新推荐模型,并且根据所述嵌入表的更新值更新所述嵌入表。通过上述方法,本申请提出的聚合通信的方法可以结合推荐模型用于为用户推荐商品,在根据嵌入表得到推荐模型的输入值之间,提高了第一计算芯片和第二计算芯片之间的数据传输的效率,减少了为用户推荐商品的时间。
在另一种可能的实现方式中,聚合通信系统可以用于使用推荐模型结合用户的特征和商品的特征为用户推荐商品,在第一计算芯片压缩第一数据之前,该方法包括第一处理芯片根据嵌入表将用户的特征和所述商品的特征转化为第四数据;然后,第二计算芯片将第 四数据输入推荐模型得到第一数据和推荐模型的更新值;则该方法还包括:第二计算芯片根据推荐模型的更新值更新所述推荐模型,并且根据第二计算芯片根据压缩后的第一数据进行运算得到的运算结果更新嵌入表。通过上述方法,本申请提出的聚合通信的方法可以结合推荐模型用于为用户推荐商品,在根据更新嵌入表的操作中,提高了第一计算芯片和第二计算芯片之间的数据传输的效率,减少了为用户推荐商品的时间。
在另一种可能的实现方式中,聚合通信系统可以用于使用推荐模型结合用户的特征和商品的特征为用户推荐商品,在第一计算芯片压缩第一数据之前,该方法包括第一处理芯片根据嵌入表将用户的特征和所述商品的特征转化为查询向量,并压缩查询向量,然后根据压缩后的查询向量得到所述第一数据;则该方法还包括:第二计算芯片第二计算芯片根据压缩后的第一数据进行运算得到的运算结果得到嵌入向量,并将嵌入向量输入推荐模型得到嵌入表的更新值和推荐模型的更新值;然后第二计算芯片根据推荐模型的更新值更新推荐模型,并且根据所述嵌入表的更新值和压缩后的查询向量更新所述嵌入表。通过上述方法,可以进一步减少第一数据的传输时间,提高聚合通信效率。
在另一种可能的实现方式中,第一数据中数值为0的数据的个数大于数值非0的数据的个数。可以通过上述方法,有效减少数据中无效的0的传输,提高通信效率。
在另一种可能的实现方式中,计算芯片包括:图形处理器、张量处理器、神经网络处理器、深度学习处理器中的其中一个或多个。
第二方面,本申请提供一种聚合通信的系统,至少包括第一计算芯片和第二计算芯片,其中,第一计算芯片通过至少一个通信通道与第二计算芯片通信,聚合通信的系统用于实现如上述第一方面及第一方面任意一种可能实现方式中相应主体所执行的方法的操作步骤。
第三方面,本申请提供一种计算机设备,所述计算机设备包括处理器、存储器、第一计算芯片和第二计算芯片,所述存储器中用于存储计算机执行指令,所述计算机设备运行时,所述处理器执行所述存储器中的计算机执行指令以利用所述第一计算芯片和第二计算芯片执行第一方面或第一方面任一种可能实现方式中所述方法的操作步骤。
第四方面,本申请提供一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机执行上述第一方面或第一方面任一种可能实现方式中所述方法的操作步骤。
第五方面,本申请提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行第一方面或第一方面任一种可能实现方式中所述方法的操作步骤。
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
附图说明
图1为本申请实施例提供的一种聚合通信的系统100的结构示意图;
图2为本申请提供的一种数据迁移操作的数据流示意图;
图3为本申请提供的一种聚合运算操作的数据流示意图;
图4为本申请提供的一种聚合通信的方法的流程示意图;
图5为本申请提供的另一种聚合通信的方法的流程示意图;
图6为本申请提供的另一种聚合通信的方法的流程示意图;
图7为本申请提供的另一种聚合通信的方法的流程示意图;
图8是本申请提供的另一种聚合通信的方法的流程示意图;
图9是本申请提供的另一种聚合通信的方法的流程示意图。
具体实施方式
下面结合附图对本申请实施例中的技术方案进行描述。
图1为本申请实施例提供的一种聚合通信的系统100的结构示意图,如图所示,系统100包括设备110。设备110可以是具有计算功能的设备(例如,服务器),用于单独完成深度学习中涉及的计算任务。设备110包括处理器111、内存112、通信接口114以及至少两个计算芯片,例如计算芯片1131和计算芯片1132。
在设备110中,处理器111、内存112、通信接口114和所有的GPU通过总线连接,例如,快捷外围部件互连标准(Peripheral Component Interconnect Express,PCIe)。该总线也可以为其他类型实现设备内器件间连接的总线。此外,总线除了包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。计算芯片之间还可以通过
Figure PCTCN2022075620-appb-000006
提出的
Figure PCTCN2022075620-appb-000007
总线互相连接。
处理器111用于执行内存112存储的计算机执行指令以实现设备110的功能。示例性地,处理器111可以是CPU,还可以是其他通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者是任何常规的处理器等。
内存112可以包括只读存储器和随机存取存储器,并向处理器111提供指令和数据。内存112还可以包括非易失性随机存取存储器。
内存112还可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data date SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。
计算芯片是适用于执行深度学习算法的处理器,例如图像处理器(graphics processing unit,GPU),张量处理器(tensor processing unit,TPU)、神经网络处理器(neural network processing unit,NPU)、深度学习处理器(deep learning processing unit,DPU)。需要说明 的是,设备110可以包括至少两个相同或者不同类型的计算芯片,例如,设备110可以如图1所示包括两个GPU(图形处理器1131和图形处理器1132),也可以包括两个NPU,还可以包括一个GPU和一个NPU。多个计算芯片可以并行执行深度学习相关算法,例如,神经网络的训练和推理,计算芯片之间可以使用聚合通信技术通过总线完成数据的传输。
可选地,系统100还可以包括多台设备,例如设备110和设备120,其中,其他设备的结构与设备110类似。系统100的不同设备之间通过网络进行通信。网络包括有线或无线的传输方式,其中,有线的传输方式包括利用以太、光纤等形式进行数据传输,无线传输方式包括移动热点(Wi-Fi)、蓝牙、红外等传输方式。具体实施过程中,可以利用一个或多个交换机和/或路由器实现多个节点之间的通信处理。
当系统100包括多台设备时,设备110可以与其他设备协同完成深度学习中涉及的计算任务。此时,设备110可以包括处理器111、内存112、通信接口114以及至少一个计算芯片。不同设备的计算芯片可以并行执行深度学习相关算法,相同设备内的计算芯片之间使用聚合通信技术通过总线完成数据的传输,不同设备之间的计算芯片之间使用聚合通信技术通过网络完成数据的传输。
值得说明的是,图1所示的计算资源的管理系统架构仅仅是为了更好的说明本申请所提供的计算资源的管理方法所提供的系统架构的示例,并不构成对本申请实施例的限定。
接下来,基于图1所示系统,进一步结合图2至图9图详细介绍本申请提供的聚合通信的方法。聚合通信的操作有以下三种类型:同步(barrier)、数据迁移和聚合运算。
同步操作用于同步通信域内的所有进程,执行同步操作的进程必须等待所有的进程执行完同步操作之后该进程在继续执行。
数据迁移操作用于将进程中的数据发送到通信域中的其他进程上,又包括广播(broadcast)、收集(gather)、全收集(allgather)和分散(scatter)。图2为本申请提供的一种数据迁移操作的数据流示意图,如图所示:左一为broadcast,可以将节点1的数据1分别发送给节点1和节点2;左二为gather,可以将节点1的数据1、节点2的数据2以及结点3的数据3全部发送给节点1;左三为allgather,可以将节点1的数据1分别发送给节点2和节点3,节点2的数据2分别发送给节点1和节点3,节点3的数据3分别发送给节点1和节点2,最终节点1至节点3中的每一个节点都拥有数据1至数据3;左四为scatter,可以将节点1的数据2发送给节点2,数据3发送给节点3。
聚合运算操作用于实现数据算术运算,例如求解最小值和最大值、求和、逻辑与运算以及其他用户自定义的计算算法。聚合运算操作包括规约(reduce)、全规约(allreduce)和规约分散(reduce-scatter)。图3为本申请提供的一种聚合运算操作的数据流示意图,以算术运算为求和为例,如图所示:左一为reduce,将节点1、节点2和节点3的数据分别相加并存储在节点1中;左二为allreduce,可以看做是reduce加上broadcast,将节点1的reduce的结果发送至其余所有节点;左三为reduce-scatter,可以看做是reduce加上scatter,最终节点1的数据为节点1至节点3第一个数据的和,节点2的数据为节点1至节点3第二个数据的和,节点3的数据为节点1至节点3第三个数据的和。
在推荐算法中,使用较多的聚合通信方法为reduce-scatter、allreduce以及allgather,因此本申请将分别提供适用于稀疏矩阵的聚合通信方法中的reduce-scatter的方法、allreduce 的方法以及allgather的方法,并将这些方法与推荐算法结合中。需要说明的是,本申请提供的聚合通信的方法也适用于其他算法中具有稀疏数据聚合通信的场景。
图4为本申请提供的一种聚合通信的方法的流程示意图,具体地,可以完成reduce-scatter操作,可由如图1所示的设备110执行,也可以由如图1所示的设备110和其他设备共同执行。下面以设备110单独执行为例阐述方法的流程,如图4所示,具体方法包括:
S401、待通信的计算芯片压缩待通信矩阵。
设备110上的处理器111下发指令,指示待通信的计算芯片对自身芯片上的待通信的矩阵进行压缩,压缩方式可以采用本领域技术人员掌握的矩阵压缩方式,例如可以采用以下压缩方式的任意一种:
方式一:行压缩。
行压缩将原始矩阵压缩后得到一个压缩矩阵(row matrix)和一个压缩向量(row offsets)。其中压缩向量记录了非0行的行号,和矩阵的总行号,而压缩矩阵记录了与行号对应的非0行的数据。例如对于矩阵
Figure PCTCN2022075620-appb-000008
行压缩后压缩矩阵为
Figure PCTCN2022075620-appb-000009
压缩向量为(0 3 5 7)。其中,0表示向量矩阵中第一行数据(1 2 3)对应的行号,类似的,3和5分别表示向量矩阵中第二行和第三行数据对应的行号,7表示原始矩阵的总行号为7。行压缩方式为无损压缩,压缩后的数据可以连续存储或者分开存储。
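As an illustration of the row-compression format described above, the following is a minimal NumPy sketch. The function names row_compress and row_decompress are illustrative only, and only row 0 of the example matrix, (1 2 3), is taken from the text; the other non-zero rows are made up for the example.

```python
import numpy as np

def row_compress(m):
    """Keep only the non-zero rows of m; the compression vector stores the
    row numbers of those rows followed by the total row count (lossless)."""
    nz = [i for i in range(m.shape[0]) if np.any(m[i])]
    row_matrix = m[nz]                            # compressed matrix
    row_offsets = np.array(nz + [m.shape[0]])     # row numbers + total rows
    return row_matrix, row_offsets

def row_decompress(row_matrix, row_offsets, n_cols):
    """Rebuild the original dense matrix from the two compressed parts."""
    out = np.zeros((row_offsets[-1], n_cols), dtype=row_matrix.dtype)
    out[row_offsets[:-1]] = row_matrix
    return out

# a 7-row matrix whose non-zero rows are rows 0, 3 and 5, as in the example
m = np.array([[1, 2, 3],
              [0, 0, 0],
              [0, 0, 0],
              [4, 5, 6],
              [0, 0, 0],
              [7, 8, 9],
              [0, 0, 0]])
rm, ro = row_compress(m)     # ro == [0, 3, 5, 7], matching (0 3 5 7) above
assert np.array_equal(row_decompress(rm, ro, 3), m)
```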
方式二:坐标格式(coordinate list,COO)压缩。
COO压缩方法将每一个原始矩阵中的非0元素,用一个三元组来表示,三元组中包含三个向量,分别为行号向量,列号向量和数值向量。其中,行号向量中存放非0元素的行号,列号向量存放非0元素的列号,数值向量存放非0元素的数值,三个向量中的数据位置一一对应。例如对于矩阵
Figure PCTCN2022075620-appb-000010
共有四个非0元素,则行号向量为(0 1 2 2),列号向量为(2 0 1 2),数值向量为(1 2 3 4)。
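A minimal sketch of the COO representation follows; the layout of matrix C is inferred from the three vectors given in the text, since the original figure is not reproduced here.

```python
import numpy as np

def coo_compress(m):
    rows, cols = np.nonzero(m)    # row numbers and column numbers of non-zeros
    vals = m[rows, cols]          # the non-zero values themselves
    return rows, cols, vals

# C inferred from the example: four non-zeros at (0,2), (1,0), (2,1), (2,2)
C = np.array([[0, 0, 1],
              [2, 0, 0],
              [0, 3, 4]])
rows, cols, vals = coo_compress(C)
# rows == [0 1 2 2], cols == [2 0 1 2], vals == [1 2 3 4], as in the text
```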
方式三:压缩稀疏行格式(compressed sparse row format,CSR)压缩。
CSR压缩方法将每一个原始矩阵中的非0元素,用三类数据来表示,分别为数值向量、列号向量以及行偏移向量。其中数值向量和列号向量与COO压缩方法一致,行偏移向量中第一个数据表示第一行的第一个元素在所有非0元素中的位置,例如对于矩阵C,第一行 的非0数据为1,它是非0元素中的第一个数据,位置为0,则行偏移向量中的第一个数据为0;行偏移向量中第二个数据表示第二行的第一个元素在所有非0元素中的位置,例如对于矩阵C,第二行的非0数据为2,它是非0元素中的第二个数据,位置为1,则行偏移向量中的第二个数据为1;以此类推,行偏移向量中的最后一个数据为所有非0元素的总数。则对于矩阵C,行偏移向量为(0 1 2 4)。
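The same matrix C in CSR form can be produced, for example, with SciPy; this is shown purely for illustration and the patent does not prescribe any particular library.

```python
import numpy as np
from scipy.sparse import csr_matrix

C = np.array([[0, 0, 1],
              [2, 0, 0],
              [0, 3, 4]])
csr = csr_matrix(C)
print(csr.data)      # 数值向量:   [1 2 3 4]
print(csr.indices)   # 列号向量:   [2 0 1 2]
print(csr.indptr)    # 行偏移向量: [0 1 2 4]
```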
示例性地,在示例1中,假设设备110包括4个待通信GPU,分别为GPU1、GPU2、GPU3和GPU4,每个GPU上待通信的矩阵为:
Figure PCTCN2022075620-appb-000011
采用行压缩后GPU1至GPU4上的待通信矩阵分别为:
Figure PCTCN2022075620-appb-000012
每个GPU的通信向量分别为:
Figure PCTCN2022075620-appb-000013
S402、待通信计算芯片之间建立通信通道。
根据待通信计算芯片的数量,建立通信通道,使通信通道的数量与待通信计算芯片的相等,每一个通信通道传输待通信矩阵的部分数据。可以将压缩前的待通信矩阵的数据按照行数平均分配给每一个通信通道。可选地,还可以根据通信通道的实际的数据传输量,动态的规划每一个通信通道传输的待通信矩阵的行数数量。
对于示例1,可以按照平均分配的方式建立4个通信通道,第1个通信通道传输压缩前的待通信矩阵的第0行至第3行数据,第2个通信通道传输压缩前的待通信矩阵的第4行至第7行数据,以此类推。
S403、处理器110从待通信计算芯片中确定根节点。
对于reduce-scatter操作,每一个通信通道具有一个根节点,用于接收和发送通信通道内其他计算芯片的数据并完成算术运算。根节点可以由用户指定,也可以由处理器111根据每个计算芯片的性能选择,例如,核心数量、核心频率、存储速度、显存位宽、容量。本申请对选择的方法不做限定。
对于示例1,可以选择GPU1作为通信通道1的根节点,GPU2作为通信通道2的根节点,以此类推。
S404、根节点接收其他待通信计算芯片的数据并得到聚合运算结果。
对于每一个通信通道,非根节点的待通信计算芯片扫描压缩后的矩阵,将矩阵中属于这个通信通道的数据以及行号发送给这个通信通道的根节点。
例如,对于示例1,在通信通道1中,通信通道1传输压缩前的待通信矩阵的第0行至第3行数据,并且GPU1是通信通道1的根节点,则GPU2扫描压缩向量后可知在原始矩阵的前四行数据中,第1,2,3行有待通信的数据。GPU2将压缩矩阵中,第1,2,3行对应的数据{3,3,3,3},{4,4,4,4},{3,3,3,3}和行号1,2,3发送至GPU1。而GPU3和GPU4扫描压缩向量后可知在原始矩阵的前四行数据中,并没有待通信的数据,因此无需发送。其余通信通道的发送方法可以以此类推。
根节点可以在存储空间中创建新的矩阵区域接收非根节点的计算芯片发送的数据,新的矩阵的行数等于通信通道传输的行数。然后根节点将数据根据行号与压缩前的原始矩阵相应的行的数据进行聚合,得到最终的聚合通信reduce-scatter结果。
对于示例1,GPU1接收到数据{3,3,3,3},{4,4,4,4},{3,3,3,3}和行号1,2,3后,分别将{3,3,3,3},{4,4,4,4},{3,3,3,3}与原始矩阵第0,1,2,3行的数据相加。GPU2接收到数据{3,3,3,3},{6,6,6,6},{7,7,7,7}和行号5,4,7后,分别与原始矩阵第4,5,6,7行的数据相加。其余的GPU的操作也可以以此类推,最终得到聚合通信reduce-scatter的结果为:
Figure PCTCN2022075620-appb-000014
可选地,根节点可以将接收到的数据根据行号直接与压缩前的原始矩阵相应的行的数据进行计算,无需创建新的存储区域。由于根节点可以同时接收并计算多个计算芯片发送的数据,因此需要对数据进行加锁,避免同时对一个数据进行多次计算。根节点可以每次接受一个数据,每个数据计算完毕后才可以接收新的数据。可选地,根节点还可以同时接收一行数据,每一行数据计算完毕后才接收新的数据。
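A single-process sketch of the sparse reduce-scatter flow in S402-S404 is given below. The per-channel transfers are modeled as ordinary function calls and the channel assignment is the even row split described above, so this only illustrates the data movement, not real GPU communication.

```python
import numpy as np

def sparse_reduce_scatter(chip_matrices, n_channels):
    """chip_matrices: one dense matrix per chip (all the same shape).
    Channel k owns a contiguous block of rows and chip k acts as its root.
    Each non-root chip only 'sends' its non-zero rows that fall inside the
    channel's block; the root sums them onto its own rows (S404)."""
    rows = chip_matrices[0].shape[0]
    per = rows // n_channels                 # assumes an even split of rows
    results = []
    for k in range(n_channels):              # channel k, root chip k
        lo, hi = k * per, (k + 1) * per
        acc = chip_matrices[k][lo:hi].copy() # root starts from its own data
        for src, m in enumerate(chip_matrices):
            if src == k:
                continue
            for r in range(lo, hi):
                if np.any(m[r]):             # only non-zero rows are sent
                    acc[r - lo] += m[r]
        results.append(acc)
    return results
```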
通过上述方法,本申请中的聚合通信的方法可以在发送数据之前对待通信矩阵进行压缩,减少reduce-scatter操作中每一个通道的无效的0的传输,提高reduce-scatter操作的效率。
下面介绍本申请实施例提供的一种聚合通信的allreduce操作的方法,图5为本申请提供的另一种聚合通信方法的流程示意图,可由如图1所示的设备110执行,也可以由如图1所示的设备110和其他设备共同执行。下面以设备110单独执行为例阐述方法的流程,如图5所示,具体方法包括:
S501、待通信的计算芯片压缩待通信矩阵,与S401类似。
S502、待通信计算芯片之间建立通信通道,与S402类似。
S503、处理器110从待通信计算芯片中确定根节点。
对于allreduce操作,所有的通信通道可以指定同一个根节点,用于接收和发送其他计算芯片的数据并完成算术运算。根节点的确定方式与S403的确定方式类似。
对于示例1,可以选择GPU1作为根节点。
S504、根节点接收其他待通信计算芯片的数据并得到聚合运算结果。
与S404类似,根节点接收其他待通信计算芯片的数据并完成计算。不同的是,在allreduce操作中,所有通信通道的根节点是同一个计算芯片,因此最后由一个计算芯片接收并聚合了所有的待通信计算芯片的数据。对于示例1,根节点GPU1的计算结果如下:
GPU1:
{1,1,1,1},
{3,3,3,3},
{4,4,4,4},
{3,3,3,3},
{6,6,6,6},
{1,1,1,1},
{4,4,4,4},
{7,7,7,7},
{5,5,5,5},
{8,8,8,8},
{1,1,1,1},
{6,6,6,6},
{1,1,1,1},
{3,3,3,3},
{9,9,9,9},
{8,8,8,8}
S505、根节点将数据发送给其他待通信计算芯片。
根节点以broadcast的形式将S504得到的矩阵发送给其他的待通信计算芯片,完成allreduce操作。对于示例1,最终allreduce的结果为:
Figure PCTCN2022075620-appb-000015
通过上述方法,本申请中的聚合通信的方法可以在发送数据之前对待通信矩阵进行压缩,减少allreduce操作中每一个通道的无效的0的传输,提高allreduce操作的效率。
下面介绍本申请实施例提供的一种allgather操作的方法,图6为本申请提供的另一种聚合通信的方法的流程示意图,可由如图1所示的设备110执行,也可以由如图1所示的设备110和其他设备共同执行。下面以设备110单独执行为例阐述方法的流程,如图6所示,具体方法包括:
S601、待通信的计算芯片压缩待通信矩阵,与S401类似。
S602、待通信计算芯片之间建立通信通道,与S402类似。
S603、处理器110从待通信计算芯片中确定根节点。
对于allgather操作,所有的通信通道可以指定同一个根节点,用于接收和发送其他计算芯片的数据。根节点的确定方式与S403的确定方式类似。
对于示例1,可以选择GPU1作为根节点。
S604、根节点接收其他待通信计算芯片的数据并得到合并结果。
其他待通信计算芯片可以将压缩矩阵和压缩向量发送给根节点,对于示例1,可以得到:
GPU1:
{1,1,1,1},
{1,1,1,1},
{1,1,1,1},
{1,1,1,1},
{3,3,3,3},
{4,4,4,4},
{3,3,3,3},
{4,4,4,4},
{5,5,5,5},
{3,3,3,3},
{6,6,6,6},
{7,7,7,7},
{8,8,8,8},
{6,6,6,6},
{8,8,8,8},
{9,9,9,9},
对应也可以得到所有的压缩向量:
GPU1:
{0,5,10,11,16}
{1,2,3,6,8,12,16}
{4,7,9,11,15,16}
{14,16}
根节点得到压缩矩阵和压缩向量之后,在芯片的存储空间中创建一个新的矩阵用于存放最终的聚合通信结果,新的矩阵的行数等于压缩向量的总行数与通信通道的数量的乘积。然后根节点依次将压缩矩阵的每一个非0行,按照压缩向量的行号,填入新的矩阵中。最 后没有填充数据的行用0填充。
以示例1中的GPU1为例,GPU1首先在芯片的存储空间中创建一个64行4列的新矩阵,再根据压缩向量依次将S603中的非0行填入新的矩阵中。例如,第一行{1,1,1,1}在压缩向量中行号为0,则将{1,1,1,1}填入新的矩阵的第0行,第二行{1,1,1,1}在压缩向量中行号为5,则将{1,1,1,1}填入新的矩阵的第5行,以此类推,最终没有填充数据的行用0填充,可以得到GPU1的结果为:
GPU1:
{1,1,1,1},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{1,1,1,1},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{1,1,1,1},
{0,0,0,0},
{1,1,1,1},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0}
{0,0,0,0},
{3,3,3,3},
{4,4,4,4},
{3,3,3,3},
{0,0,0,0},
{0,0,0,0},
{4,4,4,4},
{0,0,0,0},
{5,5,5,5},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{3,3,3,3},
{0,0,0,0},
{0,0,0,0}
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{6,6,6,6},
{0,0,0,0},
{0,0,0,0},
{7,7,7,7},
{0,0,0,0},
{8,8,8,8},
{0,0,0,0},
{6,6,6,6},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{8,8,8,8}
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{0,0,0,0},
{9,9,9,9},
{0,0,0,0}
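A minimal sketch of the fill-in step in S604: each chip's compressed matrix is expanded into its own block of the result, and unfilled rows stay 0. The function and variable names are illustrative.

```python
import numpy as np

def rebuild_allgather(compressed, row_ids, total_rows, n_cols):
    """compressed: one compressed matrix per chip; row_ids: the matching
    compression vectors (row numbers of non-zero rows, last entry = total
    row count). The result holds total_rows rows per chip, concatenated."""
    out = np.zeros((total_rows * len(compressed), n_cols))
    for chip, (mat, ids) in enumerate(zip(compressed, row_ids)):
        base = chip * total_rows
        for row, r in zip(mat, ids[:-1]):
            out[base + r] = row
    return out
```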
S605、根节点将数据发送给其他待通信计算芯片。
类似S505,根节点以broadcast的形式将S604得到的矩阵发送给其他的待通信计算芯片,完成allgather操作。
通过上述方法,本申请中的聚合通信的方法可以在发送数据之前对待通信矩阵进行压缩,减少allgather操作中的无效的0的传输,提高allgather操作的效率。
对于已经安装了
Figure PCTCN2022075620-appb-000016
的GPU聚合通信库
Figure PCTCN2022075620-appb-000017
设备,可以在调用
Figure PCTCN2022075620-appb-000018
通信库中提供的聚合操作上,进一步。图7为本申请提供的另一种聚合通信的方法的流程示意图,具体地,可以执行allreduce操作,可由如图1所示的设备110执行,也可以由如图1所示的设备110和其他设备共同执行。下面以设备110单独执行为例阐述方法的流程,如图7所示,具体方法包括:
S701、待通信的计算芯片压缩待通信矩阵,与S401类似。
S702、处理器111获取压缩矩阵的最大大小。
处理器111可以直接调用
Figure PCTCN2022075620-appb-000019
通信库中allreduce函数的接口,获取每个待通信计算芯片上的压缩矩阵的行数。可选地,每个待通信的计算芯片还可以遍历压缩矩阵,并将行数发送给处理器111。可选地,每个待通信的计算芯片还可以直接从压缩向量中读取行数,并发送给处理器111。
处理器111根据行数的最大值,将计算芯片上行数未达到最大值的压缩矩阵用0填充,使行数达到最大值。以示例1中压缩后的矩阵为例,GPU1至GPU4的行数分别为4,6,5,1,其中GPU2的行数最大,则其余的GPU使用数据0将行数填充至6行,结果如下:
Figure PCTCN2022075620-appb-000020
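The padding in S702 can be sketched with the hypothetical helper below, shown only to make concrete the equal-buffer requirement of the stock allgather interface.

```python
import numpy as np

def pad_to_max_rows(compressed_list):
    """Pad every compressed matrix with zero rows up to the largest row
    count among all chips, so that equal-sized buffers can be exchanged."""
    max_rows = max(m.shape[0] for m in compressed_list)
    return [np.vstack([m, np.zeros((max_rows - m.shape[0], m.shape[1]),
                                   dtype=m.dtype)])
            for m in compressed_list]
```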
S703、处理器111调用allgather接口。
处理器111可以调用
Figure PCTCN2022075620-appb-000021
通信库中的allgather接口,将每个待通信GPU上的压缩矩阵和压缩向量互相发送给其余的计算芯片,则每个计算芯片都可以得到所有的压缩矩阵。对于示例1,可以得到:
Figure PCTCN2022075620-appb-000022
对应每个GPU也都可以得到所有的压缩向量:
Figure PCTCN2022075620-appb-000023
S704、待通信计算芯片根据压缩向量进行运算。
计算芯片得到压缩矩阵和压缩向量之后,在芯片的存储空间中创建一个新的矩阵用于存放最终的聚合通信结果,新的矩阵的行数等于压缩向量的总行数。然后计算芯片依次将压缩矩阵的每一个非0行,按照压缩向量的行号,填入新的矩阵中。如果压缩矩阵中的两行数据对应的行号相同,那么计算芯片按照运算方式将两行数据计算的结果填入新的矩阵中。
以示例1中的GPU1为例,GPU1首先在芯片的存储空间中创建一个16行4列的新矩阵,再根据压缩向量依次将S703中的非0行填入新的矩阵中。例如,第一行{1,1,1,1}在压缩向量中行号为0,则将{1,1,1,1}填入新的矩阵的第0行,第二行{1,1,1,1}在压缩向量中行号为5,则将{1,1,1,1}填入新的矩阵的第5行,以此类推,最终可以得到所有GPU的allreduce结果:
Figure PCTCN2022075620-appb-000024
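The reconstruction in S704 can be sketched as a scatter-add keyed by the row numbers in the compression vectors; rows that share a row number are summed, and padding rows beyond the real row numbers are simply ignored. Names are illustrative.

```python
import numpy as np

def rebuild_allreduce(compressed, row_ids, total_rows, n_cols):
    """compressed: the (possibly padded) matrices every chip obtained from
    allgather; row_ids: the matching compression vectors (last entry =
    total row count). Rows with the same row number are added together."""
    out = np.zeros((total_rows, n_cols))
    for mat, ids in zip(compressed, row_ids):
        for row, r in zip(mat, ids[:-1]):   # zip skips the padded extra rows
            out[r] += row
    return out
```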
通过上述方法,本申请中的聚合通信的方法可以在直接调用现有的通信库,操作简单,并且可以通过压缩矩阵,减少allreduce操作中每一个通道的无效的0的传输,提高allreduce操作的效率。
在推荐系统中,由于用户特征和商品特征的数据中包含着大量数值为0的数据,稀疏度极高,因此嵌入(embedding)是推荐系统的核心操作。嵌入操作主要用于使用一个矩阵将稀疏向量(当一个向量中数值为0的元素数目多于非0元素的数目时,称为稀疏向量)转换成稠密向量。转换后的向量称为嵌入向量(embedding vector),用于转换的矩阵称为嵌入表(embedding table)。例如,公式1中将两个5维特性向量转换成了2个3维向量:
Figure PCTCN2022075620-appb-000025
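A small numerical illustration of the embedding operation (the table values of 公式1 itself are not reproduced here): multiplying a one-hot sparse feature vector by the embedding table is equivalent to looking up the corresponding table row.

```python
import numpy as np

embedding_table = np.arange(15, dtype=float).reshape(5, 3)   # 5-row, 3-dim table
one_hot = np.array([[0., 1., 0., 0., 0.],                    # two sparse 5-dim features
                    [0., 0., 0., 1., 0.]])
dense = one_hot @ embedding_table                    # 2 x 3 dense embedding vectors
assert np.allclose(dense, embedding_table[[1, 3]])  # same as a direct row lookup
```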
在推荐系统中,推荐模型的训练的每一次迭代可以由正向传播和反向传播两个过程组成,其中,正向传播用于得到输入数据经过嵌入表的转换和推荐模型计算后的结果,反向 传播用于根据计算后的结果与实际值的差值得到推荐模型以及嵌入表的更新。
随着模型的复杂度和数据量的不断增加,目前嵌入表的大小已经达到百GB到TB的级别,10TB级别也即将到来。由于完整的嵌入表的数据量过大,一般由多个计算芯片共同存储,每个计算芯片存放嵌入表的一部分数据。因此在正向传播过程中,使用到的嵌入向量需要根据查询向量(batch)从多个计算芯片存储的嵌入表中查询并取出对应行的数据,可以通过聚合通信中的reduce-scatter或者allreduce操作组合成本次迭代过程中的嵌入向量。最后将该嵌入向量输入推荐模型进行训练。
不仅如此,由于计算量过大,推荐模型也会分别存储在不同的计算芯片中,每一次迭代的计算后,每一个计算芯片只能得到一部分的嵌入向量的更新值。因此在反向传播过程中,可以通过聚合通信中的allgather操作,将嵌入向量的更新值汇总并发送给所有的计算芯片,得到最终的嵌入表的更新值。
图8是本申请提供的另一种聚合通信的方法的流程示意图,可由如图1所示的设备110执行,也可以由如图1所示的设备110和其他设备共同执行。下面以设备110单独执行为例阐述方法的流程,如图8所示,具体方法如下:
S801、计算芯片根据用户特征和商品特征得到查询向量。
查询向量中存储着嵌入表的行与嵌入向量的行的对应关系。查询向量中的数据的数量是嵌入向量的行数,每一个数据的位置是嵌入向量的行,每一个数据的大小是嵌入表的行,例如查询向量为{1,2,3,4}时,嵌入向量为一个4行矩阵,第1个数据为1,表示嵌入向量第1行是嵌入表第1行的数据;第2个数据为2,表示嵌入向量第2行是嵌入表第2行的数据。
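With 1-based row numbering as used in the text, the mapping encoded by the query vector amounts to a simple row lookup; the table values below are made up for illustration.

```python
import numpy as np

embedding_table = np.arange(20, dtype=float).reshape(5, 4)  # hypothetical 5-row table
query = np.array([1, 2, 3, 4])                   # 1-based embedding-table row numbers
embedding_vector = embedding_table[query - 1]    # row i of the result is table row query[i]
```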
S802、计算芯片根据查询向量从嵌入表中得到嵌入向量。
具体地,这一步可以分为两步,包括:
S8021、计算芯片通过查询向量得到每个计算芯片的待通信矩阵。
每个计算芯片可以首先创建一个所有数据为0的待通信矩阵,待通信矩阵的行数等于查询向量的数据的总数。则每个计算芯片分别根据查询向量中的数据从本地存储的嵌入表中取出对应行的数据,组成待通信矩阵。
示例性地,在示例2中,假设设备110包括4个待通信GPU,分别为GPU1、GPU2、GPU3和GPU4,每个GPU存储的嵌入表分别为:
Figure PCTCN2022075620-appb-000026
每个GPU矩阵前的数字表示嵌入表的行号,对于GPU1,存储了完整嵌入表的第1、2、3行,每一行的数据分别是{0,0,0,0}、{1,1,1,1}和{2,2,2,2}。
假设查询向量为:{2,4,5,4,7,2,5,8,6,9,2,7,2,4,10,9},共16行,GPU1至GPU4首先分别创建一个所有数据为0的16行4列的矩阵。根据查询向量,嵌入表的第2行对应嵌入向量 的第1、6、11、13行,则GPU1从本地存储的嵌入表中取出第2行数据,分别填入矩阵的第1、6、11、13行;嵌入表的第4行对应嵌入向量的第2、4、14行,嵌入表的第5行对应嵌入向量的第3、7行,嵌入表的第6行对应嵌入向量的第6行,则GPU2从本地存储的嵌入表中取出第4行数据,填入矩阵的第2、4、14行,取出第5行数据,填入矩阵的第3、7行,取出第6行数据,填入矩阵的第6行,以此类推。最终得到每个GPU的待通信矩阵为:
Figure PCTCN2022075620-appb-000027
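A sketch of S8021 for one chip of 示例2 follows; the helper name local_to_comm_matrix is illustrative, and the local embedding table is modeled as a dictionary from (1-based) table row number to row data.

```python
import numpy as np

def local_to_comm_matrix(query, local_table, emb_dim=4):
    """Build this chip's matrix to be communicated: row p holds the
    embedding-table row query[p] if that row is stored locally, else 0."""
    out = np.zeros((len(query), emb_dim))
    for pos, row_id in enumerate(query):
        if row_id in local_table:
            out[pos] = local_table[row_id]
    return out

# GPU1 of 示例2 stores table rows 1-3, whose values are 0, 1 and 2 in every column
gpu1_table = {r: np.full(4, float(r - 1)) for r in (1, 2, 3)}
query = [2, 4, 5, 4, 7, 2, 5, 8, 6, 9, 2, 7, 2, 4, 10, 9]
m1 = local_to_comm_matrix(query, gpu1_table)
# rows 1, 6, 11 and 13 (1-based) now hold {1,1,1,1}; all other rows remain 0
```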
S8022、计算芯片使用本申请提供的reduce-scatter或者allreduce操作得到嵌入向量。
当推荐模型分成多个部分,分别由多个计算芯片执行训练时,可以使用本申请图4提供的reduce-scatter操作,每个通信的计算芯片可以得到嵌入向量的一部分值,分别输入该计算芯片上的推荐模型进行下一步的计算。
当推荐模型完整的由一个计算芯片或者同时由多个计算芯片执行训练是,可以使用本申请图5或者图7提供的allreduce操作,得到完整的嵌入向量,输入推荐模型进行下一步的计算。
S803、计算芯片将嵌入向量输入推荐模型进行计算,得到计算结果。
S804、计算计算结果与真实值的之间的损失函数。
S805、通过损失函数计算得到嵌入向量的更新值。
S806、计算芯片根据嵌入向量的更新值得到嵌入表的更新值。
具体地,这一步可以分为两步,包括:
S8061、计算芯片使用本申请提供的allgather操作得到完整的嵌入向量的更新值。
当推荐模型分成多个部分,分别由多个计算芯片执行训练时,每一个计算芯片只能得到一部分的嵌入向量的更新值,可以采用本申请图6提供的allgather操作,是每一个计算芯片都得到完整的嵌入向量的更新值。
以示例2为例,假设4个GPU上得到的嵌入向量的更新值分别为:
Figure PCTCN2022075620-appb-000028
经过本申请图6提供的allgather操作后,可以得到:
Figure PCTCN2022075620-appb-000029
S8062、计算芯片根据查询向量得到嵌入表的更新值。
根据每个计算芯片存储的嵌入表的行在查询向量中对应的嵌入向量的行,从嵌入向量的更新值得到嵌入表中的每一行的更新值。当嵌入表的行在查询向量中对应多个嵌入向量的行时,将所有的嵌入向量的行的数据相加后作为嵌入表的行。
对于示例2,查询向量为:{2,4,5,4,7,2,5,8,6,9,2,7,2,4,10,9},嵌入表第2行对应嵌入向量的第1、6、11、13行,则GPU1从获得的嵌入表的更新值中取出第1、6、11、13行的数据,{0.1,0.1,0.1,0.1},{0.3,0.3,0.3,0.3},{0.8,0.8,0.8,0.8},{0.1,0.1,0.1,0.1},相加后,作为嵌入表第2行的更新,{1.3,1.3,1.3,1.3}。以此类推,最终得到嵌入表的更新值为:
Figure PCTCN2022075620-appb-000030
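The accumulation in S8062 can be sketched as follows; the helper name is illustrative, query uses 1-based embedding-table row numbers, and emb_update is the complete embedding-vector update obtained after allgather.

```python
import numpy as np

def embedding_table_update(query, emb_update, local_row_ids):
    """Sum the embedding-vector update rows that map to the same
    embedding-table row; only rows stored on this chip are produced."""
    upd = {r: np.zeros(emb_update.shape[1]) for r in local_row_ids}
    for pos, row_id in enumerate(query):
        if row_id in upd:
            upd[row_id] += emb_update[pos]
    return upd
```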
S807、计算芯片根据嵌入表的更新值更新计算芯片存储的嵌入表。
通过上述方法,本申请中的聚合通信的方法可以在嵌入向量组合和嵌入表更新过程中通过压缩矩阵,提高聚合通信的效率,减少推荐模型的训练过程中使用的时间。
图9是本申请提供的另一种聚合通信的方法的流程示意图,可由如图1所示的设备110执行,也可以由如图1所示的设备110和其他设备共同执行。下面以设备110单独执行为例阐述方法的流程,与图8的方法相比,图9的方法只有S902和S906步骤与图8所示的方法不同,其与步骤均与图8类似,具体方法如下:
S901、计算芯片根据用户特征和商品特征得到查询向量。
S902、计算芯片根据查询向量从嵌入表中得到嵌入向量。
具体地,这一步可以分为三个步骤,包括:
S9021、计算芯片压缩查询向量,得到查询向量的压缩向量和查询向量的恢复向量。
将查询向量中重复的元素去除,并使用查询向量的恢复向量记录压缩后的查询向量中每一个数据在压缩前的查询向量中出现的位置。
如示例2所示,查询向量为:{2,4,5,4,7,2,5,8,6,9,2,7,2,4,10,9},压缩后的查询向量为{2,4,5,7,8,6,9,10},查询向量的恢复向量为{{1,6,11},{2,3,14},{3,7},{5,12},{8},{9},{10,16},{15}},表示,1出现在压缩前的查询向量的1,6,11的位置上,3出现在压缩前的查询向量的2,4,14的位置上,以此类推。
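S9021 can be sketched with the helper below (an illustrative name, not part of the patent); running it on the example query vector yields the recovery positions shown in the comments, with position 13 also recorded for the value 2.

```python
def compress_query(query):
    """Remove duplicate values while recording, for every kept value, the
    1-based positions at which it occurs in the original query vector."""
    compressed, recovery, index_of = [], [], {}
    for pos, v in enumerate(query, start=1):
        if v not in index_of:
            index_of[v] = len(compressed)
            compressed.append(v)
            recovery.append([pos])
        else:
            recovery[index_of[v]].append(pos)
    return compressed, recovery

query = [2, 4, 5, 4, 7, 2, 5, 8, 6, 9, 2, 7, 2, 4, 10, 9]
compressed, recovery = compress_query(query)
# compressed == [2, 4, 5, 7, 8, 6, 9, 10]
# recovery   == [[1, 6, 11, 13], [2, 4, 14], [3, 7], [5, 12],
#                [8], [9], [10, 16], [15]]
```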
S9022、计算芯片根据压缩后的查询向量得到每个计算芯片的待通信矩阵,与S8021类似。
对于示例2,可以得到每个GPU的待通信矩阵为:
Figure PCTCN2022075620-appb-000031
S9023、计算芯片使用本申请提供的allreduce操作得到压缩的嵌入向量。
可以使用本申请图5或者图7提供的allreduce操作,得到压缩的嵌入向量。对于示例2,得到的压缩的嵌入向量为:
Figure PCTCN2022075620-appb-000032
S9024、计算芯片根据查询向量的恢复向量恢复压缩的嵌入向量。
计算芯片可以首先创建一个新的矩阵用于存放最终的嵌入向量,新的矩阵的行数等于压缩前的查询向量的数据的总数。然后计算芯片依次将压缩后的嵌入向量的每一行数据根据行号,确定其在查询向量的恢复向量中的位置,并进一步确定这一行的数据在原始的查询向量中的位置。最后,计算芯片将这一行数据根据原始的查询向量中的位置填入到最终的嵌入向量的矩阵中。
以示例2中的GPU1为例,GPU1首先在芯片的存储空间中创建一个16行4列的新嵌入向量的矩阵,再根据压缩向量依次确定S9023中的每一行数据对应在原始的查询向量中 的位置。例如,第一行{1,1,1,1}在压缩后的嵌入向量中行号为1,则其是查询向量的恢复向量中的第一个数据,对应的数据内容为{1,6,11},表示这一行数据在原始的查询向量中的第1,6,11行。则计算矩阵将{1,1,1,1}填入新的矩阵的第1,6,11行,以此类推,最终可以得到所有GPU的allreduce结果:
Figure PCTCN2022075620-appb-000033
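The restoration in S9024 is a simple expansion driven by the recovery vector; a sketch with illustrative names and 1-based positions, as in the text:

```python
import numpy as np

def restore_embedding(comp_emb, recovery, total_rows):
    """Copy row i of the compressed embedding vector to every 1-based
    position listed in recovery[i] of the full-size embedding vector."""
    out = np.zeros((total_rows, comp_emb.shape[1]))
    for i, positions in enumerate(recovery):
        for p in positions:
            out[p - 1] = comp_emb[i]
    return out
```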
S903、计算芯片将嵌入向量输入推荐模型进行计算,得到计算结果。
S904、计算计算结果与真实值的之间的损失函数。
S905、通过损失函数计算得到嵌入向量的更新值。
S906、计算芯片根据嵌入向量的更新值得到嵌入表的更新值。
具体地,这一步可以分为两步,包括:
S9061、计算芯片根据查询向量去除嵌入向量更新值中的重复数据。
与S8061类似,当推荐模型分成多个部分,分别由多个计算芯片执行训练时,每一个计算芯片只能得到一部分的嵌入向量的更新值。正如前面所述,查询向量中每一个数据的位置是嵌入向量的行,每一个数据的大小是嵌入表的行,计算芯片可以遍历查询向量中部分嵌入向量的行的位置对应的数据后,将查询向量上数据的值相同的位置对应的嵌入向量 的行的更新值相加,写入嵌入向量对应行的任意一行中,并将其他行赋值为0。
以示例2的GPU1为例,嵌入向量的更新值为:
Figure PCTCN2022075620-appb-000034
查询向量为{2,4,5,4,7,2,5,8,6,9,2,7,2,4,10,9},GPU1得到的部分嵌入向量为完整嵌入向量的第1-4行,GPU1遍历查询向量的第1-4个数据{2,4,5,4},其中第2个数据和第4个数据的值都为4,因此可以将GPU1中嵌入向量的第2行和第4行的数据相加,写入第2行,并将第4行的数据赋值为0。以此类推,得到嵌入向量的更新值为:
Figure PCTCN2022075620-appb-000035
由于查询向量中相同数值的数据对应的嵌入表中的相同行,因此可以提前将其对应的嵌入向量的多行数据转变为一行,增加下一步中待通信的矩阵的稀疏性,提高聚合通信效率。当查询向量中重复的数据越多时,越能提高通信效率。
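S9061 can be sketched as follows for the slice of the update held by one chip; the helper name is illustrative, and query_slice contains the query-vector elements corresponding to the rows of that slice.

```python
import numpy as np

def merge_duplicate_updates(query_slice, upd_slice):
    """Rows of the update whose query values are equal are summed into the
    first such row; the remaining duplicate rows are set to 0 (S9061)."""
    out = np.array(upd_slice, dtype=float)
    first_row = {}
    for i, v in enumerate(query_slice):
        if v in first_row:
            out[first_row[v]] += out[i]
            out[i] = 0.0
        else:
            first_row[v] = i
    return out

# GPU1 of 示例2 holds rows 1-4 of the update; the query values there are
# [2, 4, 5, 4], so rows 2 and 4 are summed into row 2 and row 4 becomes 0.
```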
S9062、计算芯片使用本申请提供的allgather操作得到完整的嵌入向量的更新值,与S8062类似。
S9063、计算芯片根据查询向量得到嵌入表的更新值,与S8062类似。
S907、计算芯片根据嵌入表的更新值更新计算芯片存储的嵌入表。
通过上述方法,本申请中的聚合通信的方法可以最大程度的需要通信的嵌入向量进行压缩,进一步减少了通信中传输的数据量,提高聚合通信的效率,减少推荐模型的训练过程中使用的时间。
值得说明的是,对于上述方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制。
本领域的技术人员根据以上描述的内容,能够想到的其他合理的步骤组合,也属于本申请的保护范围内。其次,本领域技术人员也应该熟悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作并不一定是本申请所必须的。
本申请还提供一种聚合通信的系统,可以为图1所示的系统100。该聚合通信的系统 至少包括第一计算芯片和第二计算芯片,其中,第一计算芯片通过至少一个通信通道与第二计算芯片通信。示例性地,第一计算芯片和第二计算芯片可以分别为图1所示的计算芯片1131和计算芯片1132,也可以分别为图1所示的计算芯片1131和计算芯片1231。聚合通信的系统用于实现如上述聚合通信方法中相应主体所执行的方法的操作步骤。通过上述聚合通信的系统,可以减少计算芯片之间的数据的数量,提高聚合通信效率。所述系统还可以使用在为用户推荐商品的场景下,减少计算速度,提高用户的使用体验。
上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载或执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质。半导体介质可以是固态硬盘(solid state drive,SSD)。
以上所述,仅为本申请的具体实施方式。熟悉本技术领域的技术人员根据本申请提供的具体实施方式,可想到变化或替换,都应涵盖在本申请的保护范围之内。

Claims (18)

  1. 一种聚合通信的方法,其特征在于,所述方法应用于聚合通信系统,所述系统至少包括第一计算芯片和第二计算芯片,所述第一计算芯片通过至少一个通信通道与所述第二计算芯片通信,所述方法包括:
    所述第一计算芯片压缩第一数据;
    所述第一计算芯片通过所述通信通道将压缩后的第一数据发送给所述第二计算芯片;
    所述第二计算芯片根据所述压缩后的第一数据进行运算。
  2. 根据权利要求1所述的聚合通信的方法,其特征在于,所述第一计算芯片通过一个通信通道与所述第二计算芯片通信,所述第二计算芯片为所述通信通道的根节点,则所述第二计算芯片根据所述压缩后的第一数据进行运算,具体包括:
    所述第二计算芯片将所述第一数据与压缩后的第二数据聚合,所述第二数据为所述第二计算芯片上待通信的数据;
    所述第二计算芯片将所述聚合结果发送至所述第一计算芯片。
  3. 根据权利要求1所述的聚合通信的方法,其特征在于,所述第一计算芯片通过一个通信通道与所述第二计算芯片通信,所述第二计算芯片为所述通信通道的根节点,则所述第二计算芯片根据所述压缩后的第一数据进行运算,具体包括:
    所述第二计算芯片将所述第一数据与压缩后的第二数据合并,所述第二数据为所述第二计算芯片上待通信的数据;
    所述第二计算芯片将所述合并结果发送至所述第一计算芯片。
  4. 根据权利要求1所述的聚合通信的方法,其特征在于,所述第一计算芯片通过多个通信通道与所述第二计算芯片通信,所述多个通信通道包括第一通信通道,所述第一计算芯片通过所述通信通道将压缩后的第一数据发送给所述第二计算芯片,具体包括:
    所述第一计算芯片通过所述第一通信通道将所述第一数据的第一部分数据发送给所述第二计算芯片,所述第二计算芯片为所述第一通信通道的根节点;
    则所述第二计算芯片根据所述压缩后的第一数据进行运算,具体包括:
    所述第二计算芯片将所述压缩后的第一数据的部分数据与第二数据的部分数据聚合,所述第二数据为所述第二计算芯片上待通信的数据。
  5. 根据权利要求1、2或4所述的聚合通信方法,其特征在于,所述聚合通信系统用于使用推荐模型结合用户的特征和商品的特征为用户推荐商品,在所述第一计算芯片压缩第一数据之前,所述方法包括:
    所述第一处理芯片根据嵌入表将所述用户的特征和所述商品的特征转化为所述第一数据;
    则所述方法还包括:
    所述第二计算芯片将所述第二计算芯片根据所述压缩后的第一数据进行运算得到的运算结果输入所述推荐模型得到所述嵌入表的更新值和所述推荐模型的更新值;
    所述第二计算芯片根据所述推荐模型的更新值更新所述推荐模型;
    所述第二计算芯片根据所述嵌入表的更新值更新所述嵌入表。
  6. 根据权利要求1或3所述的聚合通信方法,其特征在于,所述聚合通信系统用于使用推荐模型结合用户的特征和商品的特征为用户推荐商品,在所述第一计算芯片压缩第一数据之前,所述方法包括:
    所述第一处理芯片根据嵌入表将所述用户的特征和所述商品的特征转化为第四数据;
    所述第二计算芯片将所述第四数据输入所述推荐模型得到所述第一数据和所述推荐模型的更新值;
    则所述方法还包括:
    所述第二计算芯片根据所述推荐模型的更新值更新所述推荐模型;
    所述第二计算芯片根据所述第二计算芯片根据所述压缩后的第一数据进行运算得到的运算结果更新所述嵌入表。
  7. 根据权利要求1至6任一所述的聚合通信方法,其特征在于,所述第一数据中数值为0的数据的个数大于数值非0的数据的个数。
  8. 根据权利要求1至7任一所述的聚合通信的方法,其特征在于,所述计算芯片包括:图形处理器、张量处理器、神经网络处理器、深度学习处理器中的其中一个或多个。
  9. 一种聚合通信的系统,其特征在于,所述系统至少包括第一计算芯片和第二计算芯片,所述第一计算芯片通过至少一个通信通道与所述第二计算芯片通信:
    所述第一计算芯片,用于压缩第一数据;还用于通过所述通信通道将压缩后的第一数据发送给所述第二计算芯片;
    所述第二计算芯片,用于根据所述压缩后的第一数据进行运算。
  10. 根据权利要求9所述的聚合通信的系统,其特征在于,所述第一计算芯片通过一个通信通道与所述第二计算芯片通信,所述第二计算芯片为所述通信通道的根节点,则所述第二计算芯片还用于:
    将所述第一数据与第二数据聚合,所述第二数据为所述第二计算芯片上待通信的数据;
    将所述聚合结果发送至所述第一计算芯片。
  11. 根据权利要求9所述的聚合通信的系统,其特征在于,所述第一计算芯片通过一个通信通道与所述第二计算芯片通信,所述第二计算芯片为所述通信通道的根节点,则所述第二计算芯片还用于:
    将所述第一数据与第二数据合并,所述第二数据为所述第二计算芯片上待通信的数据;
    将所述合并结果发送至所述第一计算芯片。
  12. 根据权利要求9所述的聚合通信的系统,其特征在于,所述第一计算芯片通过多个通信通道与所述第二计算芯片通信,所述多个通信通道包括第一通信通道,所述第一计算芯片还用于:
    通过所述第一通信通道将所述第一数据的第一部分数据发送给所述第二计算芯片,所述 第二计算芯片为所述第一通信通道的根节点;
    则所述第二计算芯片还用于将所述压缩后的第一数据的部分数据与第二数据的部分数据聚合,所述第二数据为所述第二计算芯片上待通信的数据。
  13. 根据权利要求9、10或12所述的聚合通信方法,其特征在于,所述聚合通信系统用于使用推荐模型结合用户的特征和商品的特征为用户推荐商品,所述第一计算芯片还用于:
    在所述第一计算芯片压缩第一数据之前,根据嵌入表将所述用户的特征和所述商品的特征转化为所述第一数据;
    则所述第二计算芯片还用于,将所述第二计算芯片根据所述压缩后的第一数据进行运算得到的运算结果输入所述推荐模型得到所述嵌入表的更新值和所述推荐模型的更新值;
    根据所述推荐模型的更新值更新所述推荐模型;
    根据所述嵌入表的更新值更新所述嵌入表。
  14. 根据权利要求9或11所述的聚合通信方法,其特征在于,所述聚合通信系统用于使用推荐模型结合用户的特征和商品的特征为用户推荐商品,所述第一计算芯片还用于:
    在所述第一计算芯片压缩第一数据之前,根据嵌入表将所述用户的特征和所述商品的特征转化为第四数据;
    则在所述第一计算芯片压缩第一数据之前,所述第二计算芯片还用于,将所述第四数据输入所述推荐模型得到所述第一数据和所述推荐模型的更新值;
    所述第二计算芯片还用于,根据所述推荐模型的更新值更新所述推荐模型;根据所述第二计算芯片根据所述压缩后的第一数据进行运算得到的运算结果更新所述嵌入表。
  15. 根据权利要求9至14任一所述的聚合通信的系统，其特征在于，所述第一数据中数值为0的数据的个数大于数值非0的数据的个数。
  16. 根据权利要求9至15任一所述的聚合通信的系统,其特征在于,所述计算芯片包括:图形处理器、张量处理器、神经网络处理器、深度学习处理器中的其中一个或多个。
  17. 一种计算机设备，其特征在于，包括处理器、存储器、第一计算芯片和第二计算芯片，所述存储器用于存储计算机执行指令，所述处理器执行所述存储器中的计算机执行指令，使得所述第一计算芯片和第二计算芯片执行权利要求1-8中任一所述方法的操作步骤。
  18. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中包括指令,当其在计算机上运行时,使得计算机执行权利要求1至8中任一所述的方法的操作步骤。
PCT/CN2022/075620 2021-04-21 2022-02-09 一种聚合通信的方法、系统和计算机设备 WO2022222578A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22790688.0A EP4310687A4 (en) 2021-04-21 2022-02-09 AGGREGATION COMMUNICATION METHOD AND SYSTEM, AND COMPUTER DEVICE
US18/488,454 US20240045828A1 (en) 2021-04-21 2023-10-17 Collective communication method and system, and computer device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110431626.8A CN115221091A (zh) 2021-04-21 2021-04-21 一种聚合通信的方法、系统和计算机设备
CN202110431626.8 2021-04-21

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/488,454 Continuation US20240045828A1 (en) 2021-04-21 2023-10-17 Collective communication method and system, and computer device

Publications (1)

Publication Number Publication Date
WO2022222578A1 true WO2022222578A1 (zh) 2022-10-27

Family

ID=83604506

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/075620 WO2022222578A1 (zh) 2021-04-21 2022-02-09 一种聚合通信的方法、系统和计算机设备

Country Status (4)

Country Link
US (1) US20240045828A1 (zh)
EP (1) EP4310687A4 (zh)
CN (2) CN116257494B (zh)
WO (1) WO2022222578A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116257494B (zh) * 2021-04-21 2023-12-08 华为技术有限公司 一种聚合通信的方法、系统和计算机设备
CN117014520B (zh) * 2023-10-08 2024-02-09 广东广宇科技发展有限公司 一种基于压缩算法的数据快速传输方法

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160179574A1 (en) * 2014-12-17 2016-06-23 Nvidia Corporation Work-efficient, load-balanced, merge-based parallelized consumption of sequences of sequences
CN107169098A (zh) * 2017-05-15 2017-09-15 北京京东尚科信息技术有限公司 数据搬运方法、数据搬运装置及电子设备
US20180189110A1 (en) * 2016-12-31 2018-07-05 Intel Corporation Compute engine architecture to support data-parallel loops with reduction operations
CN109409964A (zh) * 2018-11-27 2019-03-01 口碑(上海)信息技术有限公司 优质品牌的识别方法及装置
CN111240744A (zh) * 2020-01-03 2020-06-05 支付宝(杭州)信息技术有限公司 一种提高涉及稀疏矩阵并行计算效率的方法和系统
CN111858454A (zh) * 2020-06-29 2020-10-30 苏州浪潮智能科技有限公司 一种gpu通信方法、设备以及介质
CN112261023A (zh) * 2020-10-15 2021-01-22 苏州浪潮智能科技有限公司 一种卷积神经网络的数据传输方法和装置
CN112507284A (zh) * 2020-12-18 2021-03-16 清华大学 稀疏矩阵乘法在可重构处理器阵列上的实现方法及装置

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8019151B2 (en) * 2007-06-11 2011-09-13 Visualization Sciences Group, Inc. Methods and apparatus for image compression and decompression using graphics processing unit (GPU)
US9805310B2 (en) * 2012-03-04 2017-10-31 Adam Jeffries Utilizing spatial statistical models to reduce data redundancy and entropy
CN105488088B (zh) * 2014-12-31 2019-05-07 哈尔滨安天科技股份有限公司 基于树形结构的二维网络角度分配布局方法
CN107330337B (zh) * 2017-07-19 2022-05-24 腾讯科技(深圳)有限公司 混合云的数据存储方法、装置、相关设备及云系统
CN111699682A (zh) * 2017-12-07 2020-09-22 韩国电子通信研究院 用于使用通道之间的选择性信息共享进行编码和解码的方法和设备
CN108108821B (zh) * 2017-12-29 2022-04-22 Oppo广东移动通信有限公司 模型训练方法及装置
CN109002283B (zh) * 2018-06-14 2021-07-27 南京航空航天大学 一种基于文件路径分析的代码审查者推荐方法
CN111886593B (zh) * 2018-08-31 2024-06-11 华为技术有限公司 数据处理系统和数据处理方法
US11127167B2 (en) * 2019-04-29 2021-09-21 Nvidia Corporation Efficient matrix format suitable for neural networks
CN110457361B (zh) * 2019-07-05 2023-12-05 中国平安人寿保险股份有限公司 特征数据获取方法、装置、计算机设备和存储介质
US20210105451A1 (en) * 2019-12-23 2021-04-08 Intel Corporation Scene construction using object-based immersive media
CN116257494B (zh) * 2021-04-21 2023-12-08 华为技术有限公司 一种聚合通信的方法、系统和计算机设备

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160179574A1 (en) * 2014-12-17 2016-06-23 Nvidia Corporation Work-efficient, load-balanced, merge-based parallelized consumption of sequences of sequences
US20180189110A1 (en) * 2016-12-31 2018-07-05 Intel Corporation Compute engine architecture to support data-parallel loops with reduction operations
CN107169098A (zh) * 2017-05-15 2017-09-15 北京京东尚科信息技术有限公司 数据搬运方法、数据搬运装置及电子设备
CN109409964A (zh) * 2018-11-27 2019-03-01 口碑(上海)信息技术有限公司 优质品牌的识别方法及装置
CN111240744A (zh) * 2020-01-03 2020-06-05 支付宝(杭州)信息技术有限公司 一种提高涉及稀疏矩阵并行计算效率的方法和系统
CN111858454A (zh) * 2020-06-29 2020-10-30 苏州浪潮智能科技有限公司 一种gpu通信方法、设备以及介质
CN112261023A (zh) * 2020-10-15 2021-01-22 苏州浪潮智能科技有限公司 一种卷积神经网络的数据传输方法和装置
CN112507284A (zh) * 2020-12-18 2021-03-16 清华大学 稀疏矩阵乘法在可重构处理器阵列上的实现方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4310687A4

Also Published As

Publication number Publication date
US20240045828A1 (en) 2024-02-08
CN116257494B (zh) 2023-12-08
EP4310687A1 (en) 2024-01-24
CN115221091A (zh) 2022-10-21
CN116257494A (zh) 2023-06-13
EP4310687A4 (en) 2024-08-21

Similar Documents

Publication Publication Date Title
WO2022222578A1 (zh) 一种聚合通信的方法、系统和计算机设备
US10567494B2 (en) Data processing system, computing node, and data processing method
WO2021244354A1 (zh) 神经网络模型的训练方法和相关产品
JP2022532466A (ja) ニューラルネットワークに基づく量子誤り訂正復号方法、装置、チップ、コンピュータ機器、及びコンピュータプログラム
JP2022532469A (ja) 量子回路のフォールトトレランス・誤り訂正・復号方法、装置及びチップ並びにコンピュータプログラム
TWI735545B (zh) 一種模型的訓練方法和裝置
WO2020073211A1 (zh) 运算加速器、处理方法及相关设备
JP6978467B2 (ja) 疎要素を密行列に変換するためのシステムおよび方法
JP2022088600A (ja) 量子回路の処理方法、装置、電子デバイス、記憶媒体、及びプログラム
WO2023010694A1 (zh) 量子态制备电路生成方法、装置、芯片、设备及程序产品
WO2022116689A1 (zh) 图数据处理方法、装置、计算机设备和存储介质
KR102163209B1 (ko) 컨볼루션 신경망 훈련의 다차원 병렬화 방법과 이를 수행하는 장치 사이의 재구성 가능한 연결 구조
CN112633482B (zh) 一种高效宽度图卷积神经网络模型系统及训练方法
JP2017138966A (ja) 疎要素を密行列に変換するためのシステムおよび方法
TWI775210B (zh) 用於卷積運算的資料劃分方法及處理器
CN116108238B (zh) 一种图数据库中多跳查询的优化方法、系统和装置
TWI758223B (zh) 具有動態最小批次尺寸之運算方法,以及用於執行該方法之運算系統及電腦可讀儲存媒體
CN116800671A (zh) 数据传输方法、装置、计算机设备、存储介质和程序产品
WO2020037512A1 (zh) 一种神经网络计算方法和装置
Cao et al. Higher rank matricial ranges and hybrid quantum error correction
WO2023019972A1 (zh) 一种计算装置、方法、系统、电路、芯片及设备
CN116562373A (zh) 数据挖掘方法、装置、设备和介质
CN117035045A (zh) 模型参数更新方法、装置、设备、存储介质和程序产品
CN115688917A (zh) 神经网络模型的训练方法、装置、电子设备及存储介质
Fish et al. Sampling without compromising accuracy in adaptive data analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22790688

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022790688

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022790688

Country of ref document: EP

Effective date: 20231017

NENP Non-entry into the national phase

Ref country code: DE