CN108052482B - Method and system for communication between GPUs - Google Patents


Info

Publication number
CN108052482B
Authority
CN
China
Prior art keywords
bit
data
level
diagram
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711115570.5A
Other languages
Chinese (zh)
Other versions
CN108052482A (en)
Inventor
石宣化
金海
赵鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201711115570.5A priority Critical patent/CN108052482B/en
Publication of CN108052482A publication Critical patent/CN108052482A/en
Application granted granted Critical
Publication of CN108052482B publication Critical patent/CN108052482B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17318Parallel communications techniques, e.g. gather, scatter, reduce, broadcast, multicast, all to all

Abstract

The invention discloses a method and a system for communication between GPUs (graphics processing units), belonging to the field of data processing and high-performance computing. The method comprises the following steps. Data conversion: convert the data so that the redundant information in it is clearly exposed for subsequent processing. Bitmap generation: generate several levels of bitmaps for the converted data so that transmission of the redundant information can be omitted. Data transmission: select specific parts of the converted data for transmission according to the bitmaps. Data extraction: after transmission completes, read and convert the transmitted data to recover the original data. When redundant data with certain characteristics are communicated between GPUs, the method and system can convert the data quickly on the GPU and reduce the transmission volume, thereby improving inter-GPU communication efficiency.

Description

Method and system for communication between GPUs
Technical Field
The invention belongs to the field of data processing and high-performance computing, and particularly relates to a method and a system for communication between GPUs.
Background
With the rapid development of the programmable graphics processing unit (GPU), the GPU now has great advantages over the central processing unit (CPU) in computational performance and memory bandwidth, and it is increasingly used in various fields to accelerate data processing and computation. GPUs communicate with each other over PCIe, whose data transfer rate is far below the GPU's memory bandwidth (taking the Nvidia Tesla P100 as an example, the theoretical rate of PCIe x16 is 32 GB/s, while the memory bandwidth is as high as 732 GB/s), so data communication between GPUs is often a performance bottleneck.
In a computer cluster, data is often compressed with a compression algorithm before communication to improve efficiency. However, an algorithm only benefits from the GPU when it can be highly parallelized, and no compression algorithm that parallelizes efficiently at large scale currently exists, so data communication between GPUs cannot be accelerated by compression the way network communication can.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a method and a system for inter-GPU communication, so as to solve the technical problem of low efficiency in data communication between existing GPUs.
To achieve the above object, according to an aspect of the present invention, there is provided a method for inter-GPU communication, including:
(1) combining the bits of the data to be transmitted into M-bit unsigned numbers, and storing the M-bit unsigned number for each bit position in order from the least significant bit upward, to obtain the result data after data conversion;
(2) if an M-bit unsigned number in the result data is all 0s or all 1s, setting the corresponding bit in a first-level bitmap to 1, and otherwise setting it to 0;
(3) when an M-bit unsigned number in the first-level bitmap is all 0s or all 1s, setting the corresponding bit in a second-level bitmap to 1, and otherwise setting it to 0, and generating a third-level bitmap from the second-level bitmap in the same way, until every element of the generated level-N bitmap is 0;
(4) determining from the generated multi-level bitmaps the target data in the result data that needs to be transmitted, and transmitting the multi-level bitmaps, the number of bitmap levels and the target data from the first GPU to the second GPU;
(5) the second GPU filling the data missing because it was not transmitted with 0s or 1s according to the multi-level bitmaps, the number of bitmap levels and the target data, and then obtaining the data to be transmitted through the inverse of the data conversion.
Preferably, step (1) specifically comprises:
(1.1) treating every M data items of the data to be transmitted as one processing group;
(1.2) combining the bits of each group of data into M-bit unsigned numbers;
(1.3) storing the M-bit unsigned number for each bit position in order, from the least significant bit upward;
(1.4) extracting in turn the group of data at the same significant-bit position of every M-bit unsigned number, and concatenating the extracted groups in order of significant bit from low to high to form the result data.
Preferably, step (2) specifically comprises:
(2.1) if an M-bit unsigned number in the result data is all 0s, setting the corresponding bit in first-level bitmap A to 1, and otherwise setting it to 0;
(2.2) if an M-bit unsigned number in the result data is all 1s, setting the corresponding bit in first-level bitmap B to 1, and otherwise setting it to 0.
Preferably, step (3) comprises:
(3.1) composing first-level bitmap A and first-level bitmap B into the first-level bitmap;
(3.2) if an M-bit unsigned number in the first-level bitmap is all 0s, setting the corresponding bit in second-level bitmap A to 1, and otherwise setting it to 0;
(3.3) if an M-bit unsigned number in the first-level bitmap is all 1s, setting the corresponding bit in second-level bitmap B to 1, and otherwise setting it to 0;
(3.4) composing second-level bitmap A and second-level bitmap B into the second-level bitmap, and generating a third-level bitmap from the second-level bitmap in the same way, until every element of the generated level-N bitmap is 0.
According to another aspect of the present invention, there is provided a system for inter-GPU communication, comprising:
a data conversion module, configured to combine the bits of the data to be transmitted into M-bit unsigned numbers and store the M-bit unsigned number for each bit position in order from the least significant bit upward, to obtain the result data after data conversion;
a bitmap generation module, configured to set the corresponding bit in the first-level bitmap to 1 when an M-bit unsigned number in the result data is all 0s or all 1s, and to 0 otherwise; and, when an M-bit unsigned number in the first-level bitmap is all 0s or all 1s, to set the corresponding bit in the second-level bitmap to 1 and otherwise to 0, generating a third-level bitmap from the second-level bitmap in the same way until every element of the generated level-N bitmap is 0;
a data transmission module, configured to determine from the generated multi-level bitmaps the target data in the result data that needs to be transmitted, and to transmit the multi-level bitmaps, the number of bitmap levels and the target data from the first GPU to the second GPU;
and a data extraction module, configured for the second GPU to fill the data missing because it was not transmitted with 0s or 1s according to the multi-level bitmaps, the number of bitmap levels and the target data, and then to obtain the data to be transmitted through the inverse of the data conversion.
Preferably, the data conversion module includes:
a data grouping module, configured to treat every M data items of the data to be transmitted as one processing group;
an unsigned number generation module, configured to combine the bits of each group of data into M-bit unsigned numbers;
a data processing module, configured to store the M-bit unsigned number for each bit position in order, from the least significant bit upward;
and a result data generation module, configured to extract in turn the group of data at the same significant-bit position of every M-bit unsigned number, and to concatenate the extracted groups in order of significant bit from low to high to form the result data.
Preferably, the bitmap generation module includes:
a first bitmap generation module, configured to set the corresponding bit in first-level bitmap A to 1 when an M-bit unsigned number in the result data is all 0s, and to 0 otherwise;
and a second bitmap generation module, configured to set the corresponding bit in first-level bitmap B to 1 when an M-bit unsigned number in the result data is all 1s, and to 0 otherwise.
Preferably, the bitmap generation module further includes:
a first combining module, configured to compose first-level bitmap A and first-level bitmap B into the first-level bitmap;
a third bitmap generation module, configured to set the corresponding bit in second-level bitmap A to 1 when an M-bit unsigned number in the first-level bitmap is all 0s, and to 0 otherwise;
a fourth bitmap generation module, configured to set the corresponding bit in second-level bitmap B to 1 when an M-bit unsigned number in the first-level bitmap is all 1s, and to 0 otherwise;
and a cyclic processing module, configured to compose second-level bitmap A and second-level bitmap B into the second-level bitmap, and to generate a third-level bitmap from the second-level bitmap in the same way until every element of the generated level-N bitmap is 0.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) Because the data have a strong range locality and similarity (for example, the label values of nodes in a graph algorithm), long runs of repeated values appear after conversion, and part of the data can be replaced by the bitmaps, so that the data volume is reduced by a fast computation and inter-GPU communication efficiency is improved.
(2) Multi-level bitmaps are generated for the converted result data, so that the runs of 1s or 0s present in them need not be transmitted; after the bitmaps are generated, only the number of bitmap levels, the multi-level bitmaps and some parts of the converted result data are transmitted, and transmission of the remaining data is avoided.
(3) The data conversion can be executed entirely in parallel, and the GPU has hardware support for combining the bits of a group of data, so the conversion effectively exploits the GPU's powerful computing capacity and completes very quickly.
(4) Bitmap generation and extraction of the converted data likewise match the characteristics of the GPU and can be executed quickly, so the invention effectively reduces the transmitted data volume while introducing only a small overhead, improving overall efficiency.
Drawings
FIG. 1 is a simplified diagram of a method for inter-GPU communication according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for inter-GPU communication according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a data transformation process according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a bit map according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The terms "first" and "second," and the like in the description and claims of the present invention and the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprises" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
According to the method and system for communication between GPUs provided by the invention, the data undergoes a single fast conversion, bitmaps are generated, specific redundant information in the data is gathered and exposed, and transmission of that redundant information is then avoided by means of the bitmaps, thereby improving communication efficiency.
Fig. 1 is a simplified schematic diagram of a method for communication between GPUs according to an embodiment of the present invention. The method mainly comprises data conversion, bitmap generation, data transmission and data extraction. Data conversion: convert the data so that its redundant information is clearly exposed for subsequent processing. Bitmap generation: generate several levels of bitmaps for the converted data so that transmission of the redundant information can be omitted. Data transmission: select specific parts of the converted data for transmission according to the bitmaps. Data extraction: after transmission completes, read and convert the transmitted data to recover the original data.
Fig. 2 is a schematic flow chart of a method for communication between GPUs according to an embodiment of the present invention, where the method shown in fig. 2 includes the following steps:
(1) combining the bits of the data to be transmitted into M-bit unsigned numbers, and storing the M-bit unsigned number for each bit position in order from the least significant bit upward, to obtain the result data after data conversion;
in an optional embodiment, step (1) specifically includes:
(1.1) treating every M data items of the data to be transmitted as one processing group;
(1.2) combining the bits of each group of data into M-bit unsigned numbers;
(1.3) storing the M-bit unsigned number for each bit position in order, from the least significant bit upward;
(1.4) extracting in turn the group of data at the same significant-bit position of every M-bit unsigned number, and concatenating the extracted groups in order of significant bit from low to high to form the result data.
After data conversion, the result data for some bit positions are runs of 1s or 0s because of the limited range of the data; because of data similarity, adjacent data items in some ranges share the same values in certain bits, so the corresponding result data are likewise runs of 1s or 0s.
Preferably, M is taken to be a multiple of 8.
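A minimal CPU sketch of this conversion step may help: it is a serial analogue of what each warp would do with __ballot() on the GPU, and M, the item width and the sample data are illustrative choices, not fixed by the text.

```python
# CPU sketch of the data-conversion (bit-plane transpose) step.
# On the GPU, each warp would build one plane per bit position with
# __ballot(); here the ballot is emulated with a serial loop.
M = 32          # group size: one "warp" of data items (illustrative)
BITS = 32       # width of each data item (illustrative)

def transpose_bits(items):
    """items: list of ints, len(items) a multiple of M.
    Returns the converted result data: for each group of M items,
    BITS unsigned numbers, one per bit position (LSB plane first)."""
    result = []
    for g in range(0, len(items), M):
        group = items[g:g + M]
        for b in range(BITS):                 # bit planes, low to high
            plane = 0
            for lane, v in enumerate(group):  # __ballot() analogue
                plane |= ((v >> b) & 1) << lane
            result.append(plane)
    return result

data = [5] * 32                       # strongly similar data items
planes = transpose_bits(data)
# bits 0 and 2 of every item are 1 -> those planes are all ones
print(hex(planes[0]), hex(planes[1]), hex(planes[2]))
```

The bit planes of identical values collapse into all-one or all-zero words, which is exactly the redundancy the bitmap levels below exploit.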
(2) If an M-bit unsigned number in the result data is all 0s or all 1s, the corresponding bit in the first-level bitmap is set to 1; otherwise it is set to 0;
in an optional embodiment, step (2) specifically includes:
(2.1) if an M-bit unsigned number in the result data is all 0s, the corresponding bit in first-level bitmap A is set to 1, otherwise it is set to 0;
(2.2) if an M-bit unsigned number in the result data is all 1s, the corresponding bit in first-level bitmap B is set to 1, otherwise it is set to 0.
(3) When an M-bit unsigned number in the first-level bitmap is all 0s or all 1s, the corresponding bit in the second-level bitmap is set to 1, otherwise it is set to 0, and a third-level bitmap is generated from the second-level bitmap in the same way, until every element of the generated level-N bitmap is 0;
in an optional embodiment, step (3) specifically includes:
(3.1) composing first-level bitmap A and first-level bitmap B into the first-level bitmap;
(3.2) if an M-bit unsigned number in the first-level bitmap is all 0s, the corresponding bit in second-level bitmap A is set to 1, otherwise it is set to 0;
(3.3) if an M-bit unsigned number in the first-level bitmap is all 1s, the corresponding bit in second-level bitmap B is set to 1, otherwise it is set to 0;
(3.4) composing second-level bitmap A and second-level bitmap B into the second-level bitmap, and generating a third-level bitmap from the second-level bitmap in the same way, until every element of the generated level-N bitmap is 0.
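The iteration in steps (3.1) to (3.4) can be sketched as follows, under one reading of the scheme: bit i of a level's bitmap A is 1 when the i-th M-bit word of the previous level is all 0s, bit i of bitmap B when it is all 1s, and the two bitmaps are concatenated before the rule is applied again. The helper names and the all-zero sample input are ours.

```python
# Sketch of multilevel bitmap generation over the converted result data.
M = 32
ALL_ONES = (1 << M) - 1

def pack(flags):
    """Pack a list of 0/1 flags into M-bit words, LSB first."""
    words = []
    for g in range(0, len(flags), M):
        w = 0
        for i, f in enumerate(flags[g:g + M]):
            w |= f << i
        words.append(w)
    return words

def next_level(words):
    a = pack([int(w == 0) for w in words])        # bitmap A: all-zero words
    b = pack([int(w == ALL_ONES) for w in words])  # bitmap B: all-one words
    return a + b                                   # "compose A with B"

def all_levels(result_words):
    levels = [next_level(result_words)]
    while any(w != 0 for w in levels[-1]):         # stop at an all-zero level
        levels.append(next_level(levels[-1]))
    return levels

# 32 all-zero result words -> level 1 is bitmap A = all ones, bitmap B = 0
levels = all_levels([0] * 32)
print(len(levels), [hex(w) for w in levels[0]])
```

Each level shrinks the data it describes by roughly a factor of M, so the recursion terminates after a handful of levels even for large buffers.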
(4) Determining from the generated multi-level bitmaps the target data in the result data that needs to be transmitted, and transmitting the multi-level bitmaps, the number of bitmap levels and the target data from the first GPU to the second GPU;
here, the untransmitted data are runs of 0s or 1s, which are represented by 1s in the bitmaps.
(5) The second GPU fills the data missing because it was not transmitted with 0s or 1s according to the multi-level bitmaps, the number of bitmap levels and the target data, and then obtains the data to be transmitted through the inverse of the data conversion.
The following describes a method for inter-GPU communication according to the present invention in detail with reference to the accompanying drawings and embodiments.
The invention can be realized by the following technical scheme (taking CUDA as a platform):
1. Data conversion: fig. 3 is a schematic diagram of the data conversion process disclosed in the embodiment of the present invention. Within each warp (a group of 32 adjacent threads), every thread reads one element of the original data in turn, the __ballot() function is used to obtain the 32-bit unsigned integers (i.e., M is 32) composed of the corresponding bits, and these are written to the result array in order. For details, see the description of the method embodiment above, which is not repeated here.
2. Bitmap generation: after the previous step is completed, the __ballot() function is again used to generate two bitmaps, where bitmap A stores the positions where runs of 0s appear in the converted data and bitmap B stores the positions where runs of 1s appear. A bitmap of the bitmap is then generated, storing the positions where runs appear in the first-level bitmap, and higher-level bitmaps are generated in the same way until the generated bitmap is all 0s.
3. Data transmission: fig. 4 is a schematic diagram illustrating the operation of the bitmaps according to an embodiment of the present invention. After the bitmaps are generated, they are transmitted first. The CPU can then quickly compute from the bitmaps which data need to be transmitted, namely the start address and length of each segment, and start the corresponding partial transfers accordingly. This process naturally skips the runs of 0s or 1s in the data.
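The "start address and length" computation the CPU performs can be sketched as follows; the per-word skip flags (a word is skipped when it is all 0s or all 1s and therefore covered by the bitmaps) and the function name are illustrative.

```python
# Sketch: merge the words that still need to go over PCIe into
# contiguous (start_word, num_words) segments, given per-word skip flags
# derived from the first-level bitmaps.
def segments(skip):
    segs, start = [], None
    for i, s in enumerate(skip):
        if not s and start is None:
            start = i                       # open a new segment
        elif s and start is not None:
            segs.append((start, i - start))  # close the segment
            start = None
    if start is not None:
        segs.append((start, len(skip) - start))
    return segs

# words 0-1 all zeros, word 4 all ones: transmit words 2-3 and 5-7
print(segments([True, True, False, False, True, False, False, False]))
```

Each returned pair maps directly to one partial copy (offset and length into the result array), so the runs covered by the bitmaps are never put on the bus.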
4. Data extraction: when the data transmission is completed, the target GPU holds the bitmaps and the partial result data. As required, the missing parts of the data can be filled with the corresponding 0s or 1s and the inverse data conversion performed to obtain the complete original data; alternatively, the bitmaps and data can be accessed directly in GPU code and used after inverse processing to obtain the original data.
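Putting the pieces together, a CPU round trip under the same reading (transpose, withhold the all-zero and all-one planes, refill and invert on the receiving side) shows that the scheme is lossless; the values and sizes are illustrative.

```python
# End-to-end CPU sketch: one group of M items, single bitmap level.
M = BITS = 32
ALL_ONES = (1 << M) - 1

def transpose(items):
    return [sum(((v >> b) & 1) << lane for lane, v in enumerate(items))
            for b in range(BITS)]

def untranspose(planes):
    return [sum(((planes[b] >> lane) & 1) << b for b in range(BITS))
            for lane in range(M)]

items = list(range(100, 132))            # 32 similar values
planes = transpose(items)
zero_idx = {i for i, p in enumerate(planes) if p == 0}         # bitmap A
one_idx = {i for i, p in enumerate(planes) if p == ALL_ONES}   # bitmap B
payload = [p for i, p in enumerate(planes)
           if i not in zero_idx and i not in one_idx]

# receiver: refill the skipped planes with 0 / all ones, then invert
it = iter(payload)
rebuilt = [0 if i in zero_idx else ALL_ONES if i in one_idx else next(it)
           for i in range(BITS)]
assert untranspose(rebuilt) == items     # lossless round trip
print(f"{len(payload)} of {BITS} planes transmitted")
```

For these 32 values only the 8 low bit planes vary, so only a quarter of the converted words need to cross PCIe; the flags themselves would travel as the compact multi-level bitmaps.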
The present invention provides in another aspect a system for inter-GPU communication, comprising:
a data conversion module, configured to combine the bits of the data to be transmitted into M-bit unsigned numbers and store the M-bit unsigned number for each bit position in order from the least significant bit upward, to obtain the result data after data conversion;
a bitmap generation module, configured to set the corresponding bit in the first-level bitmap to 1 when an M-bit unsigned number in the result data is all 0s or all 1s, and to 0 otherwise; and, when an M-bit unsigned number in the first-level bitmap is all 0s or all 1s, to set the corresponding bit in the second-level bitmap to 1 and otherwise to 0, generating a third-level bitmap from the second-level bitmap in the same way until every element of the generated level-N bitmap is 0;
a data transmission module, configured to determine from the generated multi-level bitmaps the target data in the result data that needs to be transmitted, and to transmit the multi-level bitmaps, the number of bitmap levels and the target data from the first GPU to the second GPU;
and a data extraction module, configured for the second GPU to fill the data missing because it was not transmitted with 0s or 1s according to the multi-level bitmaps, the number of bitmap levels and the target data, and then to obtain the data to be transmitted through the inverse of the data conversion.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A method for inter-GPU communication, comprising:
(1) combining the bits of the data to be transmitted into M-bit unsigned numbers, and storing the M-bit unsigned number for each bit position in order from the least significant bit upward, to obtain the result data after data conversion;
(2) if an M-bit unsigned number in the result data is all 0s or all 1s, setting the corresponding bit in a first-level bitmap to 1, and otherwise setting it to 0;
(3) when an M-bit unsigned number in the first-level bitmap is all 0s or all 1s, setting the corresponding bit in a second-level bitmap to 1, and otherwise setting it to 0, and generating a third-level bitmap from the second-level bitmap in the same way, until every element of the generated level-N bitmap is 0;
(4) determining from the generated multi-level bitmaps the target data in the result data that needs to be transmitted, and transmitting the multi-level bitmaps, the number of bitmap levels and the target data from the first GPU to the second GPU, wherein the untransmitted data are runs of 0s or 1s, which are represented by 1s in the bitmaps;
(5) the second GPU filling the data missing because it was not transmitted with 0s or 1s according to the multi-level bitmaps, the number of bitmap levels and the target data, and then obtaining the data to be transmitted through the inverse of the data conversion.
2. The method according to claim 1, wherein step (1) specifically comprises:
(1.1) treating every M data items of the data to be transmitted as one processing group;
(1.2) combining the bits of each group of data into M-bit unsigned numbers;
(1.3) storing the M-bit unsigned number for each bit position in order, from the least significant bit upward;
(1.4) extracting in turn the group of data at the same significant-bit position of every M-bit unsigned number, and concatenating the extracted groups in order of significant bit from low to high to form the result data.
3. The method according to claim 2, wherein step (2) specifically comprises:
(2.1) if an M-bit unsigned number in the result data is all 0s, setting the corresponding bit in first-level bitmap A to 1, and otherwise setting it to 0;
(2.2) if an M-bit unsigned number in the result data is all 1s, setting the corresponding bit in first-level bitmap B to 1, and otherwise setting it to 0.
4. The method of claim 3, wherein step (3) comprises:
(3.1) composing first-level bitmap A and first-level bitmap B into the first-level bitmap;
(3.2) if an M-bit unsigned number in the first-level bitmap is all 0s, setting the corresponding bit in second-level bitmap A to 1, and otherwise setting it to 0;
(3.3) if an M-bit unsigned number in the first-level bitmap is all 1s, setting the corresponding bit in second-level bitmap B to 1, and otherwise setting it to 0;
(3.4) composing second-level bitmap A and second-level bitmap B into the second-level bitmap, and generating a third-level bitmap from the second-level bitmap in the same way, until every element of the generated level-N bitmap is 0.
5. A system for inter-GPU communication, comprising:
a data conversion module, configured to combine the bits of the data to be transmitted into M-bit unsigned numbers and store the M-bit unsigned number for each bit position in order from the least significant bit upward, to obtain the result data after data conversion;
a bitmap generation module, configured to set the corresponding bit in the first-level bitmap to 1 when an M-bit unsigned number in the result data is all 0s or all 1s, and to 0 otherwise; and, when an M-bit unsigned number in the first-level bitmap is all 0s or all 1s, to set the corresponding bit in the second-level bitmap to 1 and otherwise to 0, generating a third-level bitmap from the second-level bitmap in the same way until every element of the generated level-N bitmap is 0;
a data transmission module, configured to determine from the generated multi-level bitmaps the target data in the result data that needs to be transmitted, and to transmit the multi-level bitmaps, the number of bitmap levels and the target data from the first GPU to the second GPU, wherein the untransmitted data are runs of 0s or 1s, which are represented by 1s in the bitmaps;
and a data extraction module, configured for the second GPU to fill the data missing because it was not transmitted with 0s or 1s according to the multi-level bitmaps, the number of bitmap levels and the target data, and then to obtain the data to be transmitted through the inverse of the data conversion.
6. The system of claim 5, wherein the data conversion module comprises:
a data grouping module, configured to treat every M items of the data to be transmitted as one group, forming a data processing unit;
an unsigned number generation module, configured to combine the bits at each bit position of each group into an M-bit unsigned number;
a data processing module, configured to store the M-bit unsigned numbers corresponding to each bit position in order, from the least significant bit position to the most significant;
and a result data generation module, configured to extract, at each bit position in turn, the group of bits that the M-bit unsigned numbers hold at that position, and to assemble the extracted groups into the result data in order of bit position from low to high.
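The conversion described in this claim amounts to a bit-matrix transpose: each output word collects one bit position across the M items of a group. A minimal sketch, assuming the items are `width`-bit integers; the names are illustrative, not from the patent:

```python
def bit_transpose(group, width=8):
    """Transpose an M-item group of width-bit values into width M-bit
    unsigned numbers: output[b] collects bit b of every item, with
    item 0 landing in the least significant bit of the output word."""
    out = []
    for b in range(width):  # one output word per bit position
        word = 0
        for i, item in enumerate(group):
            word |= ((item >> b) & 1) << i
        out.append(word)
    return out
```

For example, `bit_transpose([0b01, 0b00, 0b11, 0b01], width=2)` produces one word per bit plane, and a bit plane that is zero across the whole group yields an all-zero word, which is exactly what the later bitmap stages elide.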
7. The system of claim 6, wherein the bitmap generation module comprises:
a first bitmap generation module, configured to set the corresponding bit of first-level bitmap A to 1 when M consecutive all-zero unsigned numbers exist in the result data, and to 0 otherwise;
and a second bitmap generation module, configured to set the corresponding bit of first-level bitmap B to 1 when M consecutive all-one unsigned numbers exist in the result data, and to 0 otherwise.
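A hedged sketch of how the two first-level bitmaps might be built, assuming the result data is a list of M-bit words and each bitmap is packed into a Python integer; all names are illustrative:

```python
def level_bitmaps(words, m):
    """Build first-level bitmaps A (all-zero runs) and B (all-one runs).

    Bit k of A is 1 iff words[k*m:(k+1)*m] are all 0; bit k of B is 1
    iff they are all (2**m - 1). An incomplete tail run is left at 0
    and transmitted as-is.
    """
    full = (1 << m) - 1
    a = b = 0
    for k in range(0, len(words), m):
        chunk = words[k:k + m]
        if len(chunk) < m:
            continue  # incomplete tail run: never elided
        if all(w == 0 for w in chunk):
            a |= 1 << (k // m)
        elif all(w == full for w in chunk):
            b |= 1 << (k // m)
    return a, b
```

Splitting the two run kinds into separate bitmaps lets the receiver know whether to refill an elided run with 0s or with 1s.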
8. The system of claim 7, wherein the bitmap generation module further comprises:
a first combining module, configured to combine first-level bitmap A and first-level bitmap B into the first-level bitmap;
a third bitmap generation module, configured to set the corresponding bit of second-level bitmap A to 1 when M consecutive all-zero values exist in the first-level bitmap, and to 0 otherwise;
a fourth bitmap generation module, configured to set the corresponding bit of second-level bitmap B to 1 when M consecutive all-one values exist in the first-level bitmap, and to 0 otherwise;
and a loop processing module, configured to combine second-level bitmap A and second-level bitmap B into the second-level bitmap, and to generate a third-level bitmap from the second-level bitmap in the same manner, continuing until every element of the generated level-N bitmap is 0.
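The loop described here can be sketched as follows, representing each level's combined bitmap as a list of 0/1 flags and stopping once a level contains no 1 bits; this simplified version assumes M >= 2, and all names are illustrative rather than taken from the patent:

```python
def _mark_runs(seq, m, uniform_values):
    """One output bit per run of m items: 1 iff the run is complete and
    every item equals the same single value drawn from uniform_values."""
    out = []
    for k in range(0, len(seq), m):
        run = seq[k:k + m]
        out.append(1 if len(run) == m and len(set(run)) == 1
                   and run[0] in uniform_values else 0)
    return out


def build_levels(words, m):
    """Level 1 marks runs of all-0 or all-(2**m - 1) result-data words;
    each later level marks runs of m identical bits in the level below.
    Stops once a level contains no 1 bits (assumes m >= 2, so the loop
    terminates: each level is at most 1/m the length of the previous)."""
    levels = [_mark_runs(words, m, {0, (1 << m) - 1})]
    while any(levels[-1]):
        levels.append(_mark_runs(levels[-1], m, {0, 1}))
    return levels
```

Each added level compresses the bitmap below it, so for highly uniform data the metadata shipped between GPUs stays small even when the original buffer is large.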
CN201711115570.5A 2017-11-13 2017-11-13 Method and system for communication between GPUs Active CN108052482B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711115570.5A CN108052482B (en) 2017-11-13 2017-11-13 Method and system for communication between GPUs

Publications (2)

Publication Number Publication Date
CN108052482A CN108052482A (en) 2018-05-18
CN108052482B true CN108052482B (en) 2020-05-19

Family

ID=62120049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711115570.5A Active CN108052482B (en) 2017-11-13 2017-11-13 Method and system for communication between GPUs

Country Status (1)

Country Link
CN (1) CN108052482B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105117170A (en) * 2015-08-24 2015-12-02 浪潮(北京)电子信息产业有限公司 Computer system architecture
CN105183692A (en) * 2015-09-22 2015-12-23 浪潮(北京)电子信息产业有限公司 Method and system for data communication between cluster system devices
CN105975434A (en) * 2016-04-29 2016-09-28 中国人民解放军国防科学技术大学 Heterogeneous system-oriented data transmission optimization method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8539020B2 (en) * 2010-06-14 2013-09-17 Microsoft Corporation Sessions to host processes with special requirements


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Optimizing Graph Processing on GPUs; Wenyong Zhong et al.; IEEE Transactions on Parallel and Distributed Systems; 2016-09-20; Vol. 28, No. 4; full text *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant