CN106776455A - Method and device for single-machine multi-GPU communication - Google Patents
Method and device for single-machine multi-GPU communication
- Publication number
- CN106776455A CN106776455A CN201611149576.XA CN201611149576A CN106776455A CN 106776455 A CN106776455 A CN 106776455A CN 201611149576 A CN201611149576 A CN 201611149576A CN 106776455 A CN106776455 A CN 106776455A
- Authority
- CN
- China
- Prior art keywords
- gpu
- direct
- predetermined communication
- communication data
- connected relation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
Abstract
The invention discloses a method and device for single-machine multi-GPU communication. The method includes: determining GPU direct-connection relation data; determining, according to a data broadcast, the predetermined communication data and the GPUs that need to communicate, placing the GPUs that contain the predetermined communication data in a first set and the other GPUs in a second set; having the GPUs in the first set transmit the predetermined communication data, according to the direct-connection relation data, to the GPUs in the second set that are directly connected to them, and moving each GPU that receives the data from the second set into the first set, until the second set is empty or contains only remaining GPUs with no direct connection to any GPU in the first set; and, when such remaining GPUs exist, having the CPU transmit the predetermined communication data to them. This avoids routing all inter-GPU data transfers through the CPU, which would make the CPU a bottleneck.
Description
Technical field
The present invention relates to the technical field of data processing, and more particularly to a method and device for single-machine multi-GPU communication.
Background art
Since NVIDIA released the G80 graphics processor (containing 128 stream processors) in 2006, the graphics processing unit (GPU, Graphics Processing Unit) has improved performance by more than 100x over the CPU in some large-scale parallel computing applications. A GPU devotes more of its transistors to data processing, rather than to data caching and instruction control as a CPU does, which gives it enormous parallel computing power. GPU many-core processors have a higher density of computing resources and higher computational performance, with double-precision performance exceeding 1 TFlops.
With the development of high-performance computing application software, applications demand ever more computational performance. Compared with traditional CPU clusters, CPU+GPU heterogeneous co-computing offers advantages such as higher performance and lower cost, so more and more high-performance computing applications adopt the CPU+GPU heterogeneous co-computing model.
The CPU+GPU heterogeneous co-computing architecture within one compute node is shown in Fig. 1. In some application scenarios with very large computational workloads, such as training deep-learning neural networks, multiple GPUs must work in concert, so the data transmission speed among the GPUs has a large impact on the performance of the whole application. How to achieve efficient data transfer on the basis of the existing hardware architecture is therefore a problem.
Summary of the invention
An object of the present invention is to provide a method and device for single-machine multi-GPU communication that uses GPU Direct technology to avoid routing all inter-GPU data transfers through the CPU, which would make the CPU a bottleneck, and that performs reasonable path planning according to the specific hardware topology to achieve high-speed communication among multiple GPUs.
To solve the above technical problem, the present invention provides a single-machine multi-GPU communication method, the method comprising:
detecting all GPUs and determining GPU direct-connection relation data;
determining, according to a data broadcast, the predetermined communication data and the GPUs that need to communicate, dividing the GPUs that contain the predetermined communication data into a first set and the GPUs that do not contain the predetermined communication data into a second set;
having the GPUs in the first set transmit the predetermined communication data, according to the GPU direct-connection relation data, to the GPUs in the second set that are directly connected to them, and, after each data transfer completes, moving the GPUs in the second set that now hold the predetermined communication data into the first set, until the second set is empty or the second set contains only remaining GPUs with no direct connection to any GPU in the first set;
when such remaining GPUs exist in the second set, having the CPU transmit the predetermined communication data to the remaining GPUs.
Optionally, detecting all GPUs and determining the GPU direct-connection relation data includes:
traversing all pairs of GPUs with a double loop to obtain a data table recording whether a direct connection exists between any two GPUs.
Optionally, data is transferred between two directly connected GPUs via GPU Direct technology.
Optionally, the CPU transmitting the predetermined communication data to the remaining GPUs includes:
in a time step, a predetermined GPU in the first set transferring the predetermined communication data to the CPU's memory, and the predetermined communication data then being transferred from that memory to the remaining GPUs.
The present invention also provides a device for single-machine multi-GPU communication, including:
a direct-connection detection module, for detecting all GPUs and determining GPU direct-connection relation data;
a set division module, for determining, according to a data broadcast, the predetermined communication data and the GPUs that need to communicate, dividing the GPUs that contain the predetermined communication data into a first set and the GPUs that do not contain the predetermined communication data into a second set;
a direct data transmission module, for having the GPUs in the first set transmit the predetermined communication data, according to the GPU direct-connection relation data, to the GPUs in the second set that are directly connected to them, and, after each data transfer completes, moving the GPUs in the second set that hold the predetermined communication data into the first set, until the second set is empty or the second set contains only remaining GPUs with no direct connection to any GPU in the first set;
a CPU data transmission module, for having the CPU transmit the predetermined communication data to the remaining GPUs when such remaining GPUs exist in the second set.
Optionally, the direct-connection detection module is specifically a module that traverses all pairs of GPUs with a double loop to obtain a data table recording whether a direct connection exists between any two GPUs.
Optionally, the direct data transmission module includes:
a direct data transmission unit, for having the GPUs in the first set transmit the predetermined communication data, via GPU Direct technology and according to the GPU direct-connection relation data, to the GPUs in the second set that are directly connected to them.
Optionally, the CPU data transmission module is specifically a module that, when remaining GPUs exist in the second set, has a predetermined GPU in the first set transfer the predetermined communication data to the CPU's memory in a time step and then transfers the predetermined communication data from that memory to the remaining GPUs.
In the single-machine multi-GPU communication method provided by the present invention, the method includes: determining GPU direct-connection relation data; determining, according to a data broadcast, the predetermined communication data and the GPUs that need to communicate, placing the GPUs that contain the predetermined communication data in a first set and the other GPUs in a second set; having the GPUs in the first set transmit the predetermined communication data, according to the direct-connection relation data, to the GPUs in the second set that are directly connected to them, and moving each GPU that receives the data from the second set into the first set, until the second set is empty or contains only remaining GPUs with no direct connection to any GPU in the first set; and, when such remaining GPUs exist, having the CPU transmit the predetermined communication data to them.
It can be seen that the method uses GPU Direct technology to avoid routing all inter-GPU data transfers through the CPU, which would make the CPU a bottleneck, while performing reasonable path planning according to the specific hardware topology to achieve high-speed communication among multiple GPUs. The present invention also provides a device for single-machine multi-GPU communication with the above beneficial effects, which is not described again here.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention; those of ordinary skill in the art may obtain other drawings from them without creative effort.
Fig. 1 is a diagram of the CPU+GPU heterogeneous co-computing architecture provided by an embodiment of the present invention;
Fig. 2 is a flowchart of the single-machine multi-GPU communication method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of multi-GPU communication path planning provided by an embodiment of the present invention;
Fig. 4 is a structural block diagram of the single-machine multi-GPU communication device provided by an embodiment of the present invention.
Detailed description of embodiments
The core of the present invention is to provide a method and device for single-machine multi-GPU communication that uses GPU Direct technology to avoid routing all inter-GPU data transfers through the CPU, which would make the CPU a bottleneck, and that performs reasonable path planning according to the specific hardware topology to achieve high-speed communication among multiple GPUs.
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
This embodiment uses direct GPU-to-GPU connections, without going through the CPU, to achieve high-speed communication among multiple GPUs within one compute node, thereby avoiding the CPU becoming a bottleneck for all inter-GPU data transfers; it also selects a suitable communication path according to the particular hardware topology, optimizing computational performance. The control process of the method can be implemented on GPU many-core processors, which possess strong computing power and are mainly used for core computation tasks. Refer to Fig. 2, a flowchart of the single-machine multi-GPU communication method provided by an embodiment of the present invention; the method may include:
S100: detect all GPUs and determine the GPU direct-connection relation data.
Specifically, the hardware connections of all GPUs are detected to determine the GPU direct-connection relation data, i.e., which GPUs are directly connected. The GPU direct-connection relation data may be stored as a table or as a mapping relation. It may record only the GPU pairs that are directly connected, or record both the directly connected pairs and the pairs that are not. As long as, given any GPU, the GPUs directly connected to it can be determined from the data, the data suffices; this embodiment therefore does not restrict the content or form of the GPU direct-connection relation data.
To further improve GPU direct-connection detection, preferably, detecting all GPUs and determining the GPU direct-connection relation data may include:
traversing all pairs of GPUs with a double loop to obtain a data table recording whether a direct connection exists between any two GPUs.
Specifically, for a node with N GPU cards, a double loop traverses all pairs of cards to obtain a data table recording whether a direct connection exists between any two GPUs.
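The double-loop detection step can be sketched as follows. This is a minimal illustration in Python: `can_access_peer` is a stand-in for a real peer-access query such as CUDA's `cudaDeviceCanAccessPeer`, and the link set in the usage example is an assumed topology, not one prescribed by the patent.

```python
def build_direct_link_table(n, can_access_peer):
    """Traverse all pairs of GPUs with a double loop and record whether
    a direct (peer-to-peer) connection exists between any two of them."""
    table = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and can_access_peer(i, j):
                table[i][j] = True
    return table

# Assumed example topology for illustration (links are symmetric):
links = {(0, 1), (0, 2), (1, 4), (1, 5), (2, 6)}
can_access_peer = lambda i, j: (i, j) in links or (j, i) in links
table = build_direct_link_table(8, can_access_peer)
```

On real hardware the predicate would query the driver once per ordered pair, so for N cards the table costs N·(N-1) queries, which is negligible next to the transfers it plans.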
S110: determine, according to the data broadcast, the predetermined communication data and the GPUs that need to communicate; divide the GPUs that contain the predetermined communication data into a first set and the GPUs that do not contain it into a second set.
Specifically, referring to Fig. 3, which shows the initial state of a GPU broadcast, the first set contains only GPU0, and the remaining GPUs, which do not contain the predetermined communication data, are in the second set.
S120: the GPUs in the first set transmit the predetermined communication data, according to the GPU direct-connection relation data, to the GPUs in the second set that are directly connected to them; after each data transfer completes, the GPUs in the second set that now hold the predetermined communication data are moved into the first set, until the second set is empty or the second set contains only remaining GPUs with no direct connection to any GPU in the first set.
Specifically, in this embodiment any two directly connected GPUs can transfer data to each other directly (e.g., via GPU Direct technology). Such direct GPU-to-GPU communication typically arises because multiple GPUs are installed in one compute node to accelerate computation-heavy applications, such as training deep-learning neural networks. GPU Direct technology allows GPUs to transfer data directly, without the CPU as an intermediary, greatly improving data transmission speed.
Specifically, because the physical topology between the GPUs and the CPU is generally tree-shaped, traditional data transmission methods tend to use a tree-shaped communication pattern consistent with the physical topology. But that pattern inherently makes the root node of the tree a communication performance bottleneck. Here the communication path is instead planned as a logical ring independent of the physical topology, which solves the communication-bandwidth bottleneck. This embodiment determines the communication path from the set division and the GPU direct-connection relation data, selecting a suitable communication-path plan for each hardware topology, and thereby optimizes single-machine multi-GPU computing performance. Detecting the GPU direct connections and planning paths accordingly achieves high-speed communication.
The above steps are illustrated with an example:
Suppose N GPU cards, numbered 0, 1, ..., N-1, need to communicate, and the data broadcast requires data D (the predetermined communication data) to be transferred from GPU0 to all the other GPUs.
First build the direct-connection table between every pair of GPUs, then build two sets: the GPUs in one set already contain data D (the first set); the GPUs in the other set do not (the second set). In each time step, the GPUs in the first set transmit data D, according to the direct-connection table, to GPUs in the second set; after a transfer completes, the GPUs that now hold data D move from the second set into the first set. This continues until the second set is an empty set or no direct connection exists between the second set and the first set.
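The set-based broadcast loop just described can be sketched in Python. This is a minimal illustration under the same assumptions as the earlier sketch: `table` is the pairwise direct-connection table, and the topology in the usage example is assumed for demonstration.

```python
def broadcast_direct(n, table, src):
    """GPUs holding the data (first set) send it over direct links to
    GPUs lacking it (second set); receivers join the first set.
    Stops when the second set is empty or no direct link remains."""
    first = {src}
    second = set(range(n)) - first
    while second:
        reached = {j for i in first for j in second if table[i][j]}
        if not reached:        # no direct link left: CPU must relay
            break
        first |= reached       # receivers now hold the data
        second -= reached
    return first, second       # `second` holds the remaining GPUs

# Assumed topology 0-1, 0-2, 1-4, 1-5, 2-6 on 8 GPUs:
links = {(0, 1), (0, 2), (1, 4), (1, 5), (2, 6)}
table = [[(i, j) in links or (j, i) in links for j in range(8)]
         for i in range(8)]
first, remaining = broadcast_direct(8, table, 0)
```

With this assumed topology the loop reaches GPUs 1 and 2 in the first time step and GPUs 4, 5, and 6 in the second, leaving GPUs 3 and 7 as remaining GPUs for the CPU path of S130.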
S130: when remaining GPUs exist in the second set, the CPU transmits the predetermined communication data to the remaining GPUs.
That is, if the second set is not empty, the first set must transmit the predetermined communication data to the second set through the CPU.
Specifically, when remaining GPUs exist in the second set, in a time step a predetermined GPU in the first set transfers the predetermined communication data to the CPU's memory, and the predetermined communication data is then transferred from that memory to the remaining GPUs. The predetermined GPU here can be any GPU chosen from the first set; this embodiment does not restrict which GPU it is. When the CPU transmits the predetermined communication data to the remaining GPUs in the second set, it may transmit to one remaining GPU at a time and, after the transfer completes, move that GPU, which now holds the predetermined communication data, into the first set, until the second set is empty. That is, in each time step, this embodiment picks an arbitrary GPU from the first set, transfers the data to the host CPU's memory, transfers it from host memory to any one GPU in the second set, and updates the first and second sets, repeating the process until the second set is empty.
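The CPU relay fallback can be sketched as a continuation of the loop above. Again a minimal illustration: `copy_via_host` is a stand-in for a real staged transfer (a device-to-host copy followed by a host-to-device copy), and the starting sets come from the assumed example topology.

```python
def cpu_relay(first, second, copy_via_host):
    """Each time step: one GPU in the first set stages the data in host
    memory, one remaining GPU receives it and joins the first set."""
    first, second = set(first), set(second)
    while second:
        src = next(iter(first))   # any GPU already holding the data
        dst = second.pop()
        copy_via_host(src, dst)   # GPU src -> CPU memory -> GPU dst
        first.add(dst)
    return first

transfers = []
done = cpu_relay({0, 1, 2, 4, 5, 6}, {3, 7},
                 lambda s, d: transfers.append((s, d)))
```

Each relayed GPU costs two PCIe copies through host memory, which is why the method exhausts direct links first and uses the CPU only for the remainder.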
The above process can be illustrated with Fig. 3. Suppose the GPUs directly connected to GPU0 are GPU1 and GPU2. In the first time step, GPU0 transfers the predetermined communication data to GPU1 and GPU2; after the transfer completes, GPU1 and GPU2 move into the first set. Now suppose the GPUs directly connected to GPU1 are GPU4 and GPU5, and the GPU directly connected to GPU2 is GPU6. In the second time step, GPU1 transfers the predetermined communication data to GPU4 and GPU5, and GPU2 transfers it to GPU6. At this point, if GPU7 has no direct connection to any GPU in the first set, then GPU7 is a remaining GPU; GPU0 is then chosen to transfer the predetermined communication data to the CPU's memory, from which it is transferred to the remaining GPU7, finally completing the transmission of the predetermined communication data in the multi-GPU communication.
Based on the above technical solution, the single-machine multi-GPU communication method provided by this embodiment of the present invention determines an optimal data transmission path by checking the direct connections between GPUs, avoids routing all inter-GPU data transfers through the CPU, which would make the CPU a bottleneck, makes full use of GPU bandwidth resources, and achieves high-speed data delivery among multiple GPUs.
The device for single-machine multi-GPU communication provided by an embodiment of the present invention is introduced below; the device described below and the single-machine multi-GPU communication method described above may be cross-referenced.
Refer to Fig. 4, a structural block diagram of the single-machine multi-GPU communication device provided by an embodiment of the present invention; the device may include:
a direct-connection detection module 100, for detecting all GPUs and determining the GPU direct-connection relation data;
a set division module 200, for determining, according to a data broadcast, the predetermined communication data and the GPUs that need to communicate, dividing the GPUs that contain the predetermined communication data into a first set and the GPUs that do not contain the predetermined communication data into a second set;
a direct data transmission module 300, for having the GPUs in the first set transmit the predetermined communication data, according to the GPU direct-connection relation data, to the GPUs in the second set that are directly connected to them, and, after each data transfer completes, moving the GPUs in the second set that hold the predetermined communication data into the first set, until the second set is empty or the second set contains only remaining GPUs with no direct connection to any GPU in the first set;
a CPU data transmission module 400, for having the CPU transmit the predetermined communication data to the remaining GPUs when such remaining GPUs exist in the second set.
Based on the above embodiment, the direct-connection detection module 100 is specifically a module that traverses all pairs of GPUs with a double loop to obtain a data table recording whether a direct connection exists between any two GPUs.
Based on the above embodiment, the direct data transmission module 300 may include:
a direct data transmission unit, for having the GPUs in the first set transmit the predetermined communication data, via GPU Direct technology and according to the GPU direct-connection relation data, to the GPUs in the second set that are directly connected to them.
Based on the above embodiment, the CPU data transmission module 400 is specifically a module that, when remaining GPUs exist in the second set, has a predetermined GPU in the first set transfer the predetermined communication data to the CPU's memory in a time step and then transfers the predetermined communication data from that memory to the remaining GPUs.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the identical or similar parts of the embodiments may be cross-referenced. For the device disclosed in an embodiment, since it corresponds to the method disclosed in an embodiment, its description is relatively simple; refer to the method description for the relevant parts.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may implement the described functions in different ways for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the methods or algorithms described in the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The method and device for single-machine multi-GPU communication provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. It should be pointed out that those of ordinary skill in the art may also make several improvements and modifications to the present invention without departing from the principle of the present invention, and these improvements and modifications also fall within the protection scope of the claims of the present invention.
Claims (8)
1. A single-machine multi-GPU communication method, characterized in that the method comprises:
detecting all GPUs and determining GPU direct-connection relation data;
determining, according to a data broadcast, the predetermined communication data and the GPUs that need to communicate, dividing the GPUs that contain the predetermined communication data into a first set and the GPUs that do not contain the predetermined communication data into a second set;
having the GPUs in the first set transmit the predetermined communication data, according to the GPU direct-connection relation data, to the GPUs in the second set that are directly connected to them, and, after each data transfer completes, moving the GPUs in the second set that hold the predetermined communication data into the first set, until the second set is empty or the second set contains only remaining GPUs with no direct connection to any GPU in the first set;
when such remaining GPUs exist in the second set, having the CPU transmit the predetermined communication data to the remaining GPUs.
2. The method according to claim 1, characterized in that detecting all GPUs and determining the GPU direct-connection relation data includes:
traversing all pairs of GPUs with a double loop to obtain a data table recording whether a direct connection exists between any two GPUs.
3. The method according to claim 2, characterized in that data is transferred between two directly connected GPUs via GPU Direct technology.
4. The method according to claim 3, characterized in that the CPU transmitting the predetermined communication data to the remaining GPUs includes:
in a time step, a predetermined GPU in the first set transferring the predetermined communication data to the CPU's memory, and the predetermined communication data being transferred from that memory to the remaining GPUs.
5. A device for single-machine multi-GPU communication, characterized in that it comprises:
a direct-connection detection module, for detecting all GPUs and determining GPU direct-connection relation data;
a set division module, for determining, according to a data broadcast, the predetermined communication data and the GPUs that need to communicate, dividing the GPUs that contain the predetermined communication data into a first set and the GPUs that do not contain the predetermined communication data into a second set;
a direct data transmission module, for having the GPUs in the first set transmit the predetermined communication data, according to the GPU direct-connection relation data, to the GPUs in the second set that are directly connected to them, and, after each data transfer completes, moving the GPUs in the second set that hold the predetermined communication data into the first set, until the second set is empty or the second set contains only remaining GPUs with no direct connection to any GPU in the first set;
a CPU data transmission module, for having the CPU transmit the predetermined communication data to the remaining GPUs when such remaining GPUs exist in the second set.
6. The device according to claim 5, characterized in that the direct-connection detection module is specifically a module that traverses all pairs of GPUs with a double loop to obtain a data table recording whether a direct connection exists between any two GPUs.
7. The device according to claim 6, characterized in that the direct data transmission module includes:
a direct data transmission unit, for having the GPUs in the first set transmit the predetermined communication data, via GPU Direct technology and according to the GPU direct-connection relation data, to the GPUs in the second set that are directly connected to them.
8. The device according to claim 7, characterized in that the CPU data transmission module is specifically a module that, when the remaining GPUs exist in the second set, has the predetermined GPU in the first set transfer, within a time step, the predetermined communication data to the memory of the CPU, and then transfers the predetermined communication data from the memory to the remaining GPUs.
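The CPU relay path of claims 4 and 8 can be sketched in the same simulated style. The memories are modeled as plain dictionaries; the function name, the `"grad"` key, and the example state are hypothetical, standing in for device-to-host and host-to-device copies.

```python
def cpu_relay(gpu_memory, cpu_memory, predetermined_gpu, remaining_gpus, key):
    """Within one time step: stage the data in CPU memory, then fan it
    out from CPU memory to every remaining GPU (claims 4 and 8)."""
    # Step 1: the predetermined GPU in the first set writes to CPU memory.
    cpu_memory[key] = gpu_memory[predetermined_gpu][key]
    # Step 2: the data is copied from CPU memory to each remaining GPU.
    for gpu in remaining_gpus:
        gpu_memory[gpu][key] = cpu_memory[key]

# Hypothetical state: GPU 0 holds the data; GPUs 2 and 3 lack a direct link.
gpu_memory = {0: {"grad": [1.0, 2.0]}, 2: {}, 3: {}}
cpu_memory = {}
cpu_relay(gpu_memory, cpu_memory, 0, [2, 3], "grad")
```

In a real system the two steps would be device-to-host and host-to-device memory copies (e.g. via the CUDA runtime); only the remaining GPUs without a direct connection take this slower path.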
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611149576.XA CN106776455B (en) | 2016-12-13 | 2016-12-13 | Single-machine multi-GPU communication method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106776455A true CN106776455A (en) | 2017-05-31 |
CN106776455B CN106776455B (en) | 2020-08-21 |
Family
ID=58876868
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611149576.XA Active CN106776455B (en) | 2016-12-13 | 2016-12-13 | Single-machine multi-GPU communication method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106776455B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070103475A1 (en) * | 2005-11-10 | 2007-05-10 | Via Technologies, Inc. | Interruptible GPU and method for processing multiple contexts and runlists |
CN102007479A (en) * | 2008-03-31 | 2011-04-06 | 先进微装置公司 | Peer-to-peer special purpose processor architecture and method |
CN103049421A (en) * | 2012-12-11 | 2013-04-17 | 百度在线网络技术(北京)有限公司 | Method and device for data transmission between central processing unit (CPU) and co-processors |
CN104035751A (en) * | 2014-06-20 | 2014-09-10 | 深圳市腾讯计算机系统有限公司 | Graphics processing unit based parallel data processing method and device |
CN105467443A (en) * | 2015-12-09 | 2016-04-06 | 中国科学院地质与地球物理研究所 | A three-dimensional anisotropy elastic wave numerical simulation method and system |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109933433A (en) * | 2019-03-19 | 2019-06-25 | 合肥中科类脑智能技术有限公司 | A kind of GPU resource scheduling system and its dispatching method |
CN109933433B (en) * | 2019-03-19 | 2021-06-25 | 合肥中科类脑智能技术有限公司 | GPU resource scheduling system and scheduling method thereof |
CN110377537A (en) * | 2019-06-25 | 2019-10-25 | 苏州浪潮智能科技有限公司 | A kind of data transmission method, device and medium based on high speed signal switching chip |
CN110389928A (en) * | 2019-06-25 | 2019-10-29 | 苏州浪潮智能科技有限公司 | A kind of data transmission method, device and medium based on high speed signal switching chip |
CN110569312A (en) * | 2019-11-06 | 2019-12-13 | 创业慧康科技股份有限公司 | big data rapid retrieval system based on GPU and use method thereof |
CN113395216A (en) * | 2020-03-11 | 2021-09-14 | 辉达公司 | Techniques to transfer data between hardware devices |
CN113395216B (en) * | 2020-03-11 | 2024-04-09 | 辉达公司 | Techniques for transferring data between hardware devices |
CN114359015A (en) * | 2021-12-08 | 2022-04-15 | 北京百度网讯科技有限公司 | Data transmission method and device and graphic processing server |
CN114359015B (en) * | 2021-12-08 | 2023-08-04 | 北京百度网讯科技有限公司 | Data transmission method, device and graphic processing server |
Also Published As
Publication number | Publication date |
---|---|
CN106776455B (en) | 2020-08-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106776455A (en) | Single-machine multi-GPU communication method and device | |
CN106415515B (en) | Grouping is sent using the PIO of the optimization without SFENCE write-in sequence | |
CN106056529B (en) | Method and equipment for training convolutional neural network for picture recognition | |
CN103049241B (en) | A kind of method improving CPU+GPU isomery device calculated performance | |
CN108563808A (en) | The design method of heterogeneous reconfigurable figure computation accelerator system based on FPGA | |
CN105159610B (en) | Large-scale data processing system and method | |
CN103559084B (en) | A kind of virtual machine migration method at Energy-saving Data center | |
CN103763173B (en) | Data transmission method and calculate node | |
CN103197979B (en) | Method and device for realizing data interaction access among processes | |
CN103631878B (en) | A kind of massive data of graph structure processing method, device and system | |
CN105956659A (en) | Data processing device, data processing system and server | |
CN107301455A (en) | Mixing cube storage system and speed-up computation method for convolutional neural networks | |
CN107622519A (en) | Threedimensional model hybrid rending system and method based on mobile device | |
CN102761489B (en) | Inter-core communication method realizing data packet zero-copying based on pipelining mode | |
CN107122490A (en) | The data processing method and system of aggregate function in a kind of Querying by group | |
CN108932588A (en) | A kind of the GROUP OF HYDROPOWER STATIONS Optimal Scheduling and method of front and back end separation | |
CN106453618A (en) | Remote sensing image processing service cloud platform system based on G-Cloud cloud computing | |
CN107992572A (en) | A kind of distributed graph coloring algorithm based on Pregel | |
CN107402902A (en) | A kind of heterogeneous computing platforms and the accelerated method based on heterogeneous computing platforms | |
CN104699946A (en) | Game scene management method and device | |
CN107463448A (en) | A kind of deep learning weight renewing method and system | |
CN104125293B (en) | A kind of Cloud Server and its application method | |
CN107391402A (en) | A kind of data operating method, device and a kind of data operation card | |
CN103533090A (en) | Mapping method and device for simulating single physical network port into multiple logical network ports | |
CN106776023A (en) | A kind of self adaptation GPU unifications dyeing array task load equalization methods |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 2020-07-23
Address after: No. 1 Guanpu Road, Guoxiang Street, Wuzhong Economic Development Zone, Suzhou City, Jiangsu Province, 215100
Applicant after: SUZHOU LANGCHAO INTELLIGENT TECHNOLOGY Co.,Ltd.
Address before: Room 1601, 16th Floor, No. 278 Xinyi Road, Zhengdong New District, Zhengzhou City, Henan Province, 450018
Applicant before: ZHENGZHOU YUNHAI INFORMATION TECHNOLOGY Co.,Ltd.
GR01 | Patent grant | ||