CN110415160B - GPU (graphics processing Unit) topology partitioning method and device - Google Patents

GPU (graphics processing Unit) topology partitioning method and device

Info

Publication number
CN110415160B
CN110415160B
Authority
CN
China
Prior art keywords
gpus
gpu
partition
point
topology
Prior art date
Legal status
Active
Application number
CN201910580776.8A
Other languages
Chinese (zh)
Other versions
CN110415160A (en)
Inventor
王德奎
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN201910580776.8A priority Critical patent/CN110415160B/en
Publication of CN110415160A publication Critical patent/CN110415160A/en
Application granted granted Critical
Publication of CN110415160B publication Critical patent/CN110415160B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356 Indirect interconnection networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Design And Manufacture Of Integrated Circuits (AREA)

Abstract

The invention discloses a GPU topology partitioning method and device, comprising the following steps: determining the interconnection bandwidth among multiple GPUs according to their physical topology information, and generating a GPU topology graph comprising the multiple GPUs; randomly dividing the GPUs in the topology graph into two partitions; calculating the migration gain of every GPU in the topology graph, migrating the GPU with the highest migration gain from the partition containing more GPUs into the partition containing fewer GPUs, calculating the number of cross-partition connections of the current partitioning scheme, and removing the migrated GPU from the topology graph; and repeating the above steps until all GPUs have been removed from the topology graph, then selecting the partitioning scheme with the fewest cross-partition connections as the partitioning result. The method can optimize the topological partitioning of GPUs from the bottom layer in a targeted manner according to the different connection relationships among the GPUs, reduce the time spent on transmission between GPUs, and improve the speed of artificial intelligence computation.

Description

GPU (graphics processing Unit) topology partitioning method and device
Technical Field
The present invention relates to the field of computers, and more particularly, to a method and an apparatus for GPU topology partitioning.
Background
In the fields of high-performance computing and artificial intelligence, GPUs are often used for computational acceleration. The GPU is deployed at large scale thanks to its powerful computing capability and low power consumption; in particular, in the recently popular field of artificial intelligence, most model training runs on GPUs, which saves a large amount of computing time and thereby accelerates model iteration. Because GPUs are expensive, more and more artificial intelligence developers want to fully improve GPU resource utilization and extract the maximum value from limited GPU resources. However, most artificial intelligence developers lack knowledge of the underlying GPU hardware; in the prior art, communication between GPUs is inefficient for lack of low-level optimization, and GPU partitioning lacks orderliness, which slows down artificial intelligence computation.
For the problem in the prior art that GPU partitioning lacks orderliness and thereby slows down artificial intelligence computation, no effective solution has been proposed so far.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a GPU topology partitioning method and apparatus that can optimize the topological partitioning of GPUs from the bottom layer in a targeted manner according to the different connection relationships among GPUs, reduce the time spent on transmission between GPUs, and improve the speed of artificial intelligence computation.
Based on the above object, a first aspect of the embodiments of the present invention provides a GPU topology partitioning method, comprising the following steps:
determining the interconnection bandwidth among multiple GPUs according to their physical topology information, and generating a GPU topology graph comprising the multiple GPUs;
randomly dividing the multiple GPUs in the GPU topology graph into two partitions;
calculating the migration gain of every GPU in the GPU topology graph, migrating the GPU with the highest migration gain from the partition containing more GPUs into the partition containing fewer GPUs, calculating the number of cross-partition connections of the current partitioning scheme, and removing the migrated GPU from the GPU topology graph;
repeating the above steps until all GPUs have been removed from the GPU topology graph, and selecting the partitioning scheme with the fewest cross-partition connections as the partitioning result.
In some embodiments, the physical topology information includes the connection relationships among the multiple GPUs; the GPUs may be connected through one or more of the following at the same time: NVLink, PCIe bus, PCIe switch, PCIe host bridge, QPI.
In some embodiments, determining the interconnection bandwidth among the multiple GPUs from their physical topology information comprises: determining the rate at which the GPUs transmit information to one another according to the connection relationships among them.
In some embodiments, generating a GPU topology graph comprising the multiple GPUs comprises:
taking the multiple GPUs as multiple points;
taking the connection relationships among the GPUs as multiple edges;
taking the interconnection bandwidth among the GPUs as the weights of the edges;
constructing the GPU topology graph from the points, the edges, and the weights of the edges.
In some embodiments, generating a GPU topology graph comprising the multiple GPUs further comprises: taking the computing power of the multiple GPUs as the weights of the points, and constructing the GPU topology graph from the points, the edges, the weights of the points, and the weights of the edges.
In some embodiments, calculating the migration gains of all GPUs in the GPU topology graph comprises:
for each point, determining its migration tendency FS from the weights of the edges connected to the point that cross the partition boundary;
for each point, determining its retention tendency TE from the weights of the edges connected to the point that lie within the same partition;
obtaining the migration gain of each point by subtracting its retention tendency TE from its migration tendency FS.
In some embodiments, the method further comprises: in response to the two partitions containing the same number of GPUs, randomly determining one of them, or determining one according to a predetermined rule, as the partition containing more GPUs; in response to two or more GPUs being tied for the highest migration gain, randomly determining one of them, or determining one according to a predetermined rule, as the GPU with the highest migration gain; and in response to two or more partitioning schemes being tied for the smallest number of cross-partition connections, randomly determining one of them, or determining one according to a predetermined rule, as the partitioning scheme with the smallest number of cross-partition connections.
In some embodiments, the method further comprises:
after obtaining the partitioning result, generating a partitioned GPU topology graph for one or more partitions in the result, so as to perform topology partitioning again.
A second aspect of the embodiments of the present invention provides a GPU topology partitioning apparatus, comprising:
a modeling module, configured to determine the interconnection bandwidth among multiple GPUs according to their physical topology information and generate a GPU topology graph comprising the multiple GPUs;
an initialization module, configured to randomly divide the multiple GPUs in the GPU topology graph into two partitions;
an iteration module, configured to calculate the migration gain of every GPU in the GPU topology graph, migrate the GPU with the highest migration gain from the partition containing more GPUs into the partition containing fewer GPUs, calculate the number of cross-partition connections of the current partitioning scheme, and remove the migrated GPU from the GPU topology graph;
and a sorting module, configured to repeat the previous step until all GPUs have been removed from the GPU topology graph, and select the partitioning scheme with the fewest cross-partition connections as the partitioning result.
A third aspect of an embodiment of the present invention provides an artificial intelligence computing device, including:
a plurality of GPUs;
a processor; and
a memory storing program code executable by the processor which, when executed, performs the GPU topology partitioning method described above to partition the multiple GPUs and arrange artificial intelligence computing tasks in units of the partitions.
The invention has the following beneficial technical effects: the GPU topology partitioning method and apparatus determine the interconnection bandwidth among multiple GPUs according to their physical topology information and generate a GPU topology graph comprising the multiple GPUs; randomly divide the GPUs in the topology graph into two partitions; calculate the migration gain of every GPU in the topology graph, migrate the GPU with the highest migration gain from the partition containing more GPUs into the partition containing fewer GPUs, calculate the number of cross-partition connections of the current partitioning scheme, and remove the migrated GPU from the topology graph; and repeat the previous step until all GPUs have been removed, selecting the partitioning scheme with the fewest cross-partition connections as the partitioning result. The topological partitioning of the GPUs can thus be optimized from the bottom layer in a targeted manner according to the different connection relationships among the GPUs, reducing transmission time between GPUs and improving the speed of artificial intelligence computation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a GPU topology partitioning method according to the present invention;
fig. 2 is a schematic diagram of a GPU topology connection relationship in an embodiment of the GPU topology partitioning method provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that share the same name; as can be seen, "first" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and the following embodiments do not explain them again.
In view of the above, a first aspect of the embodiments of the present invention proposes an embodiment of a method that can optimize the topological partitioning of GPUs from the bottom layer in a targeted manner according to the different connection relationships among GPUs. Fig. 1 shows a schematic flowchart of the GPU topology partitioning method provided by the present invention.
The GPU topology partitioning method, as shown in fig. 1, includes the following steps:
Step S101: determining the interconnection bandwidth among multiple GPUs according to their physical topology information, and generating a GPU topology graph comprising the multiple GPUs;
Step S103: randomly dividing the multiple GPUs in the GPU topology graph into two partitions;
Step S105: calculating the migration gain of every GPU in the GPU topology graph, migrating the GPU with the highest migration gain from the partition containing more GPUs into the partition containing fewer GPUs, calculating the number of cross-partition connections of the current partitioning scheme, and removing the migrated GPU from the GPU topology graph;
Step S107: repeating the above steps until all GPUs have been removed from the GPU topology graph, and selecting the partitioning scheme with the fewest cross-partition connections as the partitioning result.
The invention provides a GPU topology partitioning method that models the communication among the multiple GPU cards of a server and uses a partitioning algorithm to achieve GPU topology partitioning, so that the communication bandwidth between different partitions is minimized while the communication bandwidth within each partition is maximized. In practical terms, an artificial intelligence computing workload can then be scheduled within a single partition, so that the time spent transmitting data between GPUs is minimal.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the preceding method embodiments to which it corresponds.
In some embodiments, the physical topology information includes connection relationships between the multiple GPUs; the connection relationship among the GPUs comprises that the GPUs are connected through at least one of the following simultaneously: NVlink, PCIe bus, PCIe switch, PCIe host bridge, QPI.
Obviously, the bandwidth between GPUs differs with the connection mode. In general, the bandwidth of an NVLink connection is highest, and different NVLink configurations provide different bandwidths; the bandwidth of a connection through the PCIe bus and the PCIe host bridge within a NUMA node is relatively low. QPI here refers to interconnection across NUMA nodes over the PCIe bus using the SMP interconnect, whose bandwidth is lower still. The physical topology information may be provided by the driver.
In some embodiments, determining the interconnection bandwidth among the multiple GPUs from their physical topology information means: determining the rate at which the GPUs transmit information to one another according to the connection relationships among them.
In some embodiments, generating a GPU topology graph comprising the multiple GPUs comprises:
taking the multiple GPUs as multiple points;
taking the connection relationships among the GPUs as multiple edges;
taking the interconnection bandwidth among the GPUs as the weights of the edges;
constructing the GPU topology graph from the points, the edges, and the weights of the edges.
In some embodiments, generating a GPU topology graph comprising the multiple GPUs further comprises: taking the computing power of the multiple GPUs as the weights of the points, and constructing the GPU topology graph from the points, the edges, the weights of the points, and the weights of the edges.
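As an illustration of this modeling step, the following is a minimal Python sketch, assuming the per-pair bandwidths have already been obtained; the function and variable names (build_gpu_graph, bandwidth, compute_power) are hypothetical and not taken from the patent:

    # A minimal sketch of the modeling step described above.
    # bandwidth: {(i, j): bw} giving the bandwidth for each GPU pair i < j.
    def build_gpu_graph(bandwidth, compute_power=1.0):
        graph = {}         # adjacency: edge weight = interconnection bandwidth
        point_weight = {}  # point weight = GPU computing power
        for (i, j), bw in bandwidth.items():
            graph.setdefault(i, {})[j] = bw
            graph.setdefault(j, {})[i] = bw
        for gpu in graph:
            point_weight[gpu] = compute_power  # all cards assumed equal here
        return graph, point_weight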
In some embodiments, calculating the migration gain comprises:
for each point, determining its migration tendency FS from the weights of the edges connected to the point that cross the partition boundary;
for each point, determining its retention tendency TE from the weights of the edges connected to the point that lie within the same partition;
obtaining the migration gain of each point by subtracting its retention tendency TE from its migration tendency FS.
The migration tendency FS represents the partition-internal edges the point would gain by migrating; the retention tendency TE represents the partition-internal edges it would lose by migrating. The weights of these edges are taken into account to improve the accuracy of the model.
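Expressed as code, a sketch of the gain computation under the adjacency representation sketched above (an illustration, not the patent's reference implementation); side maps each point to its current partition:

    # Migration gain of point n: FS (total weight of its edges that cross
    # the partition) minus TE (total weight of its edges inside its partition).
    def migration_gain(graph, side, n):
        fs = sum(w for m, w in graph[n].items() if side[m] != side[n])
        te = sum(w for m, w in graph[n].items() if side[m] == side[n])
        return fs - te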
In some embodiments, in response to the two partitions containing the same number of GPUs, one of them is randomly determined, or determined according to a predetermined rule, as the partition containing more GPUs; in response to two or more GPUs being tied for the highest migration gain, one of them is randomly determined, or determined according to a predetermined rule, as the GPU with the highest migration gain; and in response to two or more partitioning schemes being tied for the smallest number of cross-partition connections, one of them is randomly determined, or determined according to a predetermined rule, as the partitioning scheme with the smallest number of cross-partition connections.
In some embodiments, the method further comprises:
after obtaining the partitioning results, a partitioned GPU topology map is generated for one or more partitions in the partitioning results to perform topology partitioning again.
Re-partitioning is suitable for situations where the total number of GPUs is large and many computing tasks run concurrently.
The method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU, which may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention. The above-described method steps and system elements may also be implemented using a controller and a computer-readable storage medium for storing a computer program for causing the controller to implement the functions of the above-described steps or elements.
The detailed embodiments of the present invention are further illustrated below with reference to specific examples.
First, a driver is installed on a physical server configured with multiple GPUs to obtain the physical connection relationships among the GPUs and between the GPUs and the CPUs:
[Connection matrix reported by the driver — presented as an image in the original filing and not reproduced here; its entries use the legend below.]
Here, X represents the GPU or CPU device itself; SYS means the two devices are connected through the PCIe bus plus the SMP interconnect (QPI/UPI) between NUMA nodes; PHB means they are connected through the PCIe bus and a PCIe host bridge; PXB means they are connected through multiple PCIe switches; PIX means they are connected through a single PCIe switch; NODE means they are connected through the PCIe bus and the PCIe host bridges within a NUMA node; NV# means they are connected through NVLink, where # denotes the number of links.
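This legend matches the connection matrix printed by the nvidia-smi topo -m command on NVIDIA systems; a sketch of capturing it programmatically, assuming nvidia-smi is installed and on the PATH (exact labels vary with driver version):

    # A sketch: capture the driver-reported physical topology matrix.
    import subprocess

    def read_physical_topology():
        result = subprocess.run(["nvidia-smi", "topo", "-m"],
                                capture_output=True, text=True, check=True)
        return result.stdout  # matrix of X / SYS / NODE / PHB / PXB / PIX / NV#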
From the obtained physical connections, the topological connection graph shown in fig. 2 can be determined, in which only the higher-bandwidth NVLink connections between GPUs are retained. Since mainly control data of limited size is transmitted between a GPU and a CPU, the embodiment of the present invention focuses only on connections between GPUs and not on GPU-CPU connections. The embodiment further uses the p2pBandwidthLatencyTest tool to measure the bandwidth between each pair of GPUs, shown in the following table (bandwidths as reported by the test, in GB/s):
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0   743    96    96    48    48    19    18    19
GPU1    96   744    48    48    19    96    18    19
GPU2    96    48   746    96    18    18    48    18
GPU3    48    48    96   747    18    19    18    96
GPU4    48    19    18    19   754    96    96    48
GPU5    18    96    18    18    96   744    48    48
GPU6    18    18    48    18    96    48   749    96
GPU7    19    19    18    96    48    48    96   745
In this embodiment, the computing power of every GPU card is assumed by default to be the same, i.e. the point weights are all 1.
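To connect the measured matrix to the data layout that follows, here is a sketch that emits one record per GPU pair from an 8x8 bandwidth matrix (the function name is illustrative); each record carries the weight, the endpoint count of 2, and the two GPU numbers, matching the 28 connection lines below:

    # Turn the measured bandwidth matrix into per-pair connection records.
    def matrix_to_edges(bw):
        n = len(bw)
        return [(bw[i][j], 2, i, j)           # weight, endpoint count, GPUs
                for i in range(n) for j in range(i + 1, n)]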
A partitioning scheme is determined from the GPU topology graph using the Fiduccia-Mattheyses-Sanchis algorithm. The GPU topology graph may be expressed in a plain-data form as follows:
8
28
96 2 0 1
96 2 0 2
48 2 0 3
48 2 0 4
19 2 0 5
18 2 0 6
19 2 0 7
48 2 1 2
48 2 1 3
19 2 1 4
96 2 1 5
18 2 1 6
19 2 1 7
96 2 2 3
18 2 2 4
18 2 2 5
48 2 2 6
18 2 2 7
18 2 3 4
19 2 3 5
18 2 3 6
48 2 3 7
96 2 4 5
48 2 4 6
48 2 4 7
48 2 5 6
48 2 5 7
96 2 6 7
1
1
1
1
1
1
1
1
The first line indicates that 8 GPUs are used in this embodiment; the second line gives the number of GPU-to-GPU connections, 8 interconnected GPUs yielding 28 connections. Lines 3 to 30 describe each connection: the first column is the weight (i.e., the bandwidth between the GPUs), the second column is the number of endpoints (the number of points per edge, uniformly 2 for the GPU interconnections of this embodiment), the third column is the number of the first GPU of the connection, and the fourth column is the number of the second GPU. Lines 31 to 38 are the weights of each GPU, all 1 in this embodiment.
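A sketch of a reader for this layout (the format is as described above; the parser itself is an illustration, not part of the patent):

    # Parse the plain-data topology: node count, edge count, one line per
    # connection ("weight endpoint-count endpoints..."), then one weight per node.
    def parse_topology(text):
        tokens = iter(text.split())
        n_gpus, n_edges = int(next(tokens)), int(next(tokens))
        graph = {i: {} for i in range(n_gpus)}
        for _ in range(n_edges):
            weight = int(next(tokens))
            n_ends = int(next(tokens))       # uniformly 2 for GPU-to-GPU links
            ends = [int(next(tokens)) for _ in range(n_ends)]
            a, b = ends[0], ends[-1]
            graph[a][b] = weight
            graph[b][a] = weight
        point_weights = [int(next(tokens)) for _ in range(n_gpus)]
        return graph, point_weights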
The Fiduccia-Mattheyses-Sanchis algorithm is an improvement on the Fiduccia-Mattheyses algorithm. The traditional Fiduccia-Mattheyses algorithm supports only two partitions, while the Fiduccia-Mattheyses-Sanchis variant used by the embodiment of the present invention supports multiple partitions, giving the embodiment better extensibility.
The Fiduccia-Mattheyses-Sanchis algorithm proceeds as follows:
firstly, a plurality of GPUs in a GPU topological graph are randomly divided into two partitions, namely 4 in each partition.
The migration gains for all GPUs in the GPU topology are then calculated. The migration gain is the migration tendency FS minus the retention tendency TE; for each point, the migration tendency FS is according to the respective weights of the edges connected to the point and the edges connected to the point across the partition, and the retention tendency TE is according to the respective weights of the edges connected to the point and the edges connected to the point within the same partition.
After calculating the migration gains of all the points, the point with the highest migration gain in the partition including more points is migrated into the partition including fewer points, the number of cross-partition connections of the current partition scheme is calculated, and the migrated point is removed from the GPU topological graph.
And recalculating the migration gain of the rest points according to the situation after the point migration, and repeating the steps to obtain a partition scheme until all the points are migrated. At this time, the smallest number of connections across partitions of the partitioning scheme is counted as the partitioning result.
The original filing presents a pseudo-code implementation of the above algorithm as an image, which is not reproduced here.
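In its place, the following is a minimal runnable sketch of one pass of the two-way partitioning loop described above, with FS and TE as defined earlier; the tie-breaking (arbitrary here, which the embodiments permit either randomly or by rule) and all helper names are assumptions, not the patent's reference code:

    import random

    # One pass of the two-way partitioning loop. graph is a symmetric
    # adjacency dict {gpu: {neighbor: bandwidth}} with integer GPU numbers.
    def partition_pass(graph):
        gpus = list(graph)
        random.shuffle(gpus)                   # random initial bipartition
        side = {g: int(i >= len(gpus) // 2) for i, g in enumerate(gpus)}

        def gain(n):                           # migration tendency FS minus TE
            fs = sum(w for m, w in graph[n].items() if side[m] != side[n])
            te = sum(w for m, w in graph[n].items() if side[m] == side[n])
            return fs - te

        def cross_connections(s):              # edges crossing the partition
            return sum(1 for a in graph for b in graph[a]
                       if a < b and s[a] != s[b])

        remaining, best = set(gpus), None
        while remaining:
            counts = [list(side.values()).count(p) for p in (0, 1)]
            src = 0 if counts[0] >= counts[1] else 1       # larger partition
            movable = [g for g in remaining if side[g] == src] or list(remaining)
            mover = max(movable, key=gain)                 # highest migration gain
            side[mover] = 1 - side[mover]                  # migrate across partitions
            cut = cross_connections(side)
            if best is None or cut < best[0]:
                best = (cut, dict(side))                   # best scheme so far
            remaining.discard(mover)                       # remove migrated GPU
        return best  # (fewest cross-partition connections, partition assignment)

For example, cut, assignment = partition_pass(graph) on the 8-GPU graph parsed above returns the assignment with the fewest cross-partition connections encountered during the pass.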
As can be seen from the foregoing embodiment, the GPU topology partitioning method provided by the embodiments of the present invention determines the interconnection bandwidth among multiple GPUs according to their physical topology information and generates a GPU topology graph comprising the multiple GPUs; randomly divides the GPUs in the topology graph into two partitions; calculates the migration gain of every GPU in the topology graph, migrates the GPU with the highest migration gain from the partition containing more GPUs into the partition containing fewer GPUs, calculates the number of cross-partition connections of the current partitioning scheme, and removes the migrated GPU from the topology graph; and repeats the previous step until all GPUs have been removed, selecting the partitioning scheme with the fewest cross-partition connections as the partitioning result. The topological partitioning of the GPUs can thus be optimized from the bottom layer in a targeted manner according to the different connection relationships among the GPUs, reducing transmission time between GPUs and improving the speed of artificial intelligence computation.
It should be noted that the steps in the embodiments of the GPU topology partitioning method may be interchanged, replaced, added, or deleted; such reasonable permutations, combinations, and transformations also belong to the scope of the present invention, and the scope of protection should not be limited to the described embodiments.
In view of the above, a second aspect of the embodiments of the present invention proposes an embodiment of an apparatus that can optimize the topological partitioning of GPUs from the bottom layer in a targeted manner according to the different connection relationships among GPUs. The GPU topology partitioning apparatus comprises:
a modeling module, configured to determine the interconnection bandwidth among multiple GPUs according to their physical topology information and generate a GPU topology graph comprising the multiple GPUs;
an initialization module, configured to randomly divide the multiple GPUs in the GPU topology graph into two partitions;
an iteration module, configured to calculate the migration gain of every GPU in the GPU topology graph, migrate the GPU with the highest migration gain from the partition containing more GPUs into the partition containing fewer GPUs, calculate the number of cross-partition connections of the current partitioning scheme, and remove the migrated GPU from the GPU topology graph;
and a sorting module, configured to repeat the previous step until all GPUs have been removed from the GPU topology graph, and select the partitioning scheme with the fewest cross-partition connections as the partitioning result.
The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
In view of the above, a third aspect of the embodiments of the present invention proposes an embodiment of an artificial intelligence computing device that can optimize the topological partitioning of GPUs from the bottom layer in a targeted manner according to the different connection relationships among GPUs. The artificial intelligence computing device comprises:
a plurality of GPUs;
a processor; and
a memory storing program code executable by the processor which, when executed, performs the GPU topology partitioning method described above to partition the multiple GPUs and arrange artificial intelligence computing tasks in units of the partitions.
It can be seen from the foregoing embodiments that the GPU topology partitioning apparatus and the artificial intelligence computing device provided by the embodiments of the present invention determine the interconnection bandwidth among multiple GPUs according to their physical topology information and generate a GPU topology graph comprising the multiple GPUs; randomly divide the GPUs in the topology graph into two partitions; calculate the migration gain of every GPU in the topology graph, migrate the GPU with the highest migration gain from the partition containing more GPUs into the partition containing fewer GPUs, calculate the number of cross-partition connections of the current partitioning scheme, and remove the migrated GPU from the topology graph; and repeat the previous step until all GPUs have been removed, selecting the partitioning scheme with the fewest cross-partition connections as the partitioning result. The topological partitioning of the GPUs can thus be optimized from the bottom layer in a targeted manner according to the different connection relationships among the GPUs, reducing transmission time between GPUs and improving the speed of artificial intelligence computation.
It should be particularly noted that the above embodiments of the GPU topology partitioning apparatus and the artificial intelligence computing device use the embodiments of the GPU topology partitioning method to describe the working process of each module, and those skilled in the art can readily apply these modules to other embodiments of the method. Of course, since the steps in the method embodiments can be interchanged, replaced, added, and deleted, such reasonable permutations, combinations, and transformations of the apparatus and the computing device also belong to the scope of the present invention, and the scope of protection should not be limited to the described embodiments.
The foregoing are exemplary embodiments of the present disclosure, but it should be noted that various changes and modifications can be made without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims according to the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the present invention, including the claims, is limited to these examples; within the idea of the embodiments of the invention, technical features of the above embodiments or of different embodiments may also be combined, and many other variations of the different aspects described above exist, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the embodiments of the present invention shall be included in the scope of protection of the embodiments of the present invention.

Claims (8)

1. A GPU topology partitioning method, characterized by comprising the following steps:
determining the interconnection bandwidth among multiple GPUs according to the physical topology information of the multiple GPUs, and generating a GPU topology graph comprising the multiple GPUs, wherein generating the GPU topology graph comprising the multiple GPUs comprises: taking the multiple GPUs as multiple points, taking the connection relationships among the GPUs as multiple edges, taking the interconnection bandwidth among the GPUs as the weights of the edges, and constructing the GPU topology graph from the points, the edges, and the weights of the edges;
randomly dividing the multiple GPUs in the GPU topology graph into two partitions;
calculating the migration gains of all GPUs in the GPU topology graph, migrating the GPU with the highest migration gain from the partition containing more GPUs into the partition containing fewer GPUs, calculating the number of cross-partition connections of the current partitioning scheme, and removing the migrated GPU from the GPU topology graph, wherein calculating the migration gains of all GPUs in the GPU topology graph comprises: for each point, determining its migration tendency FS from the weights of the edges connected to the point that cross the partition boundary; for each point, determining its retention tendency TE from the weights of the edges connected to the point that lie within the same partition; and obtaining the migration gain of each point by subtracting its retention tendency TE from its migration tendency FS;
repeating the previous step until all GPUs have been removed from the GPU topology graph, and selecting the partitioning scheme with the smallest number of cross-partition connections as the partitioning result.
2. The method according to claim 1, wherein the physical topology information includes the connection relationships among the multiple GPUs; the GPUs may be connected through one or more of the following at the same time: NVLink, PCIe bus, PCIe switch, PCIe host bridge, QPI.
3. The method according to claim 2, wherein determining the interconnection bandwidth among the multiple GPUs from the physical topology information of the multiple GPUs comprises: determining the rate at which the GPUs transmit information to one another according to the connection relationships among them.
4. The method according to claim 1, wherein generating the GPU topology graph comprising the multiple GPUs further comprises: taking the computing power of the multiple GPUs as the weights of the points, and constructing the GPU topology graph from the points, the edges, the weights of the points, and the weights of the edges.
5. The method according to claim 1, further comprising:
in response to the two partitions containing the same number of GPUs, randomly determining one of them, or determining one according to a predetermined rule, as the partition containing more GPUs;
in response to two or more GPUs being tied for the highest migration gain, randomly determining one of them, or determining one according to a predetermined rule, as the GPU with the highest migration gain;
in response to two or more partitioning schemes being tied for the smallest number of cross-partition connections, randomly determining one of them, or determining one according to a predetermined rule, as the partitioning scheme with the smallest number of cross-partition connections.
6. The method of claim 1, further comprising:
after obtaining the partitioning result, generating a partitioned GPU topology graph for one or more partitions in the partitioning result, so as to perform topology partitioning again.
7. A GPU topology partitioning apparatus, characterized by comprising:
a modeling module, configured to determine the interconnection bandwidth among multiple GPUs according to the physical topology information of the multiple GPUs and generate a GPU topology graph comprising the multiple GPUs, wherein generating the GPU topology graph comprising the multiple GPUs comprises: taking the multiple GPUs as multiple points, taking the connection relationships among the GPUs as multiple edges, taking the interconnection bandwidth among the GPUs as the weights of the edges, and constructing the GPU topology graph from the points, the edges, and the weights of the edges;
an initialization module, configured to randomly divide the multiple GPUs in the GPU topology graph into two partitions;
an iteration module, configured to calculate the migration gains of all GPUs in the GPU topology graph, migrate the GPU with the highest migration gain from the partition containing more GPUs into the partition containing fewer GPUs, calculate the number of cross-partition connections of the current partitioning scheme, and remove the migrated GPU from the GPU topology graph, wherein calculating the migration gains of all GPUs in the GPU topology graph comprises: for each point, determining its migration tendency FS from the weights of the edges connected to the point that cross the partition boundary; for each point, determining its retention tendency TE from the weights of the edges connected to the point that lie within the same partition; and obtaining the migration gain of each point by subtracting its retention tendency TE from its migration tendency FS;
and a sorting module, configured to repeat the previous step until all GPUs have been removed from the GPU topology graph, and select the partitioning scheme with the smallest number of cross-partition connections as the partitioning result.
8. An artificial intelligence computing device, comprising:
a plurality of GPUs;
a processor; and
a memory storing program code executable by the processor which, when executed, performs the GPU topology partitioning method according to any one of claims 1-6 to partition multiple GPUs and arrange artificial intelligence computing tasks in units of the partitions.
CN201910580776.8A 2019-06-29 2019-06-29 GPU (graphics processing Unit) topology partitioning method and device Active CN110415160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910580776.8A CN110415160B (en) 2019-06-29 2019-06-29 GPU (graphics processing Unit) topology partitioning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910580776.8A CN110415160B (en) 2019-06-29 2019-06-29 GPU (graphics processing Unit) topology partitioning method and device

Publications (2)

Publication Number Publication Date
CN110415160A CN110415160A (en) 2019-11-05
CN110415160B true CN110415160B (en) 2022-06-07

Family

ID=68358547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910580776.8A Active CN110415160B (en) 2019-06-29 2019-06-29 GPU (graphics processing Unit) topology partitioning method and device

Country Status (1)

Country Link
CN (1) CN110415160B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597139B * 2020-05-13 2023-01-06 Suzhou Inspur Intelligent Technology Co Ltd Communication method, system, equipment and medium of GPU
CN111880911A (en) * 2020-06-19 2020-11-03 Inspur Electronic Information Industry Co Ltd Task load scheduling method, device and equipment and readable storage medium
CN111930498B (en) * 2020-06-29 2022-11-29 Suzhou Inspur Intelligent Technology Co Ltd Efficient GPU resource allocation optimization method and system
CN114356818A (en) * 2022-03-17 2022-04-15 Suzhou Inspur Intelligent Technology Co Ltd Multi-channel data transmission method, device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102932175A (en) * 2012-10-29 2013-02-13 华为技术有限公司 Node partition dividing method, device and server
CN103917958A (en) * 2011-11-11 2014-07-09 阿尔卡特朗讯 Distributed mapping function for large scale media clouds
CN108139887A (en) * 2015-10-22 2018-06-08 国际商业机器公司 Across hardware accelerator parallelization matrix decomposition
CN109844722A (en) * 2016-08-12 2019-06-04 利奇得公司 Breakdown fabric switch computing platform


Also Published As

Publication number Publication date
CN110415160A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN110415160B (en) GPU (graphics processing Unit) topology partitioning method and device
CN110618870B (en) Working method and device for deep learning training task
CN107437110B (en) Block convolution optimization method and device of convolutional neural network
CN110362388B (en) Resource scheduling method and device
US20150215379A1 (en) Distributed processing device and distributed processing system as well as distributed processing method
CN110795226B (en) Method for processing task using computer system, electronic device and storage medium
CN114281521A (en) Method, system, device and medium for optimizing communication efficiency of deep learning heterogeneous resources
CN107346350B (en) Distribution method, device and cluster system for integrated circuit layout data processing tasks
JP2021022373A (en) Method, apparatus and device for balancing loads, computer-readable storage medium, and computer program
KR102326586B1 (en) Method and apparatus for processing large-scale distributed matrix product
CN114338506B (en) Neural task on-chip routing method and device of brain-like computer operating system
CN115237580A (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN109412865B (en) Virtual network resource allocation method, system and electronic equipment
CN110750363B (en) Computer storage management method and device, electronic equipment and storage medium
CN116303219A (en) Grid file acquisition method and device and electronic equipment
CN115879543A (en) Model training method, device, equipment, medium and system
CN113988277A (en) Neural network mapping method, device and equipment for storage and computation integrated chip
CN109408242B (en) Server resource online and offline method and device
CN114615146A (en) Software Defined Network (SDN) controller deployment method, device, equipment and storage medium
CN115965070B (en) Computational graph processing method, apparatus, device, storage medium, and program product
CN116805155B (en) LSTM network processing method, device, equipment and readable storage medium
CN109800076B (en) Storage scheduling method and device
EP4009241A1 (en) Arithmetic processing apparatus, arithmetic processing method, and arithmetic processing program
TWI843934B (en) A method and system for processing unstructured source data
CN114726851B (en) Block operation method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant