CN112835772A

CN112835772A - Coarse-grained calculation acceleration ratio evaluation method and system under heterogeneous hardware environment

Info

Publication number: CN112835772A
Application number: CN201911164810.XA
Authority: CN
Inventors: 汤文莉
Original assignee: Nanjing Institute of Industry Technology
Current assignee: Nanjing Institute of Industry Technology
Priority date: 2019-11-25
Filing date: 2019-11-25
Publication date: 2021-05-25

Abstract

A coarse-grained calculation acceleration ratio evaluation method and system for a heterogeneous hardware environment evaluate whether a calculation will be accelerated before the calculation is actually executed, which avoids the expense of computing first and measuring afterwards. The acceleration ratio is evaluated automatically and in real time according to the computational context, which is more accurate and more efficient than relying on experience and measurement. Calculation is scheduled dynamically: a module that would be accelerated runs on the GPU, and a module that would not be accelerated stays on the CPU, so that the computing capacity of the heterogeneous hardware is used to the fullest and the system achieves its best performance.

Description

Coarse-grained calculation acceleration ratio evaluation method and system under heterogeneous hardware environment

Technical Field

The invention belongs to the field of high-performance computing and relates to a method for evaluating the calculation acceleration ratio in a heterogeneous hardware computing environment. More particularly, it relates to a method for quickly evaluating the acceleration ratio of the same code on CPU and GPU hardware in a mixed CPU and GPU computing scenario.

Background

With the development of high-performance computing technology, computing equipment contains more and more coprocessors besides the CPU, such as GPUs, FPGAs, and embedded accelerator cards. These coprocessors can accelerate conventional CPU-based programs and thereby improve the overall computing performance of a business system. In high-performance parallel computing, however, rewriting CPU code as, for example, GPU code does not necessarily produce an acceleration. Because of factors such as the added overhead of data copying and the GPU computation scheduling mechanism, a multi-threaded program running on the CPU often runs slower, not faster, after it is converted into a parallel program running on the GPU. Existing ways of estimating calculation performance depend mainly on manual experience and on measuring actual test results after the code has been migrated. If the effect of migrating a program could be estimated in advance, a great deal of unnecessary work could be avoided.

Disclosure of Invention

In view of the above situation, the present invention provides a coarse-grained calculation acceleration ratio evaluation method that addresses these problems. Using the method, a system can rapidly judge whether migrating a given module from the CPU to the GPU yields an acceleration, and approximately how large the acceleration ratio is.

The invention discloses a coarse-grained calculation acceleration ratio evaluation method and system for a heterogeneous hardware environment. Unlike existing evaluation approaches, it estimates the acceleration effect quickly and automatically through an algorithm, and the quantized acceleration ratio allows calculation to be scheduled more efficiently. It is a means of evaluating calculation effect.

The method comprises the following steps:

Step 1, acquiring the basic attributes of the heterogeneous hardware and the calculation type of each calculation module.

Step 2, selecting an evaluation algorithm according to the calculation type of the calculation module, that is, whether its calculation amount is in a linear or an exponential relation to its data amount: a linear evaluation algorithm for the former and an exponential evaluation algorithm for the latter. The acceleration ratio of the calculation module is then estimated by combining the specific parameters of the computational context.

The acceleration ratio in this step is calculated as follows. The acceleration ratio is $N = T(\mathrm{CPU}) / T(\mathrm{GPU})$, where $T(\mathrm{CPU})$ is the time the algorithm takes on the CPU and $T(\mathrm{GPU})$ is the time it takes on the GPU, with

$$T(\mathrm{GPU}) = \frac{S(\mathrm{InData} + \mathrm{OutData})}{\text{bus IO speed (PCIE)}} + \frac{T(\mathrm{CPU})}{M}$$

Here $S(\mathrm{InData} + \mathrm{OutData})$ is the total amount of data IO and $M$ is the parallelism. The difference between the linear evaluation algorithm and the exponential evaluation algorithm lies mainly in the relationship between the calculation amount and the data amount. For example, if computing 100M of data on the CPU takes t seconds, then in a linear relationship $T(\mathrm{CPU})$ grows in proportion to the data amount, whereas in an exponential relationship (generally a square relationship) $T(\mathrm{CPU})$ grows with the square of the data amount.

Step 3, according to the acceleration ratio result: if the acceleration ratio is greater than 1, migrating the calculation from the CPU to the GPU has an acceleration effect, and the calculation can be migrated from the CPU to the GPU. If the acceleration ratio is less than or equal to 1, migrating the calculation from the CPU to the GPU has no acceleration effect.
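As an illustration only, the three steps above can be sketched in Python as follows. The function and parameter names (`estimate_acceleration_ratio`, `choose_device`, `io_gb`, `bus_gb_per_s`) are hypothetical and do not come from the disclosure. The sketch assumes times in seconds and data amounts in GB, and it treats a ratio of exactly 1 as no acceleration.

```python
def estimate_acceleration_ratio(t_cpu_s, io_gb, bus_gb_per_s, parallelism):
    """Coarse estimate N = T(CPU) / T(GPU).

    T(GPU) is modelled as bus transfer time plus the CPU time divided
    by the parallelism M, as in the formula of step 2.
    """
    if t_cpu_s <= 0 or bus_gb_per_s <= 0 or parallelism <= 0:
        raise ValueError("time, bus speed and parallelism must be positive")
    if io_gb < 0:
        raise ValueError("data IO amount cannot be negative")
    t_gpu_s = io_gb / bus_gb_per_s + t_cpu_s / parallelism
    return t_cpu_s / t_gpu_s


def choose_device(ratio):
    """Step 3: migrate to the GPU only when the ratio exceeds 1."""
    return "gpu" if ratio > 1.0 else "cpu"
```

For example, `choose_device(estimate_acceleration_ratio(10.0, 1.1, 12.8, 64))` selects the GPU, because ten seconds of CPU work far outweighs the cost of moving 1.1 GB over a 12.8 GB/s bus.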

Compared with the prior art, the invention has the following beneficial effects:

1. Whether a calculation will be accelerated can be evaluated before the calculation is actually carried out, which avoids the overhead of computing first and measuring afterwards every time.

2. The acceleration ratio can be evaluated automatically and in real time from the computational context, which is more accurate and more efficient than relying on experience and measurement.

3. Calculation is scheduled dynamically: modules that would be accelerated run on the GPU, and modules that would not be accelerated stay on the CPU, so the computing capability of the heterogeneous hardware is used to the fullest and the system achieves its best performance.

Drawings

FIG. 1 is a schematic diagram of the linear evaluation model in the embodiment with a parallelism of 64;

FIG. 2 is a schematic diagram of the linear evaluation model in the embodiment at an acceleration ratio of 1, with a parallelism of 64;

FIG. 3 is a schematic diagram of the exponential evaluation model in the embodiment with a data size of 1GB and a parallelism of 64;

FIG. 4 is a schematic diagram of the exponential evaluation model in the embodiment at an acceleration ratio of 1, with a data size of 1GB and a parallelism of 64.

Detailed Description

In this embodiment, the high-performance computing server uses an Intel E5-2600 CPU and an NVIDIA GTX 1080Ti GPU. The motherboard follows the PCI-E 3.0 specification and the GPU sits in a PCIE 8X slot, with a theoretical transmission speed of 16 GB/s and a measured speed of 12.8 GB/s. The embodiment uses a database system in which every low-level operator has both a CPU and a GPU implementation. For each operation, the system must decide whether to use the CPU operator or the GPU operator according to the acceleration ratio calculated from the runtime context.

This embodiment uses two SQL statements. SQL statement 1: select from Data_100G where id_int < 10. SQL statement 2: select a.id_int from Data_100G a join Data_100M b on a.id_int. The main calculation of statement 1 is the where comparison, whose calculation amount is linear in the data amount. The main calculation of statement 2 is the join, whose calculation amount is in an exponential (square) relation to the data amount.

The method comprises the following specific implementation steps:

Step 1, the database system, which supports CPU/GPU heterogeneous computing operators, acquires the parameters of the current hardware environment, such as the PCIE bus transmission efficiency and the performance ratio of a single CPU core to a single GPU core. These parameters can be configured manually or collected automatically by a program. In this embodiment, the actual transmission speed of the PCIE 3.0 bus is 12.8 GB/s, and the performance ratio of a single CPU core to a single GPU core is 1.

At the same time, the system obtains the calculation category of each calculation module (in a database system, the low-level database operators). For example, the filter operator belongs to the linear operator category and the join operator belongs to the exponential operator category. The categories are set manually in advance according to each operator's computational characteristics, that is, the relationship between its data amount and its calculation amount.
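One possible way to hold the inputs gathered in step 1 is sketched below. The names `HardwareProfile`, `Category`, `OPERATOR_CATEGORY`, and `category_of` are hypothetical. The values are the ones stated for this embodiment, and the operator table stands in for the manually configured categories.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    LINEAR = "linear"            # calculation amount proportional to data amount
    EXPONENTIAL = "exponential"  # calculation amount grows as the square


@dataclass(frozen=True)
class HardwareProfile:
    bus_gb_per_s: float     # measured PCIE transfer speed
    core_perf_ratio: float  # single CPU core vs single GPU core


# Values stated for this embodiment.
EMBODIMENT_HW = HardwareProfile(bus_gb_per_s=12.8, core_perf_ratio=1.0)

# Manually configured operator categories (only the two named in the text).
OPERATOR_CATEGORY = {
    "filter": Category.LINEAR,
    "join": Category.EXPONENTIAL,
}


def category_of(operator_name):
    """Look up the preconfigured category of a low-level operator."""
    key = operator_name.lower()
    if key not in OPERATOR_CATEGORY:
        raise KeyError(f"no category configured for operator {operator_name!r}")
    return OPERATOR_CATEGORY[key]
```

Keeping the categories in a table mirrors the statement that they are set manually in advance; an operator without an entry raises an error instead of being guessed.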

Step 2, a specific acceleration ratio evaluation model is generated according to the software and hardware context. In this embodiment the output data is 10% of the size of the input data D, so the total data IO is 1.1 × D. Substituting this and the parameters above gives the acceleration ratio

$$N = \frac{T}{1.1 \times D / 12.8 + T / m}$$

where T is the time the algorithm takes on the CPU, D is the size of the input data in GB, m is the GPU parallelism, and t is the time in seconds that the CPU takes for every 100M of data.

The linear evaluation algorithm is as follows. Because $T = t \times D / 0.1 = 10tD$,

$$N = \frac{10tD}{1.1D / 12.8 + 10tD / m} = \frac{10t}{1.1 / 12.8 + 10t / m} = \frac{1}{1.1 / (128t) + 1 / m}$$

where t ranges from 0.001 s to 10 s and m ranges from 64 to 1024.

The exponential evaluation algorithm is as follows. Because $T = t \times (D / 0.1)^2 = 100tD^2$,

$$N = \frac{100tD^2}{1.1D / 12.8 + 100tD^2 / m} = \frac{100t}{1.1 / (12.8D) + 100t / m} = \frac{1}{1.1 / (1280Dt) + 1 / m}$$

where t ranges from 0.001 s to 10 s, m ranges from 64 to 1024, and D ranges from 0.1 GB to 16 GB.
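The two closed forms can be checked numerically against the general model. The sketch below is only a consistency check under this embodiment's parameters (12.8 GB/s bus, output equal to 10% of input); all function names are hypothetical.

```python
import math

BUS_GB_PER_S = 12.8  # measured PCIE speed in this embodiment
IO_FACTOR = 1.1      # input plus 10% output


def general_ratio(t_cpu_s, d_gb, m):
    """N = T / (1.1 * D / 12.8 + T / m) with T given directly."""
    return t_cpu_s / (IO_FACTOR * d_gb / BUS_GB_PER_S + t_cpu_s / m)


def linear_ratio(t, m):
    """Closed form for T = 10 * t * D; D cancels out."""
    return 1.0 / (1.1 / (128.0 * t) + 1.0 / m)


def exponential_ratio(t, d_gb, m):
    """Closed form for T = 100 * t * D**2."""
    return 1.0 / (1.1 / (1280.0 * d_gb * t) + 1.0 / m)


def closed_forms_agree():
    """Compare both closed forms with the general model on a small grid."""
    for t in (0.001, 0.1, 10.0):
        for d in (0.1, 1.0, 16.0):
            for m in (64, 1024):
                lin = general_ratio(10.0 * t * d, d, m)
                exp = general_ratio(100.0 * t * d * d, d, m)
                if not math.isclose(linear_ratio(t, m), lin):
                    return False
                if not math.isclose(exponential_ratio(t, d, m), exp):
                    return False
    return True
```

The grid covers the end points of the stated ranges of t, D, and m. In the linear case the result does not depend on D, which is why `linear_ratio` takes no data size.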

Step 3, when the database system receives an SQL statement, it parses and decomposes the statement into an SQL physical execution plan composed of low-level SQL operators; generating such a plan is a mature technology. In this embodiment, the main computing operator obtained from SQL statement 1 is the filter operator, and the main computing operator obtained from SQL statement 2 is the join operator.

Step 4, when the database system processes SQL statement 1, the filter operator belongs to the linear category, so the linear evaluation algorithm is used. The parallelism m is set to 64; this parameter can be adjusted to the specific situation. As shown in FIG. 1, the limiting acceleration ratio of the linear model in this embodiment is about 60. To decide whether there is any acceleration, what matters is whether the acceleration ratio is greater than 1. In this embodiment, the filter operator of SQL statement 1 takes 0.005 seconds on the CPU per 100M of data (this figure is a statistical result obtained by a general, mature method). As shown in FIG. 2, the evaluation algorithm above gives N = 1/(1.1/(128 × 0.005) + 1/64) ≈ 0.576. Because N < 1, the GPU operator has no acceleration effect in this case, so the database system continues to use the CPU filter operator to complete the calculation.
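Setting N = 1 in the linear model and solving for t gives the per-100M CPU time at which the GPU starts to pay off, t = 1.1 / (128 × (1 − 1/m)). The sketch below computes that threshold; it is derived here from the formula above and is not stated in the original, and the function names are hypothetical.

```python
def linear_ratio(t, m):
    """Linear model of this embodiment: N = 1 / (1.1 / (128 t) + 1 / m)."""
    return 1.0 / (1.1 / (128.0 * t) + 1.0 / m)


def linear_break_even_t(m):
    """Per-100M CPU time (seconds) at which the linear model gives N = 1."""
    if m <= 1:
        raise ValueError("a parallelism of 1 or less can never reach N = 1")
    return 1.1 / (128.0 * (1.0 - 1.0 / m))


def filter_is_accelerated(t, m=64):
    """True when a linear operator with per-100M time t would gain on the GPU."""
    return t > linear_break_even_t(m)
```

With m = 64 the threshold is a little under 0.009 seconds per 100M, so the filter operator's 0.005 seconds falls below it, which agrees with the evaluated ratio of about 0.576.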

Step 5, when the database system processes SQL statement 2, the join operator belongs to the exponential category, so the exponential evaluation algorithm is used. Because the video memory of the NVIDIA 1080Ti is 11G, the size of the data D sent to the GPU for calculation is set to 1GB and the parallelism m is set to 64; these parameters can be adjusted to the specific situation. As shown in FIG. 3, the limiting acceleration ratio of the exponential model in this embodiment is between 60 and 70. To decide whether there is any acceleration, what matters is whether the acceleration ratio is greater than 1. In this embodiment, the join operator of SQL statement 2 takes 0.2 seconds on the CPU per 100M of data (this figure is a statistical result obtained by a general, mature method). As shown in FIG. 4, the evaluation algorithm above gives N = 1/(1.1/(1280 × 0.2) + 1/64) ≈ 50.19. Because N > 1, the GPU operator has a good acceleration effect, so the database system uses the GPU join operator to complete the calculation.
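Steps 4 and 5 together amount to a small dispatch rule. The following sketch reproduces the two decisions of this embodiment; `pick_operator` and its string categories are hypothetical names, and the defaults m = 64 and D = 1 GB are the values chosen above.

```python
def linear_ratio(t, m):
    """Linear model: N = 1 / (1.1 / (128 t) + 1 / m)."""
    return 1.0 / (1.1 / (128.0 * t) + 1.0 / m)


def exponential_ratio(t, d_gb, m):
    """Exponential (square) model: N = 1 / (1.1 / (1280 D t) + 1 / m)."""
    return 1.0 / (1.1 / (1280.0 * d_gb * t) + 1.0 / m)


def pick_operator(category, t, m=64, d_gb=1.0):
    """Return (device, ratio) for an operator of the given category.

    t is the measured CPU time in seconds per 100M of data.
    """
    if category == "linear":
        ratio = linear_ratio(t, m)
    elif category == "exponential":
        ratio = exponential_ratio(t, d_gb, m)
    else:
        raise ValueError(f"unknown category: {category!r}")
    device = "gpu" if ratio > 1.0 else "cpu"
    return device, ratio


if __name__ == "__main__":
    # Filter of SQL statement 1 and join of SQL statement 2.
    print(pick_operator("linear", 0.005))
    print(pick_operator("exponential", 0.2))
```

Under these inputs the filter operator stays on the CPU and the join operator goes to the GPU, matching the outcome described for the two SQL statements.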

In particular, the invention concerns a coarse-grained calculation acceleration ratio evaluation method in a heterogeneous hardware environment. Its implementation is not limited to a standalone algorithm; it can also be embedded in a hardware system, and the basic principle remains the same.

The present invention may of course have other embodiments. Those skilled in the art may make various changes and modifications according to the invention, for example applying it to big data processing systems or to AI hybrid computing scenarios, without departing from its spirit and scope, and such changes and modifications shall fall within the protection scope of the appended claims.

Claims

1. A coarse-grained calculation acceleration ratio evaluation method and system in a heterogeneous hardware environment, characterized in that the method comprises the following steps:

step 1, acquiring the basic attributes of the heterogeneous hardware and the calculation type of each calculation module;

step 2, selecting an evaluation algorithm according to the calculation type of the calculation module, that is, whether its calculation amount is in a linear or an exponential relation to its data amount: a linear evaluation algorithm for the former and an exponential evaluation algorithm for the latter; and estimating the acceleration ratio of the calculation module by combining the specific parameters of the computational context;

the acceleration ratio in this step is calculated as follows: the acceleration ratio is $N = T(\mathrm{CPU}) / T(\mathrm{GPU})$, where $T(\mathrm{CPU})$ is the time the algorithm takes on the CPU and $T(\mathrm{GPU})$ is the time it takes on the GPU, with $T(\mathrm{GPU}) = S(\mathrm{InData} + \mathrm{OutData}) / \text{bus IO speed (PCIE)} + T(\mathrm{CPU}) / M$, where $S(\mathrm{InData} + \mathrm{OutData})$ is the total amount of data IO and $M$ is the parallelism; the difference between the linear evaluation algorithm and the exponential evaluation algorithm lies mainly in the relationship between the calculation amount and the data amount; for example, if computing 100M of data on the CPU takes t seconds, then in a linear relationship $T(\mathrm{CPU})$ grows in proportion to the data amount, whereas in an exponential relationship (generally a square relationship) $T(\mathrm{CPU})$ grows with the square of the data amount;

step 3, according to the acceleration ratio result: if the acceleration ratio is greater than 1, migrating the calculation from the CPU to the GPU has an acceleration effect, and the calculation can be migrated from the CPU to the GPU; if the acceleration ratio is less than or equal to 1, migrating the calculation from the CPU to the GPU has no acceleration effect.