CN111240844A

CN111240844A - Resource scheduling method, equipment and storage medium

Info

Publication number: CN111240844A
Application number: CN202010032674.5A
Authority: CN
Inventors: 张燕; 夏正勋
Original assignee: Transwarp Technology Shanghai Co Ltd
Current assignee: Transwarp Technology Shanghai Co Ltd
Priority date: 2020-01-13
Filing date: 2020-01-13
Publication date: 2020-06-05

Abstract

The embodiment of the invention discloses a resource scheduling method, equipment and a storage medium. The method comprises the following steps: acquiring a calculation operator to be operated in the distributed calculation cluster; calculating the total calculation time of the distributed CPU according to the CPU cluster characteristic parameters matched with the distributed calculation cluster, the preset single-machine input and output data quantity and the calculation category of the calculation operator; calculating the total calculation time of the distributed GPU according to the GPU single-computer characteristic parameters matched with the distributed calculation clusters, the GPU cluster characteristic parameters, the single-computer input and output data quantity and the calculation types; calculating the GPU acceleration ratio according to the total time consumption calculated by the distributed heterogeneous CPU and the total time consumption calculated by the distributed heterogeneous GPU; if the GPU acceleration ratio is larger than a preset value, determining to run a calculation operator in the GPU; otherwise, determining to run a calculation operator in the CPU. The acceleration effect of the GPU can be evaluated quickly and automatically before algorithm calculation, the running hardware of a calculation operator is determined, and resources are scheduled efficiently.

Description

Resource scheduling method, equipment and storage medium

Technical Field

The embodiment of the invention relates to the technical field of high-performance computing, in particular to a resource scheduling method, resource scheduling equipment and a storage medium.

Background

With the development of high-performance computing technologies, computing devices include, in addition to a Central Processing Unit (CPU), more and more coprocessors, such as a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), an embedded accelerator card, and the like, which can accelerate a conventional program based on CPU computing and improve the overall computing performance of a service system.

Ideally, GPU acceleration is used, whether for single-machine or multi-machine distributed computing. However, in a distributed cluster environment, when an algorithm is migrated from a CPU to a GPU, an actual computation situation is very complex, and in a typical distributed heterogeneous computation process, after work nodes in a distributed cluster are generally computed, results need to be aggregated, sometimes even multiple times. Therefore, there are many fine-grained factors that affect the acceleration effect of the GPU, such as: data Input Output (IO) duration in the CPU or the GPU, network transmission speed between nodes, GPU compression and decompression performance, and the like.

In the prior art, whether the GPU has an acceleration effect in a distributed heterogeneous computing environment is generally measured and evaluated according to manual experience in combination with a result after computation execution after code migration, so that the acceleration effect of the GPU cannot be effectively evaluated before computation, running hardware of a computation operator is determined, and resources cannot be efficiently scheduled.

Disclosure of Invention

The embodiment of the invention provides a resource scheduling method, equipment and a storage medium, which can quickly and automatically evaluate the acceleration effect of a GPU (graphics processing unit) before algorithm calculation, determine the running hardware of a calculation operator and efficiently schedule resources.

In a first aspect, an embodiment of the present invention provides a resource scheduling method, where the method includes:

acquiring a calculation operator to be operated in the distributed calculation cluster;

calculating the total calculation time of the distributed CPU matched with the calculation operator according to the characteristic parameters of the CPU cluster matched with the distributed calculation cluster, the input and output data quantity of a preset single machine and the calculation category of the calculation operator;

calculating the total calculation time of the distributed GPU matched with the calculation operator according to the GPU single-computer characteristic parameters matched with the distributed calculation cluster, the GPU cluster characteristic parameters, the single-computer input and output data quantity and the calculation category;

calculating a GPU acceleration ratio corresponding to the calculation operator according to the total calculation time of the distributed heterogeneous CPUs and the total calculation time of the distributed heterogeneous GPUs;

if the GPU acceleration ratio is larger than a preset value, determining to operate the calculation operator in the GPU; otherwise, determining to run the calculation operator in the CPU.

In a second aspect, an embodiment of the present invention further provides a computer device, including a processor and a memory, where the memory is used to store instructions that, when executed, cause the processor to:

In a third aspect, an embodiment of the present invention further provides a storage medium, where the storage medium is configured to store instructions for performing:

According to the technical scheme of the embodiment of the invention, the calculation operators to be operated in the distributed calculation cluster are obtained; calculating the total time consumption of distributed CPU calculation matched with a calculation operator according to the characteristic parameters of the CPU cluster matched with the distributed calculation cluster, the input and output data quantity of a preset single machine and the calculation category of the calculation operator; calculating the total calculation time of the distributed GPU matched with the calculation operator according to the GPU single-computer characteristic parameters matched with the distributed calculation cluster, the GPU cluster characteristic parameters, the single-computer input and output data quantity and the calculation category; calculating the total time consumption according to the distributed heterogeneous CPU and the distributed heterogeneous GPU, and calculating a GPU acceleration ratio corresponding to a calculation operator; if the GPU acceleration ratio is larger than a preset value, determining to run a calculation operator in the GPU; otherwise, the calculation operator is determined to run in the CPU, the problem that the acceleration effect of the GPU cannot be effectively evaluated before calculation, and the running hardware of the calculation operator cannot be efficiently scheduled is solved, the acceleration effect of the GPU can be quickly and automatically effectively evaluated, the running hardware of the calculation operator can be determined through a quantized acceleration ratio, resources are efficiently scheduled, and the calculation performance is improved.

Drawings

Fig. 1 is a flowchart of a resource scheduling method according to an embodiment of the present invention;

FIG. 2a is a graph showing the relationship between the acceleration ratio of the GPU and the time consumed by hundreds of megabytes of data calculation;

FIG. 2b is a schematic diagram showing the relationship between the GPU speed-up ratio and the time consumed by hundred million data calculations;

FIG. 2c is a graph showing the relationship between the GPU speed-up ratio and the time consumed by hundred million data calculations;

FIG. 2d is a schematic diagram showing a relationship between a GPU speed-up ratio and hundred megabyte data calculation time;

FIG. 2e is a diagram illustrating the relationship between the GPU speed-up ratio and the time consumed by hundred megabyte data calculation;

FIG. 2f is a schematic diagram showing a relationship between a GPU speed-up ratio and hundred mega data calculation time consumption;

fig. 3 is a schematic structural diagram of a resource scheduling apparatus according to a second embodiment of the present invention;

fig. 4 is a schematic structural diagram of a computer device according to a third embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.

The term "distributed computing cluster" as used herein may be a cluster computing in a distributed heterogeneous hardware environment; batch-wise computation scenarios, i.e., computation of batch-wise data; the bottom-layer operator of the data can support the operation of the CPU and the GPU, and the acceleration ratio calculated according to the embodiment of the invention can determine whether the CPU or the GPU is used for calculation, so that the effect is better. In a distributed computing cluster, after the working nodes in the cluster are computed, the results need to be aggregated and then computed, namely, aggregation operation.

The term "calculation operator" as used herein may be an individual operator into which an algorithm in high performance computing is split, e.g., an element filtering (filter) operator and an interior connection (join) operator, etc.

The term "calculation category" used herein may be set in advance by an operator according to the calculation characteristics of the calculation operator, i.e., the relationship between the data amount and the calculation amount, and the relationship between the data amount and the calculation amount is generally linear or exponential. If the filter operator is a linear operator subclass, the join operator is an index operator subclass. The length of computation time for computing the same amount of data for different computation categories affects differently.

The term "preset single-machine input-output data amount" used herein may be a general term for the single-machine input data amount and the single-machine output data amount of the single-machine operation calculation operator.

The term "CPU cluster characteristic parameter" used herein may include a single-core CPU-to-single-core GPU performance ratio, a network transmission speed between cluster nodes, a total number of machines in a cluster, a total number of machines in each aggregation operation, and the like, and may be configured by a user or obtained by a program.

The term "GPU standalone feature parameter" used herein may include bus transmission efficiency, GPU parallelism, etc., and may be configured by a user or obtained by a program.

The term "GPU cluster characteristic parameter" used herein may include a single-core CPU to single-core GPU performance ratio, a network transmission speed between cluster nodes, a total number of machines in a cluster, a total number of machines in each aggregation operation, a GPU parallelism, a GPU data compression rate, a GPU compression and decompression speed, and the like, and may be configured by a user or obtained by a program.

The term "distributed CPU computation total time consumption" as used herein may be the length of time a computation operator needs to consume to run at a CPU in a distributed computation cluster under a certain amount of data.

The term "distributed GPU compute total time consumption" as used herein may be the length of time a compute operator needs to consume to run a GPU in a distributed compute cluster under a certain amount of data.

The term "stand-alone CPU computation time" as used herein may be the length of time a computation operator needs to spend running a CPU in a single machine under a certain amount of data.

The term "stand-alone GPU compute time consumption" as used herein may be the length of time that a compute operator needs to consume to run a GPU in a single machine under a certain amount of data.

The term "GPU acceleration ratio" as used herein may be a quantification of the acceleration effect of a computing operator running at the GPU corresponding to running at the CPU.

Example one

Fig. 1 is a flowchart of a resource scheduling method according to an embodiment of the present invention, where this embodiment is applicable to determining a situation that a computing operator runs in a CPU or a GPU in high-performance computing, and the method may be executed by a resource scheduling apparatus, where the apparatus may be implemented in a software and/or hardware manner, and the apparatus may be integrated in a processor, as shown in fig. 1, and the method specifically includes:

and step 110, acquiring a calculation operator to be operated in the distributed calculation cluster.

In the embodiment of the present invention, in a distributed computing cluster, a specification of a CPU adopted by a high performance computing server is Inter E52600, a specification of a GPU is NVIDIA GTX 1080Ti, a motherboard adopts a PCI-E3.0 specification, and the GPU uses a PCIE8X slot as an example, so as to perform an example of calculating a GPU speed-up ratio.

In the embodiment of the present invention, the running hardware of each computation operator in an algorithm may be determined to implement resource scheduling, and the computing capability of distributed heterogeneous hardware is maximally utilized to implement optimal performance. When a database system receives a Structured Query Language (SQL) statement, the SQL statement is analyzed and decomposed into an SQL physical execution plan, and the physical execution plan is composed of SQL operators, and the SQL operators can be used as calculation operators to be operated. For example, executing the where compare operation, the database system receives SQL statement 1: the selection from Data _100G where id _ int <10, the main calculation operator after the analysis of the SQL statement 1 is a filter operator, the filter operator can be used as a calculation operator to be operated, the calculation amount of the SQL statement 1 is in a linear relation with the Data amount, and is a linear operator category; for another example, executing Join operation, the database system receives SQL statement 2: the main calculation operator after the SQL statement 2 is analyzed is a join operator, the join operator can be used as a calculation operator to be operated, and the calculation amount and the Data amount of the SQL statement 2 are in an exponential relationship and are in an exponential calculation subclass. The resource scheduling example is described by taking the computing operator as a filter operator and a join operator as examples, but the embodiment of the present invention is not limited to determining the operating hardware of the computing operator.

And 120, calculating the total time consumed by the distributed CPU matched with the calculation operator according to the characteristic parameters of the CPU cluster matched with the distributed calculation cluster, the preset single-machine input and output data quantity and the calculation category of the calculation operator.

In an implementation manner of the embodiment of the present invention, optionally, calculating the total computation time of the distributed CPU matched with the computation operator according to the CPU cluster characteristic parameter, the preset single-machine input/output data amount, and the computation category of the computation operator includes: calculating the calculation time consumption of the single CPU matched with the calculation operator according to the single input data quantity and the calculation category; and calculating the total calculation time of the distributed CPU matched with the calculation operator according to the calculation time of the single CPU, the single output data quantity and the characteristic parameters of the CPU cluster.

It is assumed that the time consumption of every 100 megabytes (M) of data on the CPU is t (t ranges from 0.001 second to 10 seconds), and the single-machine input data amount is D Gigabytes (GB). Then, when the calculation operator is in the linear operator category, the single-machine CPU matched with the calculation operator calculates the consumed time T ═ T/0.1 ═ D ═ 10T × (D); when the calculation operator is an exponential calculation subclass (such as a square relation), a single CPU matched with the calculation operator calculates the consumed time T ═ T/0.1 × (T/0.1) × D ═ 100T × D.

In an implementation manner of the embodiment of the present invention, optionally, calculating the total calculation time of the distributed CPUs matched with the calculation operator according to the calculation time of the stand-alone CPU, the stand-alone output data amount, and the CPU cluster characteristic parameter includes: determining aggregation times and aggregation node numbers according to the total number of machines in the distributed computing cluster and the total number of machines in each aggregation operation; and calculating the total calculation time of the distributed CPU according to the aggregation times, the number of aggregation nodes, the calculation time of the single CPU, the single output data quantity and the network transmission speed among the cluster nodes.

The total time t (CPU) calculated by the distributed CPU may be obtained by calculating the formula t (CPU) ═ t (cal) + t (poly) + t (trans). Wherein, T (cal) is the calculation time T of the single CPU, and T (poly) is the total calculation time of the calculation operator when performing the aggregation operation on the CPU, and can be obtained by the formula T (poly) T (cal) aggregation times; the number of polymerizations may be log_k(n), where n is the total number of machines in the distributed computing cluster, and k is the total number of machines in each aggregation operation, that is, in the distributed computing cluster, n machines exist, and each k machines perform aggregation computation, where k may be 10 and the aggregation number is lg (n); t (trans) refers to the transmission time between network nodes during the calculation of operator aggregation operation, and can be obtained by the formula t (trans) ═ single output data volume × aggregation node number/network transmission speed between cluster nodes; wherein, the single-machine output data volume can be assumed to be 10% of the single-machine input data volume, i.e. assumed to be 0.1 × D; the number of aggregation nodes can be obtained by the formula n (1+1/k +1/(k × k) + …), wherein the aggregation operation requires aggregation of the aggregation result to one machine, so the last term in the solution formula of the number of aggregation nodes is 1.

In an embodiment of the present invention, the inter-cluster network may use a 10 Gigabit (Gb) ethernet card or a 100Gb Infiniband (IB) card. When a 10Gb ethernet card is used, the actual network transmission speed between cluster nodes is 1 giga Per Second (Gbps); the distributed CPU calculates the total elapsed time T (CPU) ═ T + lg (n) + (D × n/9)/(1/8) ═ T (1+ lg (n)) +8D × n/9. Wherein, (D × n/9)/(1/8) represents t (trans), D × n/9 represents the total amount of aggregated data, which can be calculated by the formula stand-alone output data amount × aggregate node number ═ 0.1 × D n (1+1/k +1/(k × k) + …), with k as 10 cases, the result is approximated as D × n/9; 1 represents the network transmission speed between cluster nodes as 1Gbps, and the division by 8 is to satisfy the conversion of the actual transmission speed unit from Gbps to gigabit Per Second (Gbps). When the IB card of 100Gb is adopted, the actual network transmission speed between the cluster nodes is 80 Gbps; the distributed CPU calculates the total elapsed time T (CPU) ═ T + lg (n) + (D × n/9)/(80/8) ═ T (1+ lg (n)) + D × n/90. Wherein, T is the calculation time of the calculation operator on the single CPU, namely the calculation time T (cal) of the single CPU.

And step 130, calculating the total calculation time of the distributed GPU matched with the calculation operator according to the GPU single-computer characteristic parameters matched with the distributed calculation cluster, the GPU cluster characteristic parameters, the single-computer input and output data volume and the calculation category.

In an implementation manner of the embodiment of the present invention, optionally, calculating the total computation time of the distributed GPU matched with the computation operator according to the GPU standalone feature parameter, the GPU cluster feature parameter, the standalone input/output data amount, and the computation category matched with the distributed computation cluster includes: calculating the calculation time consumption of the single-machine GPU according to the calculation type, the characteristic parameters of the single machine of the GPU and the input and output data quantity of the single machine; and calculating the total calculation time of the distributed GPU matched with the calculation operator according to the calculation time of the stand-alone GPU, the GPU cluster characteristic parameters and the stand-alone input data volume.

In an implementation manner of the embodiment of the present invention, optionally, calculating the calculation time of the stand-alone GPU according to the calculation type, the GPU stand-alone feature parameter, and the stand-alone input/output data amount includes: and calculating the calculation time consumption of the single-machine GPU according to the calculation type, the bus transmission efficiency, the GPU parallelism and the single-machine input and output data quantity.

In a specific implementation manner of the embodiment of the present invention, a PCIE3.0 bus is used as the bus, the theoretical bus transmission efficiency is 16 Gigabits Per Second (GB/s), and the actual test speed is 12.8 GB/s. The performance ratio of the single-core CPU to the single-core GPU is assumed to be 1, and the computation time T ' (cal) of the stand-alone GPU is calculated, that is, the computation time T ' of the computation operator on the stand-alone GPU can be obtained by the formula T ' ═ total input/output data/bus transmission efficiency + T (cal)/m. Wherein, T (cal) can be obtained by calculation of calculation type and single machine input data amount. Assuming that the time consumption of each 100M data on the CPU is t (the value range of t is 0.001-10 seconds), and the single-machine input data volume is D, when the calculation operator is in a linear operator category, T (cal) t/0.1D-10 t D; when the calculation operator is an exponential calculation subclass (such as a square relation), t (cal) ═ t/0.1 ═ D ═ 100t × (t/0.1) × (D). Assume that the amount of standalone output data is 10% of the amount of standalone input data, i.e. 0.1 × D. And m is the GPU parallelism, and the value range is 64-1024. In an example of an embodiment of the present invention, the stand-alone GPU calculates the elapsed time T' ═ 1+0.1 × D/12.8+ T/m.

In an embodiment of the present invention, optionally, calculating the total calculation time of the distributed GPU matched with the calculation operator according to the calculation time of the stand-alone GPU, the GPU cluster feature parameters, and the stand-alone input data volume includes: determining aggregation times and aggregation node numbers according to the total number of machines in the distributed computing cluster and the total number of machines in each aggregation operation; calculating the total calculation time of the distributed GPU according to the aggregation times, the number of aggregation nodes, the calculation time of the single GPU, the single output data volume and the network transmission speed among the cluster nodes; or calculating the total time consumed by the distributed GPU according to the aggregation times, the number of aggregation nodes, the single-machine GPU calculation time, the single-machine output data volume, the network transmission speed among cluster nodes, the GPU data compression ratio and the GPU compression and decompression speed.

The distributed GPU total calculation time can be divided into the distributed GPU total calculation time when GPU compression and decompression functions are considered and the GPU compression and decompression functions are not considered. When the GPU compression and decompression function is not considered, the total time consumption T (GPU) calculated by the distributed GPU can be obtained by calculating the formula T (GPU) ═ T '(cal) + T' (poly) + T (trans). In the formula, T ' (poly) refers to the total calculation duration of the calculation operator when performing the aggregation operation on the GPU, and can be obtained by using the formula T ' (poly) ═ T ' (cal) × aggregation times; the method of determining the number of polymerization times and T (trans) is consistent with the explanation in step 120.

When the GPU compression and decompression functions are considered, the total time consumed by the distributed GPU (GPU + CP/UP) may be calculated by using the formula T (GPU + CP/UP) ═ D/v +0.1 × D/v + T '(cal) + T' (poly) + T (trans) × c. In the formula, D refers to a single-computer input data volume, 0.1 × D refers to a single-computer output data volume, v refers to a GPU compression and decompression speed, D/v refers to a single-computer input data decompression time, 0.1 × D/v refers to a single-computer output data compression time, and c refers to a GPU data compression rate.

In an embodiment of the present invention, when the inter-cluster network employs a 10Gb ethernet card, the distributed GPU calculates the total consumed time T (GPU) (+ T' × lg (n)) + (D × n/9)/(1/8) ((11D/128 + T/m)) (1+ lg (n)) +8D × n/9, regardless of the GPU compression and decompression functions. When the IB card of 100Gb is used in the inter-cluster network, the distributed GPU calculates the total consumed time T (GPU) ═ T '+ T' × lg (n) +(D × n/9)/(80/8) (11D/128+ T/m) ((1 + lg (n)) + D × n/90), regardless of the GPU compression and decompression functions. When a 10Gb ethernet card is used in the inter-cluster network, the distributed GPU calculates the total consumed time T (GPU + CP/UP) (/ D/v +0.1 × D/v + T' (/ lg) (n) + (dn × n/9) (/ 1/8) (/ 1.1 × D/v + (11D/128+ T/m) (1+ lg) (n)) +8D × n c/9), taking into account the GPU compression and decompression functions. When the network between clusters adopts 100Gb IB cards, considering GPU compression and decompression functions, the distributed GPU calculates the total consumed time T (GPU + CP/UP) (/ D/v +0.1 × D/v + T' (/ g) (n) + (D × n/9) (/ 80/8) (/ 1.1 × D/v + (11D/128+ T/m) ((1 + lg) (n) + D × n/90).

And 140, calculating a GPU acceleration ratio corresponding to the calculation operator according to the total calculation time consumption of the distributed heterogeneous CPUs and the total calculation time consumption of the distributed heterogeneous GPUs.

The GPU acceleration ratio corresponding to the calculation operator can be obtained by calculating the ratio of the total time consumed by the distributed heterogeneous GPU and the total time consumed by the distributed heterogeneous CPU of the calculation operator with the same calculation complexity.

In an embodiment of the present invention, when a 10Gb ethernet card is used as the inter-cluster network, the GPU speed-up ratio N ═ T (cpu)/T (GPU) ([ T + T × lg), (N) + (D × N/9)/(1/8) ]/[ T' + l g (N) + (D ÷ N/9)/(1/8) ]/[ (11D/128+ T/m) ((1 + lg (N)) +8D × N/9) ]/(8D/9) regardless of the GPU compression and decompression functions. Assuming that N is 1000, N is (4T +8000D/9)/(44D/128+4T/m +8000D/9) — (T +2000D/9)/(11D/128+ T/m + 2000D/9).

Fig. 2a is a schematic diagram illustrating a relationship between a GPU acceleration ratio and time consumption for calculating the hundred million data, and as shown in fig. 2a, when a network between clusters uses a 10Gb ethernet card, a GPU compression and decompression function is not considered, and when a calculation operator is a linear operator type (T ═ 10T × D), a relationship between a GPU acceleration ratio N and time consumption for calculating the hundred million data is N ═ 40T ═ D +8000D/9)/(44D/128+40T ═ D/m + 8000D/9)/(T +200/9)/(T/m + 256099/11520). When the GPU parallelism m is 64, N is (t +200/9)/(t/64+ 256099/11520).

Fig. 2b is a schematic diagram illustrating a relationship between a GPU acceleration ratio and a hundred million data calculation time, and as shown in fig. 2b, when a network between clusters uses a 10Gb ethernet card, GPU compression and decompression functions are not considered, and a calculation operator is an exponential calculation type (T ═ 100T × D), a relationship between a GPU acceleration ratio N and a hundred million data calculation time T is N ═ T +20/9)/(T ═ T/m + 256099/115200. When the GPU parallelism m is 64, N ═ t +20/9)/(t × (t + t/64+ 256099/115200).

In one embodiment of the invention, when the 100Gb IB card is used in the inter-cluster network, irrespective of the GPU compression and decompression functions, the GPU acceleration ratio N ═ T (cpu)/T (GPU) ([ T + T × lg (N) + (D × N/9)/(80/8) ]/[ T' + lg (N) + (D × (N/9)/(80/8) ], [ T × (1+ lg (N) + D × (90)/[ (11D/128+ T/m) ((1 + lg) (N) + D × (N) + D/90 ] (4T +100D/9)/(44D/128+4T/m +100D/9) ((T +25D/9)/(11D/128+ T/m + 25D/9)) corresponding to the calculation operator.

Fig. 2c is a schematic diagram illustrating a relationship between a GPU acceleration ratio and a hundred mega data calculation time, and as shown in fig. 2c, when the inter-cluster network uses a IB card of 100Gb, and the GPU compression and decompression functions are not considered, and the calculation operator is a linear operator type (T ═ 10T × D), a relationship between a GPU acceleration ratio N and a hundred mega data calculation time T is (T +5/18)/(T/m + 3299/11520). When the GPU parallelism m is 64, N is (t +5/18)/(t/64+ 3299/11520).

Fig. 2D is a schematic diagram illustrating a relationship between a GPU acceleration ratio and a hundred million data calculation time, and as shown in fig. 2D, when the inter-cluster network uses a IB card of 100Gb, the GPU compression and decompression functions are not considered, and the calculation operator is an exponential calculation type (T ═ 100T × D), the relationship between the GPU acceleration ratio N and the hundred million data calculation time T is N ═ T +1/36)/(T ═ T/m + 3299/115200. When the GPU parallelism m is 64, N ═ t +1/36)/(t × (t + t/64+ 3299/115200).

In an embodiment of the present invention, when a 10Gb ethernet card is used as the inter-cluster network, the GPU compression and decompression functions are considered, and the GPU acceleration ratio N ═ T (cpu)/T (GPU + CP/UP) ═ T ± (1+ lg (N)) +8D × N/9]/[1.1D/v + (11D/128+ T/m) (1+ lg (N)) +8D × N c/9] ═ 4T +8000D/9)/(1.1D/v +11D/32+4T/m + 8000D/c/9) ((T +2000D/9)/(11D/40v +11D/128+ T/m +2000D × c/9) corresponding to the calculation operator is calculated. When c is 30% and v is 100MB/s is 0.1GB/s, N is (T +2000D/9)/(363D/128+ T/m + 200D/3).

When the inter-cluster network adopts a 10Gb Ethernet card, the GPU compression and decompression functions are considered, and when a calculation operator is a linear operator type (T-10T-D), the GPU acceleration ratio N and the hundred-megabyte data calculation time consumption T are in the relation of (T +200/9)/(T/m + 26689/3840). When the GPU parallelism m is 64, N is (t +200/9)/(t/64+ 26689/3840).

When the inter-cluster network adopts a 10Gb Ethernet card, the GPU compression and decompression functions are considered, and when the calculation operator is an exponential operator type (T is 100T D), the GPU acceleration ratio N is in relation with the hundred-megabyte data calculation time consumption T, namely N is (T + 20/9)/(T/m + 26689/38400). When the GPU parallelism m is 64, N ═ t +20/9)/(t × (t + t/64+ 26689/38400).

In an embodiment of the present invention, when the inter-cluster network uses a 100Gb IB card, the GPU compression and decompression functions are considered, and the GPU acceleration ratio N ═ T (cpu)/T (GPU + CP/UP) ═ T ± (1+ lg (N)) + D × (N/90 ]/[1.1D/v + (11D/128+ T/m) ((1 + lg (N)) + D × (N)/c/90 ] ═ 4T +100D/9)/(1.1D/v +11D/32+4T/m +100D ± (T +25D/9)/(11D/40v + T/m +5D/6) corresponding to the calculation operator is calculated.

Fig. 2e is a schematic diagram illustrating a relationship between a GPU acceleration ratio and a hundred million data calculation time, and as shown in fig. 2e, when the inter-cluster network uses a IB card of 100Gb, and the GPU compression and decompression functions are considered, and the calculation operator is a linear operator type (T ═ 10T × D), a relationship between a GPU acceleration ratio N and a hundred million data calculation time T is (T +5/18)/(T/m + 1409/3840). When the GPU parallelism m is 64, N is (t +5/18)/(t/64+ 1409/3840).

Fig. 2f is a schematic diagram of a relationship between a GPU acceleration ratio and hundred million data calculation time, and as shown in fig. 2f, when the inter-cluster network uses a IB card of 100Gb, the GPU compression and decompression functions are considered, and the calculation operator is an exponential calculation type (T ═ 100T × D), the relationship between the GPU acceleration ratio N and the hundred million data calculation time T is N ═ T +1/36)/(T ═ T/m + 1409/38400. When the GPU parallelism m is 64, N ═ t +1/36)/(t × (t + t/64+ 1409/38400).

Step 150, if the GPU acceleration ratio is larger than a preset value, determining to operate a calculation operator in the GPU; otherwise, determining to run a calculation operator in the CPU.

The preset value can be 1 or a value larger than 1, when the GPU acceleration ratio is larger than 1, the GPU is considered to have an acceleration effect, and the calculation operators can run on the GPU, so that the calculation performance can be improved. When the GPU acceleration ratio is less than or equal to 1, the GPU is considered to have no acceleration effect, the calculation operator can continue to operate in the CPU, the acceleration performance of the GPU can be evaluated before resource scheduling, the operating hardware of the calculation operator is determined, and resources are efficiently scheduled.

In an embodiment of the present invention, when the database system receives the SQL statement, the SQL statement is parsed and decomposed into an SQL physical execution plan, and the physical execution plan is composed of SQL operators, for example, in an example of the embodiment of the present invention, the main computing operator after the SQL statement 1 is parsed is a filter operator, and the main computing operator after the SQL statement 2 is parsed is a join operator. According to the statistical result, the calculation of the hundred million data in the CPU consumes time, the filter operator is 5 milliseconds, and the join operator is 200 milliseconds. The filter operator is a linear operator category, and when the SQL statement 1 is processed, an acceleration ratio calculation formula can be adopted when the calculation operator is the linear operator category; the join operator is an exponent operator category, and when the SQL statement 2 is processed, an acceleration ratio calculation formula can be adopted when the calculation operator is the exponent operator category.

Illustratively, a 10Gb ethernet card is used in the inter-cluster network, and the GPU compression and decompression functions are not considered, when the GPU parallelism is 64, as shown in fig. 2a, when the computation time t of hundred megabytes of data in the CPU is small, the acceleration effect is not obvious, and the acceleration ratio is between 1 and 2. The acceleration ratio N of the embodiment of the invention is 1 when the filter operator is operated, which shows that at this time, the GPU has no acceleration effect, the database system can continuously adopt the CPU to operate the filter operator to complete the calculation, and the calculation capability of the distributed heterogeneous hardware can be maximally utilized, so that the distributed heterogeneous cluster computing system can realize the optimal performance. As shown in fig. 2b, the acceleration ratio N of the embodiment of the present invention when the join operator is operated is 1, which indicates that at this time, the GPU has no acceleration effect, and the database system may continue to use the CPU to operate the join operator to complete the calculation, so that the computing power of the distributed heterogeneous hardware can be maximally utilized, and the distributed heterogeneous cluster computing system can achieve the best performance.

For example, the speed of the network is crucial to cluster acceleration, and in order to illustrate the influence of the network transmission speed between cluster nodes on the acceleration ratio, the 10Gb ethernet card used in fig. 2a and 2b is changed to a 100Gb IB card in the embodiment of the present invention, and the results are shown in fig. 2c and 2d, respectively. As shown in fig. 2c, the acceleration ratio N of the embodiment of the present invention when the filter operator is operated is 0.97, and the acceleration ratio is less than 1, which indicates that at this time, the GPU has no acceleration effect, and the database system may continue to use the CPU to operate the filter operator to complete the calculation, so that the computing capability of the distributed heterogeneous hardware can be maximally utilized, and the distributed heterogeneous cluster computing system can achieve the best performance. As shown in fig. 2d, the acceleration ratio N of the embodiment of the present invention when running the join operator is 2.3, and the acceleration ratio is greater than 1, which indicates that at this time, the acceleration effect of the GPU is good, and the database system can use the GPU to run the join operator to complete the calculation, so that the calculation capability of the distributed heterogeneous hardware can be maximally utilized, and the distributed heterogeneous cluster calculation system can achieve the best performance.

For example, there may be an effect of the compression and decompression functions of the GPU on cluster acceleration, and to illustrate the effect of the compression and decompression functions of the GPU on the acceleration ratio, the embodiment of the present invention introduces the compression and decompression functions of the GPU on the basis of fig. 2c and fig. 2d, and the results are shown in fig. 2e and fig. 2f, respectively. As shown in fig. 2e, the acceleration ratio N of the embodiment of the present invention when the filter operator is operated is 3.2, and the acceleration ratio is greater than 1, which indicates that at this time, the acceleration effect of the GPU is good, and the database system can use the GPU to operate the filter operator to complete the calculation, so that the computing capability of the distributed heterogeneous hardware can be maximally utilized, and the distributed heterogeneous cluster computing system can achieve the best performance. As shown in fig. 2f, when the join operator is operated, the acceleration ratio N is 1.82, and the acceleration ratio is greater than 1, which indicates that at this time, the acceleration effect of the GPU is good, and the database system can use the GPU to operate the join operator to complete the calculation, so that the calculation capability of the distributed heterogeneous hardware can be maximally utilized, and the distributed heterogeneous cluster calculation system can achieve the best performance.

The technical scheme provided by the embodiment of the invention can evaluate the acceleration effect of the GPU before algorithm calculation, can consider the influence of various fine-grained factors such as the total number of machines in a distributed calculation cluster, the total number of machines in each aggregation operation, bus transmission efficiency, GPU parallelism, network transmission speed among cluster nodes, GPU data compression ratio, GPU compression and decompression speed and the like on the acceleration effect of the GPU, can avoid the expense of calculating and measuring first each time, is more accurate and more efficient compared with the prior art, can dynamically schedule distributed batch calculation according to the acceleration ratio, maximizes the calculation capability of hardware and realizes the optimal performance of the hardware. It should be noted that the technical implementation form of the embodiment of the present invention is not limited to a single algorithm or a single calculation operator, and may be implemented by being embedded in a hardware system, and the specific implementation principle is consistent.

Example two

Fig. 3 is a schematic structural diagram of a resource scheduling apparatus according to a second embodiment of the present invention. With reference to fig. 3, the apparatus comprises: the system comprises a calculation operator acquisition module 310, a distributed CPU calculation total consumption time calculation module 320, a distributed GPU calculation total consumption time calculation module 330, a GPU acceleration ratio calculation module 340 and a calculation operator operation hardware determination module 350.

The calculation operator obtaining module 310 is configured to obtain a calculation operator to be operated in the distributed calculation cluster;

the distributed CPU total computation time consumption computation module 320 is used for computing the total computation time consumption of the distributed CPUs matched with the computation operators according to the CPU cluster characteristic parameters matched with the distributed computation clusters, the preset single-machine input and output data quantity and the computation categories of the computation operators;

the distributed GPU total time consumption calculation module 330 is used for calculating the total time consumption of distributed GPU calculation matched with the calculation operators according to the GPU single-computer characteristic parameters matched with the distributed calculation clusters, the GPU cluster characteristic parameters, the single-computer input and output data volume and the calculation categories;

and the GPU acceleration ratio calculation module 340 is configured to calculate a GPU acceleration ratio corresponding to the calculation operator according to the total distributed heterogeneous CPU calculation time and the total distributed heterogeneous GPU calculation time.

A calculation operator operation hardware determination module 350, configured to determine to operate a calculation operator in the GPU if the GPU acceleration ratio is greater than a preset value; otherwise, determining to run a calculation operator in the CPU.

On the basis of the foregoing embodiments, optionally, the module 320 for calculating the total consumed time by the distributed CPU includes: the single-computer CPU calculation time consumption calculation unit is used for calculating the single-computer CPU calculation time consumption matched with the calculation operator according to the single-computer input data quantity and the calculation type; and the distributed CPU calculates the total time consumption calculating unit, and is used for calculating the distributed CPU calculating total time consumption matched with the calculating operator according to the single CPU calculating time consumption, the single output data quantity and the CPU cluster characteristic parameters.

On the basis of the foregoing embodiments, optionally, the distributed CPU calculates a total consumption time calculation unit, including: the aggregation times and the aggregation node number determining subunit are used for determining the aggregation times and the aggregation node number according to the total number of machines in the distributed computing cluster and the total number of machines in each aggregation operation; and the distributed CPU calculates the total time consumption and calculates the total time consumption of the distributed CPU according to the aggregation times, the aggregation node number, the single CPU calculation time consumption, the single output data volume and the network transmission speed among the cluster nodes.

On the basis of the foregoing embodiments, optionally, the distributed GPU calculates the total consumed time calculation module 330, which includes: the single-computer GPU calculation time consumption calculation unit is used for calculating the calculation time consumption of the single-computer GPU according to the calculation type, the GPU single-computer characteristic parameters and the single-computer input and output data volume; and the distributed GPU total time consumption calculating unit is used for calculating the distributed GPU total time consumption matched with the calculating operator according to the single-machine GPU calculation time consumption, the GPU cluster characteristic parameters and the single-machine input data volume.

On the basis of the foregoing embodiments, optionally, the calculation time consumption calculation unit of the stand-alone GPU includes: and the single-machine GPU time consumption calculating subunit is used for calculating the time consumption of the single-machine GPU according to the calculation type, the bus transmission efficiency, the GPU parallelism and the single-machine input and output data quantity.

On the basis of the foregoing embodiments, optionally, the distributed GPU calculates the total consumption time calculation unit, and includes: the aggregation times and the aggregation node number determining subunit are used for determining the aggregation times and the aggregation node number according to the total number of machines in the distributed computing cluster and the total number of machines in each aggregation operation; the distributed GPU calculates the total consumed time first calculating subunit, is used for calculating the total consumed time of the distributed GPU according to the aggregation times, the number of aggregation nodes, the single-machine GPU calculated consumed time, the single-machine output data volume and the network transmission speed among the cluster nodes; or the second calculating subunit is used for calculating the total time consumption of the distributed GPU according to the aggregation times, the number of the aggregation nodes, the single-machine GPU calculation time consumption, the single-machine output data volume, the network transmission speed among the cluster nodes, the GPU data compression ratio and the GPU compression and decompression speed.

The resource scheduling device provided by the embodiment of the invention can execute the resource scheduling method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

EXAMPLE III

Fig. 4 is a schematic structural diagram of a computer device according to a third embodiment of the present invention, and as shown in fig. 4, the computer device includes:

one or more processors 410, one processor 410 being illustrated in FIG. 4;

a memory 420;

the apparatus may further comprise: an input device 430 and an output device 440.

The processor 410, the memory 420, the input device 430 and the output device 440 in the apparatus may be connected by a bus or other means, for example, in fig. 4.

The memory 420 serves as a non-transitory computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to a resource scheduling method in the embodiment of the present invention (for example, the calculation operator acquisition module 310, the distributed CPU calculation total consumption time calculation module 320, the distributed GPU calculation total consumption time calculation module 330, the GPU acceleration ratio calculation module 340, and the calculation operator operation hardware determination module 350 shown in fig. 3). The processor 410 executes various functional applications and data processing of the computer device by executing the software programs, instructions and modules stored in the memory 420, namely, a resource scheduling method for implementing the above method embodiments is realized, that is:

calculating the total time consumption of distributed CPU calculation matched with a calculation operator according to the characteristic parameters of the CPU cluster matched with the distributed calculation cluster, the input and output data quantity of a preset single machine and the calculation category of the calculation operator;

calculating the total time consumption according to the distributed heterogeneous CPU and the distributed heterogeneous GPU, and calculating a GPU acceleration ratio corresponding to a calculation operator;

The memory 420 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 420 may optionally include memory located remotely from processor 410, which may be connected to the terminal device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus. The output device 440 may include a display device such as a display screen.

Example four

A fourth embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a resource scheduling method according to the fourth embodiment of the present invention:

Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method for scheduling resources, comprising:

2. The method of claim 1, wherein calculating the total computation time of the distributed CPUs matched with the computation operators according to the CPU cluster feature parameters matched with the distributed computation clusters, the preset single-machine input and output data amount and the computation categories of the computation operators comprises:

calculating the calculation time consumption of the single CPU matched with the calculation operator according to the single input data quantity and the calculation type;

and calculating the total calculation time of the distributed CPU matched with the calculation operator according to the calculation time of the single CPU, the single output data quantity and the CPU cluster characteristic parameters.

3. The method of claim 2, wherein calculating the total computation time of the distributed CPUs matched with the computation operator according to the computation time of the stand-alone CPUs, the stand-alone output data amount and the CPU cluster characteristic parameters comprises:

determining aggregation times and aggregation node numbers according to the total number of machines in the distributed computing cluster and the total number of machines in each aggregation operation;

and calculating the total calculation time of the distributed CPU according to the aggregation times, the aggregation node number, the single CPU calculation time, the single output data volume and the network transmission speed among the cluster nodes.

4. The method of claim 1, wherein calculating the total computation time of the distributed GPU matched with the computation operator according to the GPU standalone feature parameter, the GPU cluster feature parameter, the standalone input/output data amount, and the computation category matched with the distributed computation cluster comprises:

calculating the calculation time consumption of the single-computer GPU according to the calculation type, the GPU single-computer characteristic parameters and the single-computer input and output data quantity;

and calculating the total calculation time of the distributed GPU matched with the calculation operator according to the calculation time of the stand-alone GPU, the GPU cluster characteristic parameters and the stand-alone input data volume.

5. The method of claim 4, wherein calculating the stand-alone GPU computation time according to the computation category, the GPU stand-alone feature parameters, and the stand-alone input and output data volume comprises:

and calculating the calculation time consumption of the single-machine GPU according to the calculation type, the bus transmission efficiency, the GPU parallelism and the single-machine input and output data quantity.

6. The method of claim 4, wherein calculating the total distributed GPU computation time matched with a computation operator according to the stand-alone GPU computation time, the GPU cluster feature parameters and the stand-alone input data amount comprises:

calculating the total calculation time of the distributed GPU according to the aggregation times, the aggregation node number, the single-machine GPU calculation time, the single-machine output data volume and the network transmission speed among the cluster nodes; alternatively, the first and second electrodes may be,

and calculating the total calculation time of the distributed GPU according to the aggregation times, the aggregation node number, the single-machine GPU calculation time, the single-machine output data volume, the network transmission speed among cluster nodes, the GPU data compression ratio and the GPU compression and decompression speed.

7. A computer device comprising a processor and a memory, the memory to store instructions that, when executed, cause the processor to:

8. The computer device of claim 7, wherein the processor is configured to calculate the total time consumed by the distributed CPU computation matched to the computation operator by:

9. The computer device of claim 8, wherein the processor is configured to calculate the total time consumed by the distributed CPU computation matched to the computation operator by:

10. The computer device of claim 8, wherein the processor is configured to calculate the total time consumed for computing the distributed GPU matched to the compute operator by:

11. The computer device of claim 10, wherein the processor is configured to calculate the stand-alone GPU computation time by:

12. The computer device of claim 10, wherein the processor is configured to calculate the total time consumed for computing the distributed GPU matched to the compute operator by:

13. A storage medium for storing instructions for performing the resource scheduling method of any one of claims 1-6.