CN111736989A - Multi-mode distributed cluster GPU index detection method and system - Google Patents


Info

Publication number
CN111736989A
CN111736989A (application CN202010506445.2A)
Authority
CN
China
Prior art keywords
gpu
information
working
mode
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010506445.2A
Other languages
Chinese (zh)
Other versions
CN111736989B (en)
Inventor
张登银
李俊江
程义
寇英杰
周正
韩文生
康世博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202010506445.2A (granted as CN111736989B)
Priority to PCT/CN2020/110992 (published as WO2021243855A1)
Publication of CN111736989A
Priority to US17/369,909 (granted as US11734152B2)
Application granted
Publication of CN111736989B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system

Abstract

The invention discloses a multi-mode distributed cluster GPU (Graphics Processing Unit) index detection method and system. A GPU sniffer reads the mode value and the timer frequency from the environment variables of the working node, reads the number of GPUs and the GPU information parameters of the working node, calculates the node's GPU performance scores under different working modes, and reports the information. The memory compares the reported information with the database of the data plane and updates each field of the corresponding database record to the corresponding field of the reported information. The checker waits to receive and verify the reported information. By maintaining a GPU information list cache on the working node and comparing set fields on the data plane, the invention keeps the GPU information up to date while reducing the information reporting frequency and the cost of information transmission; a multi-mode scoring strategy highlights the diversity of GPU resources so as to adapt to the GPU computing requirements of more complex scenarios.

Description

Multi-mode distributed cluster GPU index detection method and system
Technical Field
The invention relates to a multi-mode distributed cluster GPU (Graphics Processing Unit) index detection method and system, and belongs to the technical field of cloud computing.
Background
In the field of cloud computing, GPUs (Graphics Processing Units) are used to accelerate the training of machine learning algorithms, and GPU training tasks and workflows are becoming increasingly diverse, placing different requirements on GPU performance indexes. However, most distributed clusters currently detect GPU performance indexes insufficiently: they can only detect the number of graphics cards and cannot detect the fine-grained performance indexes of the cards. As a result, they cannot adapt to the computing requirements of various complex scenarios, and tasks with specific GPU requirements are scheduled onto unsuitable nodes, which lowers the GPU resource utilization of the whole distributed cluster and degrades its overall performance.
GPUs are used more and more frequently in cloud computing services, and tasks that use GPUs keep emerging, which poses challenges for GPU resource scheduling. The rationality of GPU resource scheduling depends on the timeliness of distributed cluster GPU detection: the distributed cluster needs to detect the GPU state in time; otherwise, task distribution inside the cluster becomes unbalanced, the GPU resource scheduling result is affected, and the operating efficiency of the distributed cluster is indirectly reduced.
Disclosure of Invention
The object of the invention is to overcome the deficiencies of the prior art and to provide a multi-mode distributed cluster GPU index detection method and system that reduce the information reporting frequency, reduce the cost of information transmission, and adapt to the GPU computing requirements of more complex scenarios.
To achieve this purpose, the invention adopts the following technical solutions:
In a first aspect, the present invention provides a multi-mode distributed cluster GPU index detection method, the method comprising the following steps:
Step (1): checking whether the configuration file content of the working node exists; if it exists, reading the configuration file content of the working node and storing it into a GPU information list cache so as to communicate with the data plane, and if the communication is normal, executing step (2); if the communication fails, recording the failure reason locally, sending fault information to engineering personnel, waiting a rated time, and communicating with the data plane again until the communication is normal; if the configuration file content does not exist, recording the fault reason locally, sending fault information to engineering personnel, and ending the procedure;
Step (2): reading the mode value from the environment variables of the working node, and switching the working node's own working mode according to the mode value;
Step (3): reading the timer frequency from the environment variables of the working node so as to set, with the timer frequency, the time period for reading the GPU information parameters; reading the number of GPUs and the GPU information parameters of the working node and storing them into the GPU information list cache; if there is no GPU in the GPU information list cache, emptying the GPU information list cache; if the health state of a GPU is unhealthy, not adding that GPU's information parameters to the GPU information list cache; after step (5) is executed, resetting the timer, waiting for the next time period for reading the GPU information parameters, and executing step (3) again; the working node asynchronously and concurrently executes step (4);
Step (4): calculating the maximum value of each GPU information parameter of the working node and storing the maximum values into the GPU information list cache; if the GPU information list cache of the working node is empty, directly executing step (5); calculating the GPU performance scores of the working node under different working modes according to the GPU information parameter values and the maximum values of the GPU information parameters, and setting the GPU with the highest performance score as the MainGPU;
Step (5): initializing the sending information, and judging whether a GPU exists in the GPU information list cache of the working node; if not, packaging the information indicating that no GPU exists, loading the configuration file content cached in step (1), and sending the packaged information to the data plane for reporting; if a GPU exists, adding the information parameters of the MainGPU and their corresponding values to the fields of the sending information, calculating the number of GPUs and the sum of the GPU video memory capacity and adding them to the fields of the sending information, loading the configuration file content cached in step (1), sending the sending information to the data plane for reporting so that the checker can receive and verify it, and storing the sending information into the GPU information list cache;
Step (6): when information reporting is executed next time, comparing each field of the sending information in the GPU information list cache with each field of the newly generated sending information; if all fields are the same, not reporting; otherwise, overwriting the sending information in the GPU information list cache with the newly generated sending information, loading the configuration file content cached in step (1), and sending the newly generated sending information to the data plane to be received and verified by the checker.
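As an illustration of steps (5) and (6), the sketch below assembles the sending information and reports it only when some field differs from the cached copy; the field names and the HTTP/JSON transport are assumptions made for this example and are not fixed by the invention.

```python
import json
import urllib.request

last_sent = None   # cached copy of the previously reported sending information

def build_send_info(node_name, gpus, main_gpu):
    """Assemble the sending information from the MainGPU parameters and the node totals."""
    if not gpus:
        return {"node": node_name, "gpu_present": False}
    return {
        "node": node_name,
        "gpu_present": True,
        "main_gpu": main_gpu,                             # information parameters and values of the MainGPU
        "gpu_count": len(gpus),                           # number of GPUs on the node
        "total_memory": sum(g["memory"] for g in gpus),   # sum of the GPU video memory capacity
    }

def report_if_changed(send_info, data_plane_url):
    """Compare field by field with the cached copy and report only when something differs."""
    global last_sent
    if send_info == last_sent:        # every field identical: skip this report
        return False
    last_sent = send_info             # overwrite the cache with the newly generated sending information
    body = json.dumps(send_info).encode()
    req = urllib.request.Request(data_plane_url, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=3)
    return True
```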
With reference to the first aspect, further, the configuration file content of the working node in step (1) includes the IP address and port number of the data plane.
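For illustration, a minimal sketch of reading this configuration and establishing communication with the data plane is given below; the file path, field names and JSON format are assumptions for the example and are not specified by the invention.

```python
import json
import time
import urllib.request

CONFIG_PATH = "/etc/gpu-sniffer/config.json"   # hypothetical location of the working node's configuration file

def load_config(path=CONFIG_PATH):
    """Read the configuration file content (data plane IP address and port number)."""
    try:
        with open(path) as f:
            return json.load(f)                # e.g. {"data_plane_ip": "10.0.0.2", "data_plane_port": 8080}
    except FileNotFoundError:
        print("fault: configuration file content does not exist")   # record the fault locally
        return None

def wait_for_data_plane(cfg, rated_time=5):
    """Communicate with the data plane, retrying after a rated time until communication is normal."""
    url = f"http://{cfg['data_plane_ip']}:{cfg['data_plane_port']}/ping"
    while True:
        try:
            urllib.request.urlopen(url, timeout=3)
            return True                        # communication is normal: proceed to step (2)
        except OSError as err:
            print(f"communication failed: {err}")   # record the failure reason locally
            time.sleep(rated_time)                  # wait the rated time, then communicate again
```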
With reference to the first aspect, further, in step (2), the modes are divided into a resource priority mode, a high performance mode and an energy saving mode according to whether the mode value exists and what its value is; if the mode value is null, the working node switches its own working mode to the resource priority mode; if the mode value belongs to the high performance mode, it switches its working mode to the high performance mode; and if the mode value belongs to the energy saving mode, it switches its working mode to the energy saving mode.
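A sketch of this mode switch might look as follows; the environment variable name GPU_SNIFFER_MODE and the concrete string values of the modes are assumptions, since only the three modes and the null-value default are specified.

```python
import os

RESOURCE_PRIORITY, HIGH_PERFORMANCE, ENERGY_SAVING = "resource", "performance", "energy"

def select_mode():
    """Map the mode value in the environment variable to one of the three working modes."""
    value = os.environ.get("GPU_SNIFFER_MODE")      # hypothetical variable name
    if not value:                                   # mode value is null: resource priority mode
        return RESOURCE_PRIORITY
    if value == HIGH_PERFORMANCE:                   # mode value belongs to the high performance mode
        return HIGH_PERFORMANCE
    if value == ENERGY_SAVING:                      # mode value belongs to the energy saving mode
        return ENERGY_SAVING
    return RESOURCE_PRIORITY                        # unrecognized values fall back to the default
```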
With reference to the first aspect, further, in step (3), the effective range of the timer frequency is 0.1 to 10, and the unit of the timer frequency is seconds;
if the read timer frequency value is empty, the time period at which the GPU sniffer reads the GPU information parameters is set to 1 second by default, that is, the GPU information parameters of the working node are read again every 1 second; if the read timer frequency value exceeds 10, it is set to 10; if it is lower than 0.1, it is set to 0.1; if it is within the effective range, it is not reset;
the GPU information parameters comprise a GPU identification number, a GPU health state, a GPU model, GPU working power, GPU video memory frequency, GPU video memory capacity, GPU idle video memory, GPU core number and GPU bit width.
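The clamping of the timer frequency and the per-GPU information parameters can be sketched as follows; the environment variable name and the use of a plain dictionary per GPU are illustrative assumptions.

```python
import os

def read_timer_period():
    """Clamp the timer frequency (period in seconds) to the effective range 0.1-10, defaulting to 1."""
    raw = os.environ.get("GPU_SNIFFER_TIMER")       # hypothetical variable name
    if not raw:
        return 1.0                                  # empty value: default period of 1 second
    return min(max(float(raw), 0.1), 10.0)          # values outside the range are pulled back to 0.1 or 10

# One entry of the GPU information list cache, holding the parameters listed above.
example_gpu = {
    "gpu_id": 0,
    "health": "healthy",
    "model": "example-model",
    "power": 250.0,          # GPU working power
    "mem_clock": 7000.0,     # GPU video memory frequency
    "memory": 16384,         # GPU video memory capacity
    "free_memory": 12000,    # GPU idle video memory
    "cores": 3584,           # number of GPU cores
    "bus_width": 256,        # GPU bit width
}
```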
With reference to the first aspect, further, in step (4), the GPU performance score of the working node under different working modes is calculated as
Score = MemCWeight × (GMemoryClock / MaxMemClock) + CoreWeight × (GCores / MaxCores) + BandWeight × (GBandwidth / MaxBandwidth) + PowWeight × (GPower / MaxPower) + FreeMemWeight × (GFreeMemory / MaxFreeMemory) + MemoryWeight × (GMemory / MaxMemory)
wherein Score is the GPU performance score of the working node under different working modes;
MemCWeight is the GPU video memory frequency weight, CoreWeight is the GPU core number weight, BandWeight is the GPU bit width weight, PowWeight is the GPU working power weight, FreeMemWeight is the GPU idle video memory weight, and MemoryWeight is the GPU video memory capacity weight;
GMemoryClock is the GPU video memory frequency, GCores is the number of GPU cores, GBandwidth is the GPU bit width, GPower is the GPU working power, GMemory is the GPU video memory capacity, and GFreeMemory is the GPU idle video memory;
MaxMemClock is the maximum GPU video memory frequency, MaxCores is the maximum number of GPU cores, MaxBandwidth is the maximum GPU bit width, MaxPower is the maximum GPU working power, MaxMemory is the maximum GPU video memory capacity, and MaxFreeMemory is the maximum GPU idle video memory;
the GPU video memory frequency weight, GPU core number weight, GPU bit width weight, GPU working power weight, GPU idle video memory weight and GPU video memory capacity weight are set according to the different working modes and can be adjusted according to the real state of the current distributed cluster.
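A sketch of this score calculation, using the formula above, is given below; the per-mode weight values are purely illustrative assumptions and would in practice be adjusted to the real state of the cluster.

```python
# Illustrative weight tables for the three working modes (not values taken from the invention).
MODE_WEIGHTS = {
    "resource":    {"mem_clock": 0.1, "cores": 0.1, "bus_width": 0.1, "power": 0.1, "free_memory": 0.3, "memory": 0.3},
    "performance": {"mem_clock": 0.2, "cores": 0.3, "bus_width": 0.2, "power": 0.1, "free_memory": 0.1, "memory": 0.1},
    "energy":      {"mem_clock": 0.1, "cores": 0.1, "bus_width": 0.1, "power": 0.4, "free_memory": 0.2, "memory": 0.1},
}

def score_gpu(gpu, maxima, mode):
    """Weighted sum of each parameter value normalized by the node-wide maximum of that parameter."""
    weights = MODE_WEIGHTS[mode]
    return sum(weights[k] * gpu[k] / maxima[k] for k in weights if maxima[k])

def pick_main_gpu(gpus, mode):
    """Compute the per-parameter maxima over the node, score every GPU, and return the best one (the MainGPU)."""
    keys = ["mem_clock", "cores", "bus_width", "power", "free_memory", "memory"]
    maxima = {k: max(g[k] for g in gpus) for k in keys}
    return max(gpus, key=lambda g: score_gpu(g, maxima, mode))
```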
In a second aspect, the present invention further provides a multi-mode distributed cluster GPU index detection method, comprising the following steps:
waiting for the memory to start and connecting to the shared memory; if the connection fails, the data node writes a connection log locally for engineering personnel to troubleshoot, and the procedure ends; if the connection succeeds, the data node starts the checker;
checking whether the configuration file content of the data node exists; if it does not exist, recording the failure reason locally, sending fault information to engineering personnel, and ending the procedure;
if it exists, reading the configuration file content of the data node and storing it into the GPU information list cache of the working node, so as to start the Web server for blocking listening and wait to receive and verify the reported information transmitted by the GPU sniffer; if the verification fails, discarding the reported information and writing the reporting time and the error into the data plane log; if the verification passes, sending the reported information to the memory so that the memory compares the reported information with the database of the data plane;
if the comparison shows that the reported information is new data, the memory stores the reported information immediately; otherwise, it compares whether each field in the reported information is consistent with the corresponding field of the corresponding database record;
if they are consistent, the memory does nothing; if they are inconsistent, the memory updates each field of the corresponding database record to the corresponding field in the reported information and writes the update result into the data plane log;
and waiting again to receive and verify the reported information transmitted by the GPU sniffer.
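The memory's comparison against the data plane database can be sketched with a plain dictionary standing in for the database; the keying by node name and the log format are assumptions for the example.

```python
import logging

logging.basicConfig(filename="data_plane.log", level=logging.INFO)
database = {}   # stand-in for the data plane database, keyed by reporting node

def store_or_update(report):
    """Store new reports directly; otherwise update only the fields that actually changed."""
    node = report["node"]
    if node not in database:
        database[node] = dict(report)                 # new data: the memory stores it immediately
        return
    record = database[node]
    changed = {k: v for k, v in report.items() if record.get(k) != v}
    if not changed:
        return                                        # every field consistent: the memory does nothing
    record.update(changed)                            # overwrite the stale fields with the reported values
    logging.info("updated %s: %s", node, changed)     # write the update result into the data plane log
```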
With reference to the second aspect, further, the configuration file content of the data node includes the IP address and port number of the data node;
verifying the reported information includes checking whether the reporting node is an internal node of the distributed cluster, whether the reporting node has the authority to report information, whether the format of the reported information is standard, and whether the internal fields of the reported information are legal;
whether a node belongs to the distributed cluster and whether it has information reporting authority are stored in advance in the database of the data plane.
With reference to the first aspect or the second aspect, further, the number of working nodes is scaled up or down according to actual production but is at least greater than one, and each working node contains a GPU sniffer; the data plane consists of no fewer than 3 data nodes, and each data node contains a memory and a checker.
In a third aspect, the present invention provides a multi-mode distributed cluster GPU index detection system, the system comprising:
a configuration file content checking module: used for checking whether the configuration file content of the working node exists;
a mode switching module: used for reading the mode value from the environment variables of the working node and switching the working node's own working mode according to the mode value;
a reading module: used for reading the timer frequency from the environment variables of the working node so as to set, with the timer frequency, the time period for reading the GPU information parameters; and further used for reading the number of GPUs and the GPU information parameters of the working node and storing them into the GPU information list cache;
a calculation and scoring module: used for calculating the maximum value of each GPU information parameter of the working node and storing the maximum values into the GPU information list cache; and further used for calculating the GPU performance scores of the working node under different working modes according to the GPU information parameter values and the maximum values of the GPU information parameters, and setting the GPU with the highest performance score as the MainGPU;
an information reporting module: used for initializing the sending information, judging whether a GPU exists in the GPU information list cache of the working node, and reporting the information.
In a fourth aspect, the present invention further provides a multi-mode distributed cluster GPU index detection system, the system comprising:
a docking waiting module: used for waiting for the memory to start and connecting to the shared memory;
a configuration file content checking module: used for checking whether the configuration file content of the data node exists;
a data comparison module: used for the memory to compare the reported information with the database of the data plane;
an update writing module: used for the memory to update each field of the corresponding database record to the corresponding field in the reported information and to write the update result into the data plane log;
a re-waiting reporting module: used for waiting again to receive and verify the reported information transmitted by the GPU sniffer.
Compared with the prior art, the invention has the following beneficial effects:
the invention realizes the GPU information updating by setting the GPU information list cache and the data plane setting field comparison through the working node, thereby reducing the information reporting frequency and the information transmission cost; the diversity of GPU resources is highlighted through a multi-mode scoring strategy so as to adapt to the GPU computing requirements of more complex scenes.
Drawings
Fig. 1 is an internal architecture diagram of the working nodes and the data plane in the multi-mode distributed cluster GPU index detection method according to an embodiment of the present invention;
Fig. 2 is a workflow diagram of the GPU sniffer in the multi-mode distributed cluster GPU index detection method according to an embodiment of the present invention;
Fig. 3 is a workflow diagram of the checker in the multi-mode distributed cluster GPU index detection method according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
As shown in Fig. 2, the present invention provides a multi-mode distributed cluster GPU index detection method, which comprises the following steps:
Step (1): the GPU sniffer checks whether the configuration file content of the working node exists; the configuration file content of the working node includes the IP address and port number of the data plane; if it exists, the GPU sniffer reads the configuration file content of the working node and stores it into the GPU information list cache so as to communicate with the data plane, and if the communication is normal, step (2) is executed; if the communication fails, the failure reason is recorded locally, fault information is sent to engineering personnel, and after waiting a rated time the GPU sniffer communicates with the data plane again until the communication is normal; if the configuration file content does not exist, the fault reason is recorded locally, fault information is sent to engineering personnel, and the procedure ends.
Step (2), mode switching: the GPU sniffer reads the mode value from the environment variables of the working node and switches its own working mode according to the mode value. Specifically, the modes are divided into a resource priority mode, a high performance mode and an energy saving mode according to whether the mode value exists and what its value is; if the mode value is null, the working mode is switched to the resource priority mode; if the mode value belongs to the high performance mode, the working mode is switched to the high performance mode; and if the mode value belongs to the energy saving mode, the working mode is switched to the energy saving mode.
Step (3), reading the timer frequency and the GPU information: the GPU sniffer reads the timer frequency from the environment variables of the working node so as to set the time period at which the GPU sniffer reads the GPU information parameters.
The effective range of the timer frequency is 0.1 to 10, and the unit of the timer frequency is seconds; if the read timer frequency value is empty, the time period at which the GPU sniffer reads the GPU information parameters is set to 1 second by default, that is, the GPU information parameters of the working node are read again every 1 second; if the read timer frequency value exceeds 10, it is set to 10; if it is lower than 0.1, it is set to 0.1; if it is within the effective range, it is not reset.
The GPU sniffer reads the number of GPUs and the GPU information parameters of the working node and stores them into the GPU information list cache; the GPU information parameters include the GPU identification number, GPU health state, GPU model, GPU working power, GPU video memory frequency, GPU video memory capacity, GPU idle video memory, number of GPU cores and GPU bit width.
If there is no GPU in the GPU information list cache, the GPU information list cache is emptied; if the health state of a GPU is unhealthy, that GPU's information parameters are not added to the GPU information list cache; after step (5) is executed, the timer is reset, the next time period for reading the GPU information parameters is awaited, and step (3) is executed again; the working node asynchronously and concurrently executes step (4).
Step (4), calculating the GPU performance score: the GPU sniffer calculates the maximum value of each GPU information parameter of the working node and stores the maximum values into the GPU information list cache; if the GPU information list cache of the working node is empty, step (5) is executed directly; the GPU performance scores of the working node under different working modes are calculated according to the GPU information parameter values and the maximum values of the GPU information parameters, and the GPU with the highest performance score is set as the MainGPU.
The GPU performance score of the working node under different working modes is calculated as
Score = MemCWeight × (GMemoryClock / MaxMemClock) + CoreWeight × (GCores / MaxCores) + BandWeight × (GBandwidth / MaxBandwidth) + PowWeight × (GPower / MaxPower) + FreeMemWeight × (GFreeMemory / MaxFreeMemory) + MemoryWeight × (GMemory / MaxMemory)
wherein Score is the GPU performance score of the working node under different working modes;
MemCWeight is the GPU video memory frequency weight, CoreWeight is the GPU core number weight, BandWeight is the GPU bit width weight, PowWeight is the GPU working power weight, FreeMemWeight is the GPU idle video memory weight, and MemoryWeight is the GPU video memory capacity weight;
GMemoryClock is the GPU video memory frequency, GCores is the number of GPU cores, GBandwidth is the GPU bit width, GPower is the GPU working power, GMemory is the GPU video memory capacity, and GFreeMemory is the GPU idle video memory;
MaxMemClock is the maximum GPU video memory frequency, MaxCores is the maximum number of GPU cores, MaxBandwidth is the maximum GPU bit width, MaxPower is the maximum GPU working power, MaxMemory is the maximum GPU video memory capacity, and MaxFreeMemory is the maximum GPU idle video memory;
the GPU video memory frequency weight, GPU core number weight, GPU bit width weight, GPU working power weight, GPU idle video memory weight and GPU video memory capacity weight are set according to the different working modes and can be adjusted according to the real state of the current distributed cluster.
Step (5), information reporting: the GPU sniffer initializes the sending information and judges whether a GPU exists in the GPU information list cache of the working node; if not, the information indicating that no GPU exists is packaged, the configuration file content cached in step (1) is loaded, and the packaged information is sent to the data plane for reporting; if a GPU exists, the information parameters of the MainGPU and their corresponding values are added to the fields of the sending information, the number of GPUs and the sum of the GPU video memory capacity are calculated and added to the fields of the sending information, the configuration file content cached in step (1) is loaded, the sending information is sent to the data plane for reporting so that the checker can receive and verify it, and the sending information is stored into the GPU information list cache.
Step (6): when information reporting is executed next time, the GPU sniffer compares each field of the sending information in the GPU information list cache with each field of the newly generated sending information; if all fields are the same, no information is reported; otherwise, the sending information in the GPU information list cache is overwritten with the newly generated sending information, the configuration file content cached in step (1) is loaded, and the newly generated sending information is sent to the data plane to be received and verified by the checker.
As shown in Fig. 3, the present invention further provides a multi-mode distributed cluster GPU index detection method, which comprises the following steps:
The checker waits for the memory to start and connects to the shared memory; if the connection fails, the data node writes a connection log locally for engineering personnel to troubleshoot, and the procedure ends; if the connection succeeds, the data node starts the checker.
The checker checks whether the configuration file content of the data node exists; the configuration file content of the data node includes the IP address and port number of the data node.
If it does not exist, the checker records the failure reason locally, sends fault information to engineering personnel, and the procedure ends.
If it exists, the checker reads the configuration file content of the data node and stores it into the GPU information list cache of the working node so as to start the Web server for blocking listening, so that the checker waits to receive and verify the reported information transmitted by the GPU sniffer; verifying the reported information includes checking whether the reporting node is an internal node of the distributed cluster, whether the reporting node has the authority to report information, whether the format of the reported information is standard, and whether the internal fields of the reported information are legal; whether a node belongs to the distributed cluster and whether it has information reporting authority are stored in advance in the database of the data plane.
If the verification fails, the checker discards the reported information and writes the reporting time and the error into the data plane log; if the verification passes, the checker sends the reported information to the memory so that the memory compares the reported information with the database of the data plane.
If the comparison shows that the reported information is new data, the memory stores the reported information immediately; otherwise, it compares whether each field in the reported information is consistent with the corresponding field of the corresponding database record.
If they are consistent, the memory does nothing; if they are inconsistent, the memory updates each field of the corresponding database record to the corresponding field in the reported information and writes the update result into the data plane log.
The checker then waits again to receive and verify the reported information transmitted by the GPU sniffer.
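A minimal sketch of the checker's blocking Web server, written with only the Python standard library, is shown below; the endpoint, the permission table and the required field list are assumptions for the example and are not prescribed by the invention.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

ALLOWED_NODES = {"worker-1", "worker-2"}    # nodes known to belong to the cluster and allowed to report (illustrative)
REQUIRED_FIELDS = {"node", "gpu_present"}   # minimal field set assumed for this example

def verify(report):
    """Check cluster membership, reporting authority, format and field legality of a report."""
    if not isinstance(report, dict) or not REQUIRED_FIELDS <= report.keys():
        return False                        # format not standard or fields missing
    return report.get("node") in ALLOWED_NODES   # internal node with reporting authority

def hand_to_memory(report):
    """Stand-in for passing the verified report on to the memory component."""
    print("verified report from", report["node"])

class CheckerHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            report = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            report = None
        if report is None or not verify(report):
            self.send_response(400)         # discard the report; the caller logs the time and the error
        else:
            hand_to_memory(report)
            self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Blocking listening on the data node; address and port are assumptions.
    HTTPServer(("0.0.0.0", 8080), CheckerHandler).serve_forever()
```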
As shown in Fig. 1, the number of working nodes is scaled up or down according to actual production but is at least greater than one; each working node contains a GPU sniffer and GPUs; the data plane consists of no fewer than 3 data nodes, and each data node contains a memory and a checker.
The data plane realizes load balancing of data node traffic through the heartbeat detection and virtual IP techniques of existing cloud computing technology, thereby guaranteeing the high availability of the multi-mode distributed cluster GPU index detection system and avoiding single points of failure; at the same time, the data plane achieves consistency of the data across the data nodes by means of shared memory (Shared Memory).
The embodiment of the invention realizes GPU information updating by setting a GPU information list cache on the working node and comparing set fields on the data plane, thereby reducing the information reporting frequency and the cost of information transmission; the diversity of GPU resources is highlighted through a multi-mode scoring strategy so as to adapt to the GPU computing requirements of more complex scenarios.
The embodiment of the invention also provides a multi-mode distributed cluster GPU index detection system, which comprises:
a configuration file content checking module: used for checking whether the configuration file content of the working node exists;
a mode switching module: used for reading the mode value from the environment variables of the working node and switching the working node's own working mode according to the mode value;
a reading module: used for reading the timer frequency from the environment variables of the working node so as to set, with the timer frequency, the time period for reading the GPU information parameters; and further used for reading the number of GPUs and the GPU information parameters of the working node and storing them into the GPU information list cache;
a calculation and scoring module: used for calculating the maximum value of each GPU information parameter of the working node and storing the maximum values into the GPU information list cache; and further used for calculating the GPU performance scores of the working node under different working modes according to the GPU information parameter values and the maximum values of the GPU information parameters, and setting the GPU with the highest performance score as the MainGPU;
an information reporting module: used for initializing the sending information, judging whether a GPU exists in the GPU information list cache of the working node, and reporting the information.
The embodiment of the invention also provides a multi-mode distributed cluster GPU index detection system, which comprises:
a docking waiting module: used for waiting for the memory to start and connecting to the shared memory;
a configuration file content checking module: used for checking whether the configuration file content of the data node exists;
a data comparison module: used for the memory to compare the reported information with the database of the data plane;
an update writing module: used for the memory to update each field of the corresponding database record to the corresponding field in the reported information and to write the update result into the data plane log;
a re-waiting reporting module: used for waiting again to receive and verify the reported information transmitted by the GPU sniffer.
Embodiments of the present invention also provide a multi-mode distributed cluster GPU index detection system comprising one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing the multi-mode distributed cluster GPU index detection method described above.
Embodiments of the present invention also provide a computer-readable storage medium storing one or more programs, wherein the one or more programs include instructions which, when executed by a processor, implement the steps of the multi-mode distributed cluster GPU index detection method described above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A multi-mode distributed cluster GPU index detection method, characterized by comprising the following steps:
Step (1): checking whether the configuration file content of the working node exists; if it exists, reading the configuration file content of the working node and storing it into a GPU information list cache so as to communicate with the data plane, and if the communication is normal, executing step (2); if the communication fails, recording the failure reason locally, sending fault information to engineering personnel, waiting a rated time, and communicating with the data plane again until the communication is normal; if the configuration file content does not exist, recording the fault reason locally, sending fault information to engineering personnel, and ending the procedure;
Step (2): reading the mode value from the environment variables of the working node, and switching the working node's own working mode according to the mode value;
Step (3): reading the timer frequency from the environment variables of the working node so as to set, with the timer frequency, the time period for reading the GPU information parameters; reading the number of GPUs and the GPU information parameters of the working node and storing them into the GPU information list cache; if there is no GPU in the GPU information list cache, emptying the GPU information list cache; if the health state of a GPU is unhealthy, not adding that GPU's information parameters to the GPU information list cache; after step (5) is executed, resetting the timer, waiting for the next time period for reading the GPU information parameters, and executing step (3) again; the working node asynchronously and concurrently executes step (4);
Step (4): calculating the maximum value of each GPU information parameter of the working node and storing the maximum values into the GPU information list cache; if the GPU information list cache of the working node is empty, directly executing step (5); calculating the GPU performance scores of the working node under different working modes according to the GPU information parameter values and the maximum values of the GPU information parameters, and setting the GPU with the highest performance score as the MainGPU;
Step (5): initializing the sending information, and judging whether a GPU exists in the GPU information list cache of the working node; if not, packaging the information indicating that no GPU exists, loading the configuration file content cached in step (1), and sending the packaged information to the data plane for reporting; if a GPU exists, adding the information parameters of the MainGPU and their corresponding values to the fields of the sending information, calculating the number of GPUs and the sum of the GPU video memory capacity and adding them to the fields of the sending information, loading the configuration file content cached in step (1), sending the sending information to the data plane for reporting so that the checker can receive and verify it, and storing the sending information into the GPU information list cache;
Step (6): when information reporting is executed next time, comparing each field of the sending information in the GPU information list cache with each field of the newly generated sending information; if all fields are the same, not reporting; otherwise, overwriting the sending information in the GPU information list cache with the newly generated sending information, loading the configuration file content cached in step (1), and sending the newly generated sending information to the data plane to be received and verified by the checker.
2. The method according to claim 1, wherein the configuration file content of the working node in step (1) includes an IP address and a port number of the data plane.
3. The method according to claim 1, wherein in step (2), the modes are divided into a resource priority mode, a high performance mode and an energy saving mode according to whether the mode value exists and what its value is; if the mode value is null, the working node switches its own working mode to the resource priority mode; if the mode value belongs to the high performance mode, it switches its working mode to the high performance mode; and if the mode value belongs to the energy saving mode, it switches its working mode to the energy saving mode.
4. The method according to claim 1, wherein in step (3), the effective range of the timer frequency is 0.1 to 10, and the unit of the timer frequency is seconds;
if the read timer frequency value is empty, the time period at which the GPU sniffer reads the GPU information parameters is set to 1 second by default, that is, the GPU information parameters of the working node are read again every 1 second; if the read timer frequency value exceeds 10, it is set to 10; if it is lower than 0.1, it is set to 0.1; if it is within the effective range, it is not reset;
the GPU information parameters comprise a GPU identification number, a GPU health state, a GPU model, GPU working power, GPU video memory frequency, GPU video memory capacity, GPU idle video memory, GPU core number and GPU bit width.
5. The method according to claim 1, wherein the GPU performance score of the working node under different working modes in step (4) is calculated as
Score = MemCWeight × (GMemoryClock / MaxMemClock) + CoreWeight × (GCores / MaxCores) + BandWeight × (GBandwidth / MaxBandwidth) + PowWeight × (GPower / MaxPower) + FreeMemWeight × (GFreeMemory / MaxFreeMemory) + MemoryWeight × (GMemory / MaxMemory)
wherein Score is the GPU performance score of the working node under different working modes;
MemCWeight is the GPU video memory frequency weight, CoreWeight is the GPU core number weight, BandWeight is the GPU bit width weight, PowWeight is the GPU working power weight, FreeMemWeight is the GPU idle video memory weight, and MemoryWeight is the GPU video memory capacity weight;
GMemoryClock is the GPU video memory frequency, GCores is the number of GPU cores, GBandwidth is the GPU bit width, GPower is the GPU working power, GMemory is the GPU video memory capacity, and GFreeMemory is the GPU idle video memory;
MaxMemClock is the maximum GPU video memory frequency, MaxCores is the maximum number of GPU cores, MaxBandwidth is the maximum GPU bit width, MaxPower is the maximum GPU working power, MaxMemory is the maximum GPU video memory capacity, and MaxFreeMemory is the maximum GPU idle video memory;
the GPU video memory frequency weight, GPU core number weight, GPU bit width weight, GPU working power weight, GPU idle video memory weight and GPU video memory capacity weight are set according to the different working modes and can be adjusted according to the real state of the current distributed cluster.
6. A multi-mode distributed cluster GPU index detection method, characterized by comprising the following steps:
waiting for the memory to start and connecting to the shared memory; if the connection fails, the data node writes a connection log locally for engineering personnel to troubleshoot, and the procedure ends; if the connection succeeds, the data node starts the checker;
checking whether the configuration file content of the data node exists; if it does not exist, recording the failure reason locally, sending fault information to engineering personnel, and ending the procedure;
if it exists, reading the configuration file content of the data node and storing it into the GPU information list cache of the working node, so as to start the Web server for blocking listening and wait to receive and verify the reported information transmitted by the GPU sniffer; if the verification fails, discarding the reported information and writing the reporting time and the error into the data plane log; if the verification passes, sending the reported information to the memory so that the memory compares the reported information with the database of the data plane;
if the comparison shows that the reported information is new data, the memory stores the reported information immediately; otherwise, it compares whether each field in the reported information is consistent with the corresponding field of the corresponding database record;
if they are consistent, the memory does nothing; if they are inconsistent, the memory updates each field of the corresponding database record to the corresponding field in the reported information and writes the update result into the data plane log;
and waiting again to receive and verify the reported information transmitted by the GPU sniffer.
7. The method according to claim 6, wherein the configuration file content of the data node includes the IP address and port number of the data node;
verifying the reported information includes checking whether the reporting node is an internal node of the distributed cluster, whether the reporting node has the authority to report information, whether the format of the reported information is standard, and whether the internal fields of the reported information are legal;
whether a node belongs to the distributed cluster and whether it has information reporting authority are stored in advance in the database of the data plane.
8. The multi-mode distributed cluster GPU index detection method, characterized in that the number of working nodes is scaled up or down according to actual production but is at least greater than one, and a GPU sniffer is arranged in each working node; the data plane consists of no fewer than 3 data nodes, and each data node contains a memory and a checker.
9. A multi-mode distributed cluster GPU index detection system, characterized in that the system comprises:
a configuration file content checking module: used for checking whether the configuration file content of the working node exists;
a mode switching module: used for reading the mode value from the environment variables of the working node and switching the working node's own working mode according to the mode value;
a reading module: used for reading the timer frequency from the environment variables of the working node so as to set, with the timer frequency, the time period for reading the GPU information parameters; and further used for reading the number of GPUs and the GPU information parameters of the working node and storing them into the GPU information list cache;
a calculation and scoring module: used for calculating the maximum value of each GPU information parameter of the working node and storing the maximum values into the GPU information list cache; and further used for calculating the GPU performance scores of the working node under different working modes according to the GPU information parameter values and the maximum values of the GPU information parameters, and setting the GPU with the highest performance score as the MainGPU;
an information reporting module: used for initializing the sending information, judging whether a GPU exists in the GPU information list cache of the working node, and reporting the information.
10. A multi-mode distributed cluster GPU index detection system, characterized in that the system comprises:
a docking waiting module: used for waiting for the memory to start and connecting to the shared memory;
a configuration file content checking module: used for checking whether the configuration file content of the data node exists;
a data comparison module: used for the memory to compare the reported information with the database of the data plane;
an update writing module: used for the memory to update each field of the corresponding database record to the corresponding field in the reported information and to write the update result into the data plane log;
a re-waiting reporting module: used for waiting again to receive and verify the reported information transmitted by the GPU sniffer.
CN202010506445.2A 2020-06-05 2020-06-05 Multi-mode distributed cluster GPU index detection method and system Active CN111736989B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202010506445.2A CN111736989B (en) 2020-06-05 2020-06-05 Multi-mode distributed cluster GPU index detection method and system
PCT/CN2020/110992 WO2021243855A1 (en) 2020-06-05 2020-08-25 Multi-mode distributed cluster gpu indicator detection method and system
US17/369,909 US11734152B2 (en) 2020-06-05 2021-07-07 Method and system for detecting GPU-related factors of multi-mode distributed cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010506445.2A CN111736989B (en) 2020-06-05 2020-06-05 Multi-mode distributed cluster GPU index detection method and system

Publications (2)

Publication Number Publication Date
CN111736989A true CN111736989A (en) 2020-10-02
CN111736989B CN111736989B (en) 2022-10-14

Family

ID=72648351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010506445.2A Active CN111736989B (en) 2020-06-05 2020-06-05 Multi-mode distributed cluster GPU index detection method and system

Country Status (2)

Country Link
CN (1) CN111736989B (en)
WO (1) WO2021243855A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117135151A (en) * 2023-09-01 2023-11-28 摩尔线程智能科技(北京)有限责任公司 Fault detection method of GPU (graphics processing unit) cluster, GPU cluster and electronic equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114584489A (en) * 2022-03-08 2022-06-03 浪潮云信息技术股份公司 Ssh channel-based remote environment information and configuration detection method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446168A (en) * 2016-09-26 2017-02-22 北京赛思信安技术股份有限公司 Oriented distribution data warehouse high efficiency load client end realization method
WO2019233047A1 (en) * 2018-06-07 2019-12-12 国电南瑞科技股份有限公司 Power grid dispatching-based operation and maintenance method
CN110647580A (en) * 2019-09-05 2020-01-03 南京邮电大学 Distributed container cluster mirror image management main node, slave node, system and method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776252B (en) * 2016-12-08 2019-08-02 武汉斗鱼网络科技有限公司 A kind of method and device for evaluating GPU performance
CN108021487B (en) * 2017-11-24 2021-03-26 中国航空工业集团公司西安航空计算技术研究所 GPU (graphics processing Unit) graphic processing performance monitoring and analyzing method
CN110891000B (en) * 2019-11-07 2021-10-26 浪潮(北京)电子信息产业有限公司 GPU bandwidth performance detection method, system and related device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446168A (en) * 2016-09-26 2017-02-22 北京赛思信安技术股份有限公司 Oriented distribution data warehouse high efficiency load client end realization method
WO2019233047A1 (en) * 2018-06-07 2019-12-12 国电南瑞科技股份有限公司 Power grid dispatching-based operation and maintenance method
CN110647580A (en) * 2019-09-05 2020-01-03 南京邮电大学 Distributed container cluster mirror image management main node, slave node, system and method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117135151A (en) * 2023-09-01 2023-11-28 摩尔线程智能科技(北京)有限责任公司 Fault detection method of GPU (graphics processing unit) cluster, GPU cluster and electronic equipment
CN117135151B (en) * 2023-09-01 2024-05-03 摩尔线程智能科技(北京)有限责任公司 Fault detection method of GPU cluster, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2021243855A1 (en) 2021-12-09
CN111736989B (en) 2022-10-14

Similar Documents

Publication Publication Date Title
CN111736989B (en) Multi-mode distributed cluster GPU index detection method and system
AU2011299337B2 (en) Controlled automatic healing of data-center services
CN101976217A (en) Anomaly detection method and system for network processing unit
CN112559407B (en) STP link layer state machine optimization method
US20100131952A1 (en) Assistance In Performing Action Responsive To Detected Event
CN106777126B (en) Data online migration method supporting heterogeneous time sequence database
CN105243004A (en) Failure resource detection method and apparatus
CN110766167B (en) Interactive feature selection method, device and readable storage medium
US20230305880A1 (en) Cluster distributed resource scheduling method, apparatus and device, and storage medium
CN114357495B (en) Prediction machine under-chain aggregation method, device, equipment and medium based on block chain
CN104850394A (en) Management method of distributed application program and distributed system
WO2022120995A1 (en) Device computing power evaluation method and system based on pow consensus mechanism
CN112558875A (en) Data verification method and device, electronic equipment and storage medium
CN110750445A (en) Method, system and equipment for testing high-availability function of YARN component
CN103297264A (en) Cloud platform failure recovery method and system
CN110716875A (en) Concurrency test method based on feedback mechanism in domestic office environment
CN115454958A (en) Data processing method, device, equipment, system and medium based on artificial intelligence
CN114490256A (en) Operation and maintenance monitoring system and method
CN110941535A (en) Hard disk load balancing method
CN113760459A (en) Virtual machine fault detection method, storage medium and virtualization cluster
CN111581034A (en) RAID card fault processing method and device
CN111782141A (en) Data inspection method and device
CN112988243B (en) Equipment switching method and device and computing equipment
CN111124310B (en) Storage system scheduling optimization method and related components
CN110333934A (en) A kind of interface bulk processing method and processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant