CN111736989A - Multi-mode distributed cluster GPU index detection method and system - Google Patents


Info

Publication number
CN111736989A
CN111736989A (application CN202010506445.2A)
Authority
CN
China
Prior art keywords
gpu
information
working
mode
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010506445.2A
Other languages
Chinese (zh)
Other versions
CN111736989B (en)
Inventor
张登银
李俊江
程义
寇英杰
周正
韩文生
康世博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202010506445.2A (granted as CN111736989B)
Priority to PCT/CN2020/110992 (published as WO2021243855A1)
Publication of CN111736989A
Priority to US17/369,909 (granted as US11734152B2)
Application granted
Publication of CN111736989B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system

Abstract

The invention discloses a multi-mode distributed cluster GPU (Graphics Processing Unit) index detection method and system. A GPU sniffer reads the mode value and the timer frequency from the environment variables of the working node, reads the number of GPUs and the GPU information parameters of the working node, calculates the node's GPU performance scores under different working modes, and reports the information. The memory compares the reported information with the database of the data plane and updates each field of the corresponding database record to the corresponding field of the reported information. The checker waits to receive and verify the reported information. By maintaining a GPU information list cache on the working node and comparing set fields on the data plane, the invention keeps the GPU information up to date while reducing the information reporting frequency and the cost of information transmission; a multi-mode scoring strategy highlights the diversity of GPU resources so as to adapt to the GPU computing requirements of more complex scenarios.

Description

Multi-mode distributed cluster GPU index detection method and system
Technical Field
The invention relates to a multi-mode distributed cluster GPU (Graphics Processing Unit) index detection method and system, and belongs to the technical field of cloud computing.
Background
In the field of cloud computing, GPUs (Graphics Processing Units) are used to accelerate the training of machine learning algorithms, and GPU training tasks and workflows are becoming increasingly diverse, placing different requirements on GPU performance indexes. However, most distributed clusters currently detect GPU performance indexes insufficiently: they can only detect the number of graphics cards and cannot detect the fine-grained performance indexes of the cards. As a result, they cannot adapt to the computing requirements of various complex scenarios, and tasks with specific GPU requirements are scheduled onto unsuitable nodes, which lowers the GPU resource utilization of the whole distributed cluster and degrades its overall performance.
GPUs are used more and more frequently in cloud computing services, and tasks that use GPUs keep emerging, which poses challenges for GPU resource scheduling. The rationality of GPU resource scheduling depends on the timeliness of distributed cluster GPU detection: the distributed cluster needs to detect the GPU state in time; otherwise, task distribution inside the cluster becomes unbalanced, the GPU resource scheduling result is affected, and the operating efficiency of the distributed cluster is indirectly reduced.
Disclosure of Invention
The object of the invention is to overcome the deficiencies of the prior art and to provide a multi-mode distributed cluster GPU index detection method and system that reduce the information reporting frequency, reduce the cost of information transmission, and adapt to the GPU computing requirements of more complex scenarios.
To achieve this purpose, the invention adopts the following technical solutions:
In a first aspect, the present invention provides a multi-mode distributed cluster GPU index detection method, the method comprising the following steps:
Step (1): checking whether the configuration file content of the working node exists; if it exists, reading the configuration file content of the working node and storing it into a GPU information list cache so as to communicate with the data plane, and if the communication is normal, executing step (2); if the communication fails, recording the failure reason locally, sending fault information to engineering personnel, waiting a rated time, and communicating with the data plane again until the communication is normal; if the configuration file content does not exist, recording the fault reason locally, sending fault information to engineering personnel, and ending the procedure;
Step (2): reading the mode value from the environment variables of the working node, and switching the working node's own working mode according to the mode value;
Step (3): reading the timer frequency from the environment variables of the working node so as to set, with the timer frequency, the time period for reading the GPU information parameters; reading the number of GPUs and the GPU information parameters of the working node and storing them into the GPU information list cache; if there is no GPU in the GPU information list cache, emptying the GPU information list cache; if the health state of a GPU is unhealthy, not adding that GPU's information parameters to the GPU information list cache; after step (5) is executed, resetting the timer, waiting for the next time period for reading the GPU information parameters, and executing step (3) again; the working node asynchronously and concurrently executes step (4);
Step (4): calculating the maximum value of each GPU information parameter of the working node and storing the maximum values into the GPU information list cache; if the GPU information list cache of the working node is empty, directly executing step (5); calculating the GPU performance scores of the working node under different working modes according to the GPU information parameter values and the maximum values of the GPU information parameters, and setting the GPU with the highest performance score as the MainGPU;
Step (5): initializing the sending information, and judging whether a GPU exists in the GPU information list cache of the working node; if not, packaging the information indicating that no GPU exists, loading the configuration file content cached in step (1), and sending the packaged information to the data plane for reporting; if a GPU exists, adding the information parameters of the MainGPU and their corresponding values to the fields of the sending information, calculating the number of GPUs and the sum of the GPU video memory capacity and adding them to the fields of the sending information, loading the configuration file content cached in step (1), sending the sending information to the data plane for reporting so that the checker can receive and verify it, and storing the sending information into the GPU information list cache;
Step (6): when information reporting is executed next time, comparing each field of the sending information in the GPU information list cache with each field of the newly generated sending information; if all fields are the same, not reporting; otherwise, overwriting the sending information in the GPU information list cache with the newly generated sending information, loading the configuration file content cached in step (1), and sending the newly generated sending information to the data plane to be received and verified by the checker.
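As an illustration of steps (5) and (6), the sketch below assembles the sending information and reports it only when some field differs from the cached copy; the field names and the HTTP/JSON transport are assumptions made for this example and are not fixed by the invention.

```python
import json
import urllib.request

last_sent = None   # cached copy of the previously reported sending information

def build_send_info(node_name, gpus, main_gpu):
    """Assemble the sending information from the MainGPU parameters and the node totals."""
    if not gpus:
        return {"node": node_name, "gpu_present": False}
    return {
        "node": node_name,
        "gpu_present": True,
        "main_gpu": main_gpu,                             # information parameters and values of the MainGPU
        "gpu_count": len(gpus),                           # number of GPUs on the node
        "total_memory": sum(g["memory"] for g in gpus),   # sum of the GPU video memory capacity
    }

def report_if_changed(send_info, data_plane_url):
    """Compare field by field with the cached copy and report only when something differs."""
    global last_sent
    if send_info == last_sent:        # every field identical: skip this report
        return False
    last_sent = send_info             # overwrite the cache with the newly generated sending information
    body = json.dumps(send_info).encode()
    req = urllib.request.Request(data_plane_url, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=3)
    return True
```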
With reference to the first aspect, further, the configuration file content of the working node in step (1) includes the IP address and port number of the data plane.
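For illustration, a minimal sketch of reading this configuration and establishing communication with the data plane is given below; the file path, field names and JSON format are assumptions for the example and are not specified by the invention.

```python
import json
import time
import urllib.request

CONFIG_PATH = "/etc/gpu-sniffer/config.json"   # hypothetical location of the working node's configuration file

def load_config(path=CONFIG_PATH):
    """Read the configuration file content (data plane IP address and port number)."""
    try:
        with open(path) as f:
            return json.load(f)                # e.g. {"data_plane_ip": "10.0.0.2", "data_plane_port": 8080}
    except FileNotFoundError:
        print("fault: configuration file content does not exist")   # record the fault locally
        return None

def wait_for_data_plane(cfg, rated_time=5):
    """Communicate with the data plane, retrying after a rated time until communication is normal."""
    url = f"http://{cfg['data_plane_ip']}:{cfg['data_plane_port']}/ping"
    while True:
        try:
            urllib.request.urlopen(url, timeout=3)
            return True                        # communication is normal: proceed to step (2)
        except OSError as err:
            print(f"communication failed: {err}")   # record the failure reason locally
            time.sleep(rated_time)                  # wait the rated time, then communicate again
```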
With reference to the first aspect, further, in step (2), the modes are divided into a resource priority mode, a high performance mode and an energy saving mode according to whether the mode value exists and what its value is; if the mode value is null, the working node switches its own working mode to the resource priority mode; if the mode value belongs to the high performance mode, it switches its working mode to the high performance mode; and if the mode value belongs to the energy saving mode, it switches its working mode to the energy saving mode.
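A sketch of this mode switch might look as follows; the environment variable name GPU_SNIFFER_MODE and the concrete string values of the modes are assumptions, since only the three modes and the null-value default are specified.

```python
import os

RESOURCE_PRIORITY, HIGH_PERFORMANCE, ENERGY_SAVING = "resource", "performance", "energy"

def select_mode():
    """Map the mode value in the environment variable to one of the three working modes."""
    value = os.environ.get("GPU_SNIFFER_MODE")      # hypothetical variable name
    if not value:                                   # mode value is null: resource priority mode
        return RESOURCE_PRIORITY
    if value == HIGH_PERFORMANCE:                   # mode value belongs to the high performance mode
        return HIGH_PERFORMANCE
    if value == ENERGY_SAVING:                      # mode value belongs to the energy saving mode
        return ENERGY_SAVING
    return RESOURCE_PRIORITY                        # unrecognized values fall back to the default
```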
With reference to the first aspect, further, in step (3), the effective range of the timer frequency is 0.1 to 10, and the unit of the timer frequency is seconds;
if the read timer frequency value is empty, the time period at which the GPU sniffer reads the GPU information parameters is set to 1 second by default, that is, the GPU information parameters of the working node are read again every 1 second; if the read timer frequency value exceeds 10, it is set to 10; if it is lower than 0.1, it is set to 0.1; if it is within the effective range, it is not reset;
the GPU information parameters comprise a GPU identification number, a GPU health state, a GPU model, GPU working power, GPU video memory frequency, GPU video memory capacity, GPU idle video memory, GPU core number and GPU bit width.
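The clamping of the timer frequency and the per-GPU information parameters can be sketched as follows; the environment variable name and the use of a plain dictionary per GPU are illustrative assumptions.

```python
import os

def read_timer_period():
    """Clamp the timer frequency (period in seconds) to the effective range 0.1-10, defaulting to 1."""
    raw = os.environ.get("GPU_SNIFFER_TIMER")       # hypothetical variable name
    if not raw:
        return 1.0                                  # empty value: default period of 1 second
    return min(max(float(raw), 0.1), 10.0)          # values outside the range are pulled back to 0.1 or 10

# One entry of the GPU information list cache, holding the parameters listed above.
example_gpu = {
    "gpu_id": 0,
    "health": "healthy",
    "model": "example-model",
    "power": 250.0,          # GPU working power
    "mem_clock": 7000.0,     # GPU video memory frequency
    "memory": 16384,         # GPU video memory capacity
    "free_memory": 12000,    # GPU idle video memory
    "cores": 3584,           # number of GPU cores
    "bus_width": 256,        # GPU bit width
}
```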
With reference to the first aspect, further, in step (4), the GPU performance score of the working node under different working modes is calculated as
Score = MemCWeight × (GMemoryClock / MaxMemClock) + CoreWeight × (GCores / MaxCores) + BandWeight × (GBandwidth / MaxBandwidth) + PowWeight × (GPower / MaxPower) + FreeMemWeight × (GFreeMemory / MaxFreeMemory) + MemoryWeight × (GMemory / MaxMemory)
wherein Score is the GPU performance score of the working node under different working modes;
MemCWeight is the GPU video memory frequency weight, CoreWeight is the GPU core number weight, BandWeight is the GPU bit width weight, PowWeight is the GPU working power weight, FreeMemWeight is the GPU idle video memory weight, and MemoryWeight is the GPU video memory capacity weight;
GMemoryClock is the GPU video memory frequency, GCores is the number of GPU cores, GBandwidth is the GPU bit width, GPower is the GPU working power, GMemory is the GPU video memory capacity, and GFreeMemory is the GPU idle video memory;
MaxMemClock is the maximum GPU video memory frequency, MaxCores is the maximum number of GPU cores, MaxBandwidth is the maximum GPU bit width, MaxPower is the maximum GPU working power, MaxMemory is the maximum GPU video memory capacity, and MaxFreeMemory is the maximum GPU idle video memory;
the GPU video memory frequency weight, GPU core number weight, GPU bit width weight, GPU working power weight, GPU idle video memory weight and GPU video memory capacity weight are set according to the different working modes and can be adjusted according to the real state of the current distributed cluster.
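A sketch of this score calculation, using the formula above, is given below; the per-mode weight values are purely illustrative assumptions and would in practice be adjusted to the real state of the cluster.

```python
# Illustrative weight tables for the three working modes (not values taken from the invention).
MODE_WEIGHTS = {
    "resource":    {"mem_clock": 0.1, "cores": 0.1, "bus_width": 0.1, "power": 0.1, "free_memory": 0.3, "memory": 0.3},
    "performance": {"mem_clock": 0.2, "cores": 0.3, "bus_width": 0.2, "power": 0.1, "free_memory": 0.1, "memory": 0.1},
    "energy":      {"mem_clock": 0.1, "cores": 0.1, "bus_width": 0.1, "power": 0.4, "free_memory": 0.2, "memory": 0.1},
}

def score_gpu(gpu, maxima, mode):
    """Weighted sum of each parameter value normalized by the node-wide maximum of that parameter."""
    weights = MODE_WEIGHTS[mode]
    return sum(weights[k] * gpu[k] / maxima[k] for k in weights if maxima[k])

def pick_main_gpu(gpus, mode):
    """Compute the per-parameter maxima over the node, score every GPU, and return the best one (the MainGPU)."""
    keys = ["mem_clock", "cores", "bus_width", "power", "free_memory", "memory"]
    maxima = {k: max(g[k] for g in gpus) for k in keys}
    return max(gpus, key=lambda g: score_gpu(g, maxima, mode))
```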
In a second aspect, the present invention further provides a multi-mode distributed cluster GPU index detection method, comprising the following steps:
waiting for the memory to start and connecting to the shared memory; if the connection fails, the data node writes a connection log locally for engineering personnel to troubleshoot, and the procedure ends; if the connection succeeds, the data node starts the checker;
checking whether the configuration file content of the data node exists; if it does not exist, recording the failure reason locally, sending fault information to engineering personnel, and ending the procedure;
if it exists, reading the configuration file content of the data node and storing it into the GPU information list cache of the working node, so as to start the Web server for blocking listening and wait to receive and verify the reported information transmitted by the GPU sniffer; if the verification fails, discarding the reported information and writing the reporting time and the error into the data plane log; if the verification passes, sending the reported information to the memory so that the memory compares the reported information with the database of the data plane;
if the comparison shows that the reported information is new data, the memory stores the reported information immediately; otherwise, it compares whether each field in the reported information is consistent with the corresponding field of the corresponding database record;
if they are consistent, the memory does nothing; if they are inconsistent, the memory updates each field of the corresponding database record to the corresponding field in the reported information and writes the update result into the data plane log;
and waiting again to receive and verify the reported information transmitted by the GPU sniffer.
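The memory's comparison against the data plane database can be sketched with a plain dictionary standing in for the database; the keying by node name and the log format are assumptions for the example.

```python
import logging

logging.basicConfig(filename="data_plane.log", level=logging.INFO)
database = {}   # stand-in for the data plane database, keyed by reporting node

def store_or_update(report):
    """Store new reports directly; otherwise update only the fields that actually changed."""
    node = report["node"]
    if node not in database:
        database[node] = dict(report)                 # new data: the memory stores it immediately
        return
    record = database[node]
    changed = {k: v for k, v in report.items() if record.get(k) != v}
    if not changed:
        return                                        # every field consistent: the memory does nothing
    record.update(changed)                            # overwrite the stale fields with the reported values
    logging.info("updated %s: %s", node, changed)     # write the update result into the data plane log
```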
With reference to the second aspect, further, the configuration file content of the data node includes the IP address and port number of the data node;
verifying the reported information includes checking whether the reporting node is an internal node of the distributed cluster, whether the reporting node has the authority to report information, whether the format of the reported information is standard, and whether the internal fields of the reported information are legal;
whether a node belongs to the distributed cluster and whether it has information reporting authority are stored in advance in the database of the data plane.
With reference to the first aspect or the second aspect, further, the number of working nodes is scaled up or down according to actual production but is at least greater than one, and each working node contains a GPU sniffer; the data plane consists of no fewer than 3 data nodes, and each data node contains a memory and a checker.
In a third aspect, the present invention provides a multi-mode distributed cluster GPU index detection system, the system comprising:
a configuration file content checking module: used for checking whether the configuration file content of the working node exists;
a mode switching module: used for reading the mode value from the environment variables of the working node and switching the working node's own working mode according to the mode value;
a reading module: used for reading the timer frequency from the environment variables of the working node so as to set, with the timer frequency, the time period for reading the GPU information parameters; and further used for reading the number of GPUs and the GPU information parameters of the working node and storing them into the GPU information list cache;
a calculation and scoring module: used for calculating the maximum value of each GPU information parameter of the working node and storing the maximum values into the GPU information list cache; and further used for calculating the GPU performance scores of the working node under different working modes according to the GPU information parameter values and the maximum values of the GPU information parameters, and setting the GPU with the highest performance score as the MainGPU;
an information reporting module: used for initializing the sending information, judging whether a GPU exists in the GPU information list cache of the working node, and reporting the information.
In a fourth aspect, the present invention further provides a multi-mode distributed cluster GPU index detection system, the system comprising:
a docking waiting module: used for waiting for the memory to start and connecting to the shared memory;
a configuration file content checking module: used for checking whether the configuration file content of the data node exists;
a data comparison module: used for the memory to compare the reported information with the database of the data plane;
an update writing module: used for the memory to update each field of the corresponding database record to the corresponding field in the reported information and to write the update result into the data plane log;
a re-waiting reporting module: used for waiting again to receive and verify the reported information transmitted by the GPU sniffer.
Compared with the prior art, the invention has the following beneficial effects:
the invention realizes the GPU information updating by setting the GPU information list cache and the data plane setting field comparison through the working node, thereby reducing the information reporting frequency and the information transmission cost; the diversity of GPU resources is highlighted through a multi-mode scoring strategy so as to adapt to the GPU computing requirements of more complex scenes.
Drawings
Fig. 1 is an internal architecture diagram of the working nodes and the data plane in the multi-mode distributed cluster GPU index detection method according to an embodiment of the present invention;
Fig. 2 is a workflow diagram of the GPU sniffer in the multi-mode distributed cluster GPU index detection method according to an embodiment of the present invention;
Fig. 3 is a workflow diagram of the checker in the multi-mode distributed cluster GPU index detection method according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
As shown in Fig. 2, the present invention provides a multi-mode distributed cluster GPU index detection method, which comprises the following steps:
Step (1): the GPU sniffer checks whether the configuration file content of the working node exists; the configuration file content of the working node includes the IP address and port number of the data plane; if it exists, the GPU sniffer reads the configuration file content of the working node and stores it into the GPU information list cache so as to communicate with the data plane, and if the communication is normal, step (2) is executed; if the communication fails, the failure reason is recorded locally, fault information is sent to engineering personnel, and after waiting a rated time the GPU sniffer communicates with the data plane again until the communication is normal; if the configuration file content does not exist, the fault reason is recorded locally, fault information is sent to engineering personnel, and the procedure ends.
Step (2), mode switching: the GPU sniffer reads the mode value from the environment variables of the working node and switches its own working mode according to the mode value. Specifically, the modes are divided into a resource priority mode, a high performance mode and an energy saving mode according to whether the mode value exists and what its value is; if the mode value is null, the working mode is switched to the resource priority mode; if the mode value belongs to the high performance mode, the working mode is switched to the high performance mode; and if the mode value belongs to the energy saving mode, the working mode is switched to the energy saving mode.
Step (3), reading the timer frequency and the GPU information: the GPU sniffer reads the timer frequency from the environment variables of the working node so as to set the time period at which the GPU sniffer reads the GPU information parameters.
The effective range of the timer frequency is 0.1 to 10, and the unit of the timer frequency is seconds; if the read timer frequency value is empty, the time period at which the GPU sniffer reads the GPU information parameters is set to 1 second by default, that is, the GPU information parameters of the working node are read again every 1 second; if the read timer frequency value exceeds 10, it is set to 10; if it is lower than 0.1, it is set to 0.1; if it is within the effective range, it is not reset.
The GPU sniffer reads the number of GPUs and the GPU information parameters of the working node and stores them into the GPU information list cache; the GPU information parameters include the GPU identification number, GPU health state, GPU model, GPU working power, GPU video memory frequency, GPU video memory capacity, GPU idle video memory, number of GPU cores and GPU bit width.
If there is no GPU in the GPU information list cache, the GPU information list cache is emptied; if the health state of a GPU is unhealthy, that GPU's information parameters are not added to the GPU information list cache; after step (5) is executed, the timer is reset, the next time period for reading the GPU information parameters is awaited, and step (3) is executed again; the working node asynchronously and concurrently executes step (4).
Step (4), calculating the GPU performance score: the GPU sniffer calculates the maximum value of each GPU information parameter of the working node and stores the maximum values into the GPU information list cache; if the GPU information list cache of the working node is empty, step (5) is executed directly; the GPU performance scores of the working node under different working modes are calculated according to the GPU information parameter values and the maximum values of the GPU information parameters, and the GPU with the highest performance score is set as the MainGPU.
The GPU performance score of the working node under different working modes is calculated as
Score = MemCWeight × (GMemoryClock / MaxMemClock) + CoreWeight × (GCores / MaxCores) + BandWeight × (GBandwidth / MaxBandwidth) + PowWeight × (GPower / MaxPower) + FreeMemWeight × (GFreeMemory / MaxFreeMemory) + MemoryWeight × (GMemory / MaxMemory)
wherein Score is the GPU performance score of the working node under different working modes;
MemCWeight is the GPU video memory frequency weight, CoreWeight is the GPU core number weight, BandWeight is the GPU bit width weight, PowWeight is the GPU working power weight, FreeMemWeight is the GPU idle video memory weight, and MemoryWeight is the GPU video memory capacity weight;
GMemoryClock is the GPU video memory frequency, GCores is the number of GPU cores, GBandwidth is the GPU bit width, GPower is the GPU working power, GMemory is the GPU video memory capacity, and GFreeMemory is the GPU idle video memory;
MaxMemClock is the maximum GPU video memory frequency, MaxCores is the maximum number of GPU cores, MaxBandwidth is the maximum GPU bit width, MaxPower is the maximum GPU working power, MaxMemory is the maximum GPU video memory capacity, and MaxFreeMemory is the maximum GPU idle video memory;
the GPU video memory frequency weight, GPU core number weight, GPU bit width weight, GPU working power weight, GPU idle video memory weight and GPU video memory capacity weight are set according to the different working modes and can be adjusted according to the real state of the current distributed cluster.
Step (5), information reporting: the GPU sniffer initializes the sending information and judges whether a GPU exists in the GPU information list cache of the working node; if not, the information indicating that no GPU exists is packaged, the configuration file content cached in step (1) is loaded, and the packaged information is sent to the data plane for reporting; if a GPU exists, the information parameters of the MainGPU and their corresponding values are added to the fields of the sending information, the number of GPUs and the sum of the GPU video memory capacity are calculated and added to the fields of the sending information, the configuration file content cached in step (1) is loaded, the sending information is sent to the data plane for reporting so that the checker can receive and verify it, and the sending information is stored into the GPU information list cache.
Step (6): when information reporting is executed next time, the GPU sniffer compares each field of the sending information in the GPU information list cache with each field of the newly generated sending information; if all fields are the same, no information is reported; otherwise, the sending information in the GPU information list cache is overwritten with the newly generated sending information, the configuration file content cached in step (1) is loaded, and the newly generated sending information is sent to the data plane to be received and verified by the checker.
As shown in Fig. 3, the present invention further provides a multi-mode distributed cluster GPU index detection method, which comprises the following steps:
The checker waits for the memory to start and connects to the shared memory; if the connection fails, the data node writes a connection log locally for engineering personnel to troubleshoot, and the procedure ends; if the connection succeeds, the data node starts the checker.
The checker checks whether the configuration file content of the data node exists; the configuration file content of the data node includes the IP address and port number of the data node.
If it does not exist, the checker records the failure reason locally, sends fault information to engineering personnel, and the procedure ends.
If it exists, the checker reads the configuration file content of the data node and stores it into the GPU information list cache of the working node so as to start the Web server for blocking listening, so that the checker waits to receive and verify the reported information transmitted by the GPU sniffer; verifying the reported information includes checking whether the reporting node is an internal node of the distributed cluster, whether the reporting node has the authority to report information, whether the format of the reported information is standard, and whether the internal fields of the reported information are legal; whether a node belongs to the distributed cluster and whether it has information reporting authority are stored in advance in the database of the data plane.
If the verification fails, the checker discards the reported information and writes the reporting time and the error into the data plane log; if the verification passes, the checker sends the reported information to the memory so that the memory compares the reported information with the database of the data plane.
If the comparison shows that the reported information is new data, the memory stores the reported information immediately; otherwise, it compares whether each field in the reported information is consistent with the corresponding field of the corresponding database record.
If they are consistent, the memory does nothing; if they are inconsistent, the memory updates each field of the corresponding database record to the corresponding field in the reported information and writes the update result into the data plane log.
The checker then waits again to receive and verify the reported information transmitted by the GPU sniffer.
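A minimal sketch of the checker's blocking Web server, written with only the Python standard library, is shown below; the endpoint, the permission table and the required field list are assumptions for the example and are not prescribed by the invention.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

ALLOWED_NODES = {"worker-1", "worker-2"}    # nodes known to belong to the cluster and allowed to report (illustrative)
REQUIRED_FIELDS = {"node", "gpu_present"}   # minimal field set assumed for this example

def verify(report):
    """Check cluster membership, reporting authority, format and field legality of a report."""
    if not isinstance(report, dict) or not REQUIRED_FIELDS <= report.keys():
        return False                        # format not standard or fields missing
    return report.get("node") in ALLOWED_NODES   # internal node with reporting authority

def hand_to_memory(report):
    """Stand-in for passing the verified report on to the memory component."""
    print("verified report from", report["node"])

class CheckerHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            report = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            report = None
        if report is None or not verify(report):
            self.send_response(400)         # discard the report; the caller logs the time and the error
        else:
            hand_to_memory(report)
            self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Blocking listening on the data node; address and port are assumptions.
    HTTPServer(("0.0.0.0", 8080), CheckerHandler).serve_forever()
```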
As shown in Fig. 1, the number of working nodes is scaled up or down according to actual production but is at least greater than one; each working node contains a GPU sniffer and GPUs; the data plane consists of no fewer than 3 data nodes, and each data node contains a memory and a checker.
The data plane realizes load balancing of data node traffic through the heartbeat detection and virtual IP techniques of existing cloud computing technology, thereby guaranteeing the high availability of the multi-mode distributed cluster GPU index detection system and avoiding single points of failure; at the same time, the data plane achieves consistency of the data across the data nodes by means of shared memory (Shared Memory).
The embodiment of the invention realizes GPU information updating by setting a GPU information list cache on the working node and comparing set fields on the data plane, thereby reducing the information reporting frequency and the cost of information transmission; the diversity of GPU resources is highlighted through a multi-mode scoring strategy so as to adapt to the GPU computing requirements of more complex scenarios.
The embodiment of the invention also provides a multi-mode distributed cluster GPU index detection system, which comprises:
a configuration file content checking module: used for checking whether the configuration file content of the working node exists;
a mode switching module: used for reading the mode value from the environment variables of the working node and switching the working node's own working mode according to the mode value;
a reading module: used for reading the timer frequency from the environment variables of the working node so as to set, with the timer frequency, the time period for reading the GPU information parameters; and further used for reading the number of GPUs and the GPU information parameters of the working node and storing them into the GPU information list cache;
a calculation and scoring module: used for calculating the maximum value of each GPU information parameter of the working node and storing the maximum values into the GPU information list cache; and further used for calculating the GPU performance scores of the working node under different working modes according to the GPU information parameter values and the maximum values of the GPU information parameters, and setting the GPU with the highest performance score as the MainGPU;
an information reporting module: used for initializing the sending information, judging whether a GPU exists in the GPU information list cache of the working node, and reporting the information.
The embodiment of the invention also provides a multi-mode distributed cluster GPU index detection system, which comprises:
a docking waiting module: used for waiting for the memory to start and connecting to the shared memory;
a configuration file content checking module: used for checking whether the configuration file content of the data node exists;
a data comparison module: used for the memory to compare the reported information with the database of the data plane;
an update writing module: used for the memory to update each field of the corresponding database record to the corresponding field in the reported information and to write the update result into the data plane log;
a re-waiting reporting module: used for waiting again to receive and verify the reported information transmitted by the GPU sniffer.
Embodiments of the present invention also provide a multi-mode distributed cluster GPU index detection system comprising one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing the multi-mode distributed cluster GPU index detection method described above.
Embodiments of the present invention also provide a computer-readable storage medium storing one or more programs, wherein the one or more programs include instructions which, when executed by a processor, implement the steps of the multi-mode distributed cluster GPU index detection method described above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A multi-mode distributed cluster GPU index detection method, characterized by comprising the following steps:
Step (1): checking whether the configuration file content of the working node exists; if it exists, reading the configuration file content of the working node and storing it into a GPU information list cache so as to communicate with the data plane, and if the communication is normal, executing step (2); if the communication fails, recording the failure reason locally, sending fault information to engineering personnel, waiting a rated time, and communicating with the data plane again until the communication is normal; if the configuration file content does not exist, recording the fault reason locally, sending fault information to engineering personnel, and ending the procedure;
Step (2): reading the mode value from the environment variables of the working node, and switching the working node's own working mode according to the mode value;
Step (3): reading the timer frequency from the environment variables of the working node so as to set, with the timer frequency, the time period for reading the GPU information parameters; reading the number of GPUs and the GPU information parameters of the working node and storing them into the GPU information list cache; if there is no GPU in the GPU information list cache, emptying the GPU information list cache; if the health state of a GPU is unhealthy, not adding that GPU's information parameters to the GPU information list cache; after step (5) is executed, resetting the timer, waiting for the next time period for reading the GPU information parameters, and executing step (3) again; the working node asynchronously and concurrently executes step (4);
Step (4): calculating the maximum value of each GPU information parameter of the working node and storing the maximum values into the GPU information list cache; if the GPU information list cache of the working node is empty, directly executing step (5); calculating the GPU performance scores of the working node under different working modes according to the GPU information parameter values and the maximum values of the GPU information parameters, and setting the GPU with the highest performance score as the MainGPU;
Step (5): initializing the sending information, and judging whether a GPU exists in the GPU information list cache of the working node; if not, packaging the information indicating that no GPU exists, loading the configuration file content cached in step (1), and sending the packaged information to the data plane for reporting; if a GPU exists, adding the information parameters of the MainGPU and their corresponding values to the fields of the sending information, calculating the number of GPUs and the sum of the GPU video memory capacity and adding them to the fields of the sending information, loading the configuration file content cached in step (1), sending the sending information to the data plane for reporting so that the checker can receive and verify it, and storing the sending information into the GPU information list cache;
Step (6): when information reporting is executed next time, comparing each field of the sending information in the GPU information list cache with each field of the newly generated sending information; if all fields are the same, not reporting; otherwise, overwriting the sending information in the GPU information list cache with the newly generated sending information, loading the configuration file content cached in step (1), and sending the newly generated sending information to the data plane to be received and verified by the checker.
2. The method according to claim 1, wherein the configuration file content of the working node in step (1) includes an IP address and a port number of the data plane.
3. The method according to claim 1, wherein in step (2), the modes are divided into a resource priority mode, a high performance mode and an energy saving mode according to whether the mode value exists and what its value is; if the mode value is null, the working node switches its own working mode to the resource priority mode; if the mode value belongs to the high performance mode, it switches its working mode to the high performance mode; and if the mode value belongs to the energy saving mode, it switches its working mode to the energy saving mode.
4. The method according to claim 1, wherein in step (3), the effective range of the timer frequency is 0.1 to 10, and the unit of the timer frequency is seconds;
if the read timer frequency value is empty, the time period at which the GPU sniffer reads the GPU information parameters is set to 1 second by default, that is, the GPU information parameters of the working node are read again every 1 second; if the read timer frequency value exceeds 10, it is set to 10; if it is lower than 0.1, it is set to 0.1; if it is within the effective range, it is not reset;
the GPU information parameters comprise a GPU identification number, a GPU health state, a GPU model, GPU working power, GPU video memory frequency, GPU video memory capacity, GPU idle video memory, GPU core number and GPU bit width.
5. The method according to claim 1, wherein the GPU performance score of the working node under different working modes in step (4) is calculated as
Score = MemCWeight × (GMemoryClock / MaxMemClock) + CoreWeight × (GCores / MaxCores) + BandWeight × (GBandwidth / MaxBandwidth) + PowWeight × (GPower / MaxPower) + FreeMemWeight × (GFreeMemory / MaxFreeMemory) + MemoryWeight × (GMemory / MaxMemory)
wherein Score is the GPU performance score of the working node under different working modes;
MemCWeight is the GPU video memory frequency weight, CoreWeight is the GPU core number weight, BandWeight is the GPU bit width weight, PowWeight is the GPU working power weight, FreeMemWeight is the GPU idle video memory weight, and MemoryWeight is the GPU video memory capacity weight;
GMemoryClock is the GPU video memory frequency, GCores is the number of GPU cores, GBandwidth is the GPU bit width, GPower is the GPU working power, GMemory is the GPU video memory capacity, and GFreeMemory is the GPU idle video memory;
MaxMemClock is the maximum GPU video memory frequency, MaxCores is the maximum number of GPU cores, MaxBandwidth is the maximum GPU bit width, MaxPower is the maximum GPU working power, MaxMemory is the maximum GPU video memory capacity, and MaxFreeMemory is the maximum GPU idle video memory;
the GPU video memory frequency weight, GPU core number weight, GPU bit width weight, GPU working power weight, GPU idle video memory weight and GPU video memory capacity weight are set according to the different working modes and can be adjusted according to the real state of the current distributed cluster.
6. A multi-mode distributed cluster GPU index detection method, characterized by comprising the following steps:
waiting for the memory to start and connecting to the shared memory; if the connection fails, the data node writes a connection log locally for engineering personnel to troubleshoot, and the procedure ends; if the connection succeeds, the data node starts the checker;
checking whether the configuration file content of the data node exists; if it does not exist, recording the failure reason locally, sending fault information to engineering personnel, and ending the procedure;
if it exists, reading the configuration file content of the data node and storing it into the GPU information list cache of the working node, so as to start the Web server for blocking listening and wait to receive and verify the reported information transmitted by the GPU sniffer; if the verification fails, discarding the reported information and writing the reporting time and the error into the data plane log; if the verification passes, sending the reported information to the memory so that the memory compares the reported information with the database of the data plane;
if the comparison shows that the reported information is new data, the memory stores the reported information immediately; otherwise, it compares whether each field in the reported information is consistent with the corresponding field of the corresponding database record;
if they are consistent, the memory does nothing; if they are inconsistent, the memory updates each field of the corresponding database record to the corresponding field in the reported information and writes the update result into the data plane log;
and waiting again to receive and verify the reported information transmitted by the GPU sniffer.
7. The method according to claim 6, wherein the configuration file content of the data node includes the IP address and port number of the data node;
verifying the reported information includes checking whether the reporting node is an internal node of the distributed cluster, whether the reporting node has the authority to report information, whether the format of the reported information is standard, and whether the internal fields of the reported information are legal;
whether a node belongs to the distributed cluster and whether it has information reporting authority are stored in advance in the database of the data plane.
8. The multi-mode distributed cluster GPU index detection method, characterized in that the number of working nodes is scaled up or down according to actual production but is at least greater than one, and a GPU sniffer is arranged in each working node; the data plane consists of no fewer than 3 data nodes, and each data node contains a memory and a checker.
9. A multi-mode distributed cluster GPU index detection system, characterized in that the system comprises:
a configuration file content checking module: used for checking whether the configuration file content of the working node exists;
a mode switching module: used for reading the mode value from the environment variables of the working node and switching the working node's own working mode according to the mode value;
a reading module: used for reading the timer frequency from the environment variables of the working node so as to set, with the timer frequency, the time period for reading the GPU information parameters; and further used for reading the number of GPUs and the GPU information parameters of the working node and storing them into the GPU information list cache;
a calculation and scoring module: used for calculating the maximum value of each GPU information parameter of the working node and storing the maximum values into the GPU information list cache; and further used for calculating the GPU performance scores of the working node under different working modes according to the GPU information parameter values and the maximum values of the GPU information parameters, and setting the GPU with the highest performance score as the MainGPU;
an information reporting module: used for initializing the sending information, judging whether a GPU exists in the GPU information list cache of the working node, and reporting the information.
10. A multi-mode distributed cluster GPU index detection system, characterized in that the system comprises:
a docking waiting module: used for waiting for the memory to start and connecting to the shared memory;
a configuration file content checking module: used for checking whether the configuration file content of the data node exists;
a data comparison module: used for the memory to compare the reported information with the database of the data plane;
an update writing module: used for the memory to update each field of the corresponding database record to the corresponding field in the reported information and to write the update result into the data plane log;
a re-waiting reporting module: used for waiting again to receive and verify the reported information transmitted by the GPU sniffer.
CN202010506445.2A 2020-06-05 2020-06-05 Multi-mode distributed cluster GPU index detection method and system Active CN111736989B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202010506445.2A CN111736989B (en) 2020-06-05 2020-06-05 Multi-mode distributed cluster GPU index detection method and system
PCT/CN2020/110992 WO2021243855A1 (en) 2020-06-05 2020-08-25 Multi-mode distributed cluster gpu indicator detection method and system
US17/369,909 US11734152B2 (en) 2020-06-05 2021-07-07 Method and system for detecting GPU-related factors of multi-mode distributed cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010506445.2A CN111736989B (en) 2020-06-05 2020-06-05 Multi-mode distributed cluster GPU index detection method and system

Publications (2)

Publication Number Publication Date
CN111736989A true CN111736989A (en) 2020-10-02
CN111736989B CN111736989B (en) 2022-10-14

Family

ID=72648351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010506445.2A Active CN111736989B (en) 2020-06-05 2020-06-05 Multi-mode distributed cluster GPU index detection method and system

Country Status (2)

Country Link
CN (1) CN111736989B (en)
WO (1) WO2021243855A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117135151A (en) * 2023-09-01 2023-11-28 摩尔线程智能科技(北京)有限责任公司 Fault detection method of GPU (graphics processing unit) cluster, GPU cluster and electronic equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114584489A (en) * 2022-03-08 2022-06-03 浪潮云信息技术股份公司 Ssh channel-based remote environment information and configuration detection method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446168A (en) * 2016-09-26 2017-02-22 北京赛思信安技术股份有限公司 Oriented distribution data warehouse high efficiency load client end realization method
WO2019233047A1 (en) * 2018-06-07 2019-12-12 国电南瑞科技股份有限公司 Power grid dispatching-based operation and maintenance method
CN110647580A (en) * 2019-09-05 2020-01-03 南京邮电大学 Distributed container cluster mirror image management main node, slave node, system and method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776252B (en) * 2016-12-08 2019-08-02 武汉斗鱼网络科技有限公司 A kind of method and device for evaluating GPU performance
CN108021487B (en) * 2017-11-24 2021-03-26 中国航空工业集团公司西安航空计算技术研究所 GPU (graphics processing Unit) graphic processing performance monitoring and analyzing method
CN110891000B (en) * 2019-11-07 2021-10-26 浪潮(北京)电子信息产业有限公司 GPU bandwidth performance detection method, system and related device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446168A (en) * 2016-09-26 2017-02-22 北京赛思信安技术股份有限公司 Oriented distribution data warehouse high efficiency load client end realization method
WO2019233047A1 (en) * 2018-06-07 2019-12-12 国电南瑞科技股份有限公司 Power grid dispatching-based operation and maintenance method
CN110647580A (en) * 2019-09-05 2020-01-03 南京邮电大学 Distributed container cluster mirror image management main node, slave node, system and method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117135151A (en) * 2023-09-01 2023-11-28 摩尔线程智能科技(北京)有限责任公司 Fault detection method of GPU (graphics processing unit) cluster, GPU cluster and electronic equipment
CN117135151B (en) * 2023-09-01 2024-05-03 摩尔线程智能科技(北京)有限责任公司 Fault detection method of GPU cluster, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2021243855A1 (en) 2021-12-09
CN111736989B (en) 2022-10-14

Similar Documents

Publication Publication Date Title
CN111736989B (en) Multi-mode distributed cluster GPU index detection method and system
AU2011299337B2 (en) Controlled automatic healing of data-center services
CN101976217A (en) Anomaly detection method and system for network processing unit
CN112559407B (en) STP link layer state machine optimization method
US20100131952A1 (en) Assistance In Performing Action Responsive To Detected Event
CN106777126B (en) Data online migration method supporting heterogeneous time sequence database
CN105243004A (en) Failure resource detection method and apparatus
CN110766167B (en) Interactive feature selection method, device and readable storage medium
US20230305880A1 (en) Cluster distributed resource scheduling method, apparatus and device, and storage medium
CN114357495B (en) Prediction machine under-chain aggregation method, device, equipment and medium based on block chain
CN104850394A (en) Management method of distributed application program and distributed system
WO2022120995A1 (en) Device computing power evaluation method and system based on pow consensus mechanism
CN112558875A (en) Data verification method and device, electronic equipment and storage medium
CN110750445A (en) Method, system and equipment for testing high-availability function of YARN component
CN103297264A (en) Cloud platform failure recovery method and system
CN110716875A (en) Concurrency test method based on feedback mechanism in domestic office environment
CN115454958A (en) Data processing method, device, equipment, system and medium based on artificial intelligence
CN114490256A (en) Operation and maintenance monitoring system and method
CN110941535A (en) Hard disk load balancing method
CN113760459A (en) Virtual machine fault detection method, storage medium and virtualization cluster
CN111581034A (en) RAID card fault processing method and device
CN111782141A (en) Data inspection method and device
CN112988243B (en) Equipment switching method and device and computing equipment
CN111124310B (en) Storage system scheduling optimization method and related components
CN110333934A (en) A kind of interface bulk processing method and processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant