CN111949389B - Slurm-based information acquisition method and device, server and computer-readable storage medium

Info

Publication number: CN111949389B
Application number: CN202010802073.8A
Authority: CN
Inventors: 胡梦龙; 张涛; 吕灼恒; 张晋锋; 李斌; 原帅; 袁伟
Original assignee: Zhongke Shuguang International Information Industry Co ltd; Zhongke Sugon Information Industry Chengdu Co ltd; Dawning Information Industry Beijing Co Ltd; Dawning Information Industry Co Ltd
Current assignee: Zhongke Shuguang International Information Industry Co.,Ltd.; ZHONGKE SUGON INFORMATION INDUSTRY CHENGDU Co.,Ltd.; Dawning Information Industry Beijing Co Ltd; Dawning Information Industry Co Ltd
Priority date: 2020-08-11
Filing date: 2020-08-11
Publication date: 2022-02-18
Anticipated expiration: 2040-08-11
Also published as: CN111949389A

Abstract

The application relates to a Slurm-based information acquisition method and device, a server, and a computer-readable storage medium. The method comprises: collecting job running information and cluster node information through a Slurm job scheduling system; and calling a preset plug-in to store the collected job running information and the collected cluster node information separately in a TDengine time-series database through the connection between the preset plug-in and the TDengine time-series database. TDengine is a high-performance time-series database whose data processing speed is significantly higher than that of general-purpose databases. In addition, a single preset plug-in suffices to store both the job running information and the cluster node information in the TDengine time-series database. This reduces the complexity and operating cost of system development and maintenance and improves the information acquisition performance of the Slurm job scheduling system.

Description

Slurm-based information acquisition method and device, server and computer-readable storage medium

Technical Field

The application relates to the technical field of computers, in particular to an information acquisition method and device based on Slurm, a server and a computer readable storage medium.

Background

With the continuous development of computer technology, the technologies related to the Super Computing Cluster (SCC) are also being iterated and updated. Slurm (Simple Linux Utility for Resource Management) is an open-source, highly fault-tolerant and highly scalable job scheduling system used in supercomputing clusters. Slurm is widely deployed on such clusters, and the various information generated by the Slurm job scheduling system while it runs can be queried and archived.

However, under the demands of high-performance computing and high-throughput computing, supercomputing clusters support more and more nodes and jobs, which places higher requirements on the information acquisition performance of the Slurm job scheduling system. How to improve that performance is therefore an urgent problem to be solved.

Disclosure of Invention

The embodiment of the application provides an information acquisition method, an information acquisition device, a server and a computer readable storage medium based on the Slurm, and the information acquisition performance of a Slurm job scheduling system can be improved.

An information acquisition method based on Slurm comprises the following steps:

collecting job running information and cluster node information through a Slurm job scheduling system;

and calling a preset plug-in to store the collected job running information and the collected cluster node information separately in a TDengine time-series database through the connection between the preset plug-in and the TDengine time-series database.

In the embodiment of the application, job running information and cluster node information are collected through the Slurm job scheduling system. The preset plug-in is then called to store the collected job running information and the collected cluster node information separately in the TDengine time-series database through the connection between the preset plug-in and that database. TDengine is a high-performance time-series database whose data processing speed is significantly higher than that of general-purpose databases. Moreover, with this method a single preset plug-in suffices to store both kinds of information in the TDengine time-series database. A conventional Slurm job scheduling system cannot use one plug-in and one database to store both kinds of information; it needs two plug-ins that call two different general-purpose databases to store the job running information and the cluster node information respectively. Using the same preset plug-in for both therefore reduces the complexity and operating cost of system development and maintenance and improves the information acquisition performance of the Slurm job scheduling system.

In one embodiment, calling the preset plug-in to store the collected job running information in the TDengine time-series database includes:

adding the collected job running information to a global linked list through the preset plug-in;

and calling a pre-created job running information acquisition thread to poll the global linked list and write the job running information in the global linked list into a job running information table in the TDengine database.

In the embodiment of the application, job running information can be collected only when a job's life cycle ends; it cannot be collected in real time. The job running information of completed jobs is therefore collected by means of a global linked list. When each job's life cycle ends, the job running information of that job held in memory is added to the global linked list, and the entry for that job is cleared from the global linked list after the job's life cycle has ended. In this way the complete job running information of the most recent jobs can be selected from memory through the global linked list. The server cluster can then call the pre-created job running information acquisition thread to poll the global linked list and write the job running information it finds into the job running information table in the TDengine database, so that the complete job running information of the most recent jobs ends up in that table.

In one embodiment, adding the collected job running information to the global linked list through the preset plug-in includes:

when the life cycle of each job ends, extracting target job running information from the collected job running information corresponding to the job through the preset plug-in;

and adding the target job running information to the global linked list.

In the embodiment of the application, the memory stores the real-time job running information of every job, so the amount of information is very large. Therefore, when each job's life cycle ends, the target job running information is extracted from the complete job running information of the job held in memory and added to the global linked list. On the one hand, the complete job running information of a job is obtained from memory when its life cycle ends; on the other hand, only the target job running information is selected from it and added to the global linked list, so that only the required information in memory is written to the database and a large amount of irrelevant information is not written. The job running information of the most recent jobs can thus be obtained by reading the global linked list, without wasting system resources on reading irrelevant information from it.

In one embodiment, calling the preset plug-in to store the collected cluster node information in the TDengine time-series database includes:

writing the cluster node information into a cluster node information table in the TDengine database through a pre-created cluster node information acquisition thread.

In the embodiment of the application, the cluster node information differs from the job running information: the server cluster collects and updates cluster node information in real time, whereas job running information can be collected only when a job's life cycle ends. That is why the job running information is collected by means of a global linked list.

Because the cluster node information is collected and updated in real time by the server cluster, the server cluster creates a cluster node information acquisition thread through the preset plug-in, and this thread only needs to read the cluster node information from memory periodically and write it into the cluster node information table in the TDengine database. Using the same preset plug-in to store both the job running information and the cluster node information in the TDengine time-series database reduces the complexity and operating cost of system development and maintenance and improves the information acquisition performance of the Slurm job scheduling system.

In one embodiment, the cluster node information table includes an energy consumption information table, a network information table and a file system information table; writing the cluster node information into the cluster node information table in the TDengine database through the pre-created cluster node information acquisition thread includes:

writing the collected energy consumption information into the energy consumption information table in the TDengine database at a preset first time period through a pre-created energy consumption information acquisition thread;

writing the collected network information into the network information table in the TDengine database at a preset second time period through a pre-created network information acquisition thread;

and writing the collected file system information into the file system information table in the TDengine database at a preset third time period through a pre-created file system information acquisition thread.

In the embodiment of the application, a separate acquisition thread is created for each kind of cluster node information; each thread periodically collects its own kind of information and writes it into the corresponding information table in the TDengine database. The cluster node information is thus written into the TDengine database by category, which makes later retrieval easier and more efficient.

In one embodiment, the method further comprises:

initializing the TDengine time-series database through the preset plug-in to establish the connection between the preset plug-in and the TDengine time-series database.

In the embodiment of the application, the TDengine time-series database is initialized through the preset plug-in, which establishes the connection between the preset plug-in and the TDengine time-series database. The preset plug-in can then be called to store the collected job running information and the collected cluster node information separately in the TDengine time-series database through that connection. Because the same preset plug-in stores both kinds of information, the complexity and operating cost of system development and maintenance are reduced and the information acquisition performance of the Slurm job scheduling system is improved.

In one embodiment, before initializing the TDengine time-series database through the preset plug-in, the method includes:

when the Slurm main management service is started, checking whether the preset plug-in has been initialized;

if not, initializing the preset plug-in, and creating an information table in the TDengine time-series database through the preset plug-in; the information table includes at least one of a job running information table and a cluster node information table.

In the embodiment of the application, when the Slurm main management service is started, it is checked whether the preset plug-in has been initialized. If it has not, the preset plug-in is initialized and then used to create the information table in the TDengine time-series database. The connection between the preset plug-in and the TDengine time-series database is thereby established through the preset plug-in. The preset plug-in is then called to store the collected job running information and the collected cluster node information separately in the information tables in the TDengine time-series database through that connection.
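The initialize-once check described above can be illustrated with a minimal Python sketch. It is not the plug-in of the application (a real Slurm plug-in would be written in C); the class name `TDenginePluginSketch`, the table names, the column definitions and the injected `execute` callable are all assumptions made for illustration, and the statements only follow TDengine-style SQL in which the first column is a timestamp.

```python
import threading


class TDenginePluginSketch:
    """Hypothetical stand-in for the preset plug-in; not a real Slurm or TDengine API."""

    # Table names and columns are illustrative assumptions.
    TABLE_DDL = {
        "job_info": "CREATE TABLE IF NOT EXISTS job_info "
                    "(ts TIMESTAMP, job_id INT, job_name BINARY(64))",
        "node_energy": "CREATE TABLE IF NOT EXISTS node_energy "
                       "(ts TIMESTAMP, node BINARY(64), watts FLOAT)",
    }

    def __init__(self, execute):
        # `execute` is any callable that runs one SQL statement against the database.
        self._execute = execute
        self._initialized = False
        self._lock = threading.Lock()

    def ensure_initialized(self):
        """Initialize at most once; return True only on the call that did the work."""
        with self._lock:
            if self._initialized:
                return False
            for ddl in self.TABLE_DDL.values():
                self._execute(ddl)
            self._initialized = True
            return True


# Usage: record the statements instead of sending them to a database.
executed = []
plugin = TDenginePluginSketch(executed.append)
first = plugin.ensure_initialized()   # creates the information tables
second = plugin.ensure_initialized()  # no-op: already initialized
```

Passing the database call in as `execute` keeps the sketch runnable without a TDengine server; a real plug-in would hold a database connection instead.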

Therefore, the job running information and the cluster node information can both be stored in the TDengine time-series database with the same preset plug-in, which reduces the complexity and operating cost of system development and maintenance and improves the information acquisition performance of the Slurm job scheduling system.

An information acquisition device based on Slurm, comprising:

an information acquisition module, used for collecting job running information and cluster node information through a Slurm job scheduling system;

and an information storage module, used for calling a preset plug-in to store the collected job running information and the collected cluster node information separately in a TDengine time-series database through the connection between the preset plug-in and the TDengine time-series database.

A server comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to carry out the steps of the above method.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as above.

The Slurm-based information acquisition method and device, the server and the computer-readable storage medium collect job running information and cluster node information through the Slurm job scheduling system. The preset plug-in is called to store the collected job running information and the collected cluster node information separately in the TDengine time-series database through the connection between the preset plug-in and that database. TDengine is a high-performance time-series database whose data processing speed is significantly higher than that of general-purpose databases. Moreover, with this method a single preset plug-in suffices to store both kinds of information in the TDengine time-series database. A conventional Slurm job scheduling system cannot use one plug-in and one database to store both kinds of information; it needs two plug-ins that call two different general-purpose databases to store the job running information and the cluster node information respectively. Using the same preset plug-in for both therefore reduces the complexity and operating cost of system development and maintenance and improves the information acquisition performance of the Slurm job scheduling system.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a diagram of an application environment of an information collection method based on Slurm in one embodiment;

FIG. 2 is a flow diagram of a Slurm-based information collection method in one embodiment;

FIG. 3 is a flowchart of a method for calling the preset plug-in to store the collected job running information in the TDengine time-series database in one embodiment;

FIG. 4 is a flowchart of a method for calling the preset plug-in to store the collected cluster node information in the TDengine time-series database in one embodiment;

FIG. 5 is a flowchart of a method for writing cluster node information into a cluster node information table in a TDengine database at a preset period through a pre-created cluster node information acquisition thread in one embodiment;

FIG. 6 is a block diagram of an embodiment of an information acquisition device based on Slurm;

FIG. 7 is a block diagram of the information storage module of FIG. 6;

FIG. 8 is a schematic diagram of the internal structure of a server in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

It will be understood that, as used herein, the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another.

As shown in FIG. 1, FIG. 1 is an application scenario diagram of a Slurm-based information acquisition method in one embodiment. The application environment includes a server cluster 100, which includes a database 120 and a plurality of servers, such as a server 140, a server 160 and a server 180 (the server 140 is the master server and the others are slave servers). The database 120 may be a TDengine time-series database. The server 140 collects job running information and cluster node information through the Slurm job scheduling system, and calls the preset plug-in to store the collected job running information and the collected cluster node information separately in the TDengine time-series database 120 through the connection between the preset plug-in and the database.

FIG. 2 is a flowchart of a Slurm-based information acquisition method in one embodiment. As shown in FIG. 2, a Slurm-based information acquisition method is provided and applied to a supercomputing cluster, for example a server cluster. The method includes the following steps 220 to 240.

Step 220, collecting job running information and cluster node information through a Slurm job scheduling system.

Slurm (Simple Linux Utility for Resource Management) is a job scheduling system used in supercomputing clusters. The job running information refers to attribute information related to the running of jobs in the Slurm job scheduling system, and includes any one or more of job ID, job name, job directory, input/output, owning account, owning cluster, quality of service, CPU/GPU usage details of the job, job dependencies, job reservation, job script, and the like, which is not limited in this application.

The cluster node information refers to information related to each node (server) in the server cluster, and includes any one or more of energy consumption information, network information, file system information, and the like, which is not limited in this application.

The server cluster collects job running information and cluster node information through the Slurm job scheduling system. Specifically, the master server in the server cluster collects the job running information and the cluster node information through the Slurm job scheduling system and temporarily stores them in memory.

Step 240, calling the preset plug-in to store the collected job running information and the collected cluster node information separately in the TDengine time-series database through the connection between the preset plug-in and the TDengine time-series database.

The TDengine time-series database is an open-source, high-performance time-series database. After the preset plug-in has been initialized, the connection between the preset plug-in and the TDengine time-series database can be established through the preset plug-in. The master server in the server cluster then calls the preset plug-in to store the collected job running information and cluster node information separately in the TDengine time-series database. Specifically, the master server calls the preset plug-in to write the job running information and the cluster node information held in memory into the TDengine time-series database. The preset plug-in may write the two kinds of information at the same or at different time periods, which is not limited in the present application.

A time-series database is a specialized database for storing and managing time-series data, and provides a distributed cloud database service with high-performance reads and writes and strong computing capability for such data. Time-series databases are commonly used for monitoring Internet-of-Things devices and Internet services.

Conventionally, after a server cluster collects job running information and cluster node information through the Slurm job scheduling system and stores them in memory, the server cluster reads the job running information from memory through an Elasticsearch plug-in at the end of a job's life cycle and writes it into an Elasticsearch database. For cluster node information, the server cluster periodically reads it from memory through an InfluxDB plug-in and writes it into an InfluxDB database. Collecting both kinds of information therefore requires installing two databases, Elasticsearch and InfluxDB, and configuring a different plug-in for each. Moreover, because the single-insert performance of InfluxDB is very low, Kafka or other message queue software must be used to implement batch writing, which increases the complexity and operating cost of system development and maintenance.

In one embodiment, step 240, calling the preset plug-in to store the collected job running information in the TDengine time-series database, includes:

step 242, adding the collected job running information to the global linked list through the preset plug-in;

and step 244, calling a pre-created job running information acquisition thread to poll the global linked list and write the job running information in the global linked list into a job running information table in the TDengine database.

Specifically, the master server in the server cluster collects job running information through the Slurm job scheduling system and temporarily stores it in memory; that is, the memory holds the real-time job running information of every job. When each job's life cycle ends, the complete job running information stored for that job is taken from memory and added to the global linked list. The global linked list also resides in memory. In general, after the life cycle of the current job has ended and the thread has read the complete job running information of the job, the entry for that job is cleared from the global linked list.

Because the memory stores the real-time job running information of every job, the amount of information is very large. Therefore, only when a job's life cycle ends is the complete job running information of that job added to the global linked list. Since the list holds only complete job running information, the amount of information is greatly reduced. Clearing a job's entry once its life cycle has ended and the thread has read it ensures that the global linked list has room for the latest job running information. The complete job running information of the most recent jobs can thus be obtained directly by reading the global linked list.

To store the collected job running information in the TDengine time-series database, the server cluster creates a job running information acquisition thread through the preset plug-in. The server cluster then calls this pre-created thread to poll the global linked list; whenever job running information is present on the list, it is written into the TDengine database, specifically into the job running information table. The job running information table is an information table that the server cluster creates in the TDengine database through the plug-in, and it stores the job running information of each job.

In one embodiment, adding the collected job running information to the global linked list through the preset plug-in includes:

when the life cycle of each job ends, extracting target job running information from the collected job running information corresponding to the job through the preset plug-in;

and adding the target job running information to the global linked list.

Specifically, the master server in the server cluster collects job running information through the Slurm job scheduling system and temporarily stores it in memory; that is, the memory holds the real-time job running information of every job. When each job's life cycle ends, the complete job running information stored for that job is taken from memory. Target job running information is then extracted from it; for example, the more important items of the complete job running information are extracted as the target job running information. Finally, the extracted target job running information is added to the global linked list.
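The extraction step can be pictured as selecting a fixed subset of fields from the complete record of a finished job. The Python sketch below is only an illustration under assumptions: the field names in `TARGET_FIELDS` are drawn loosely from the attributes listed for job running information, and the source does not say which fields count as target information.

```python
# Hypothetical choice of "target" fields; the source does not specify them.
TARGET_FIELDS = ("job_id", "job_name", "account", "cluster", "qos")


def extract_target_info(full_record, fields=TARGET_FIELDS):
    """Keep only the selected fields that are present in the complete job record."""
    return {name: full_record[name] for name in fields if name in full_record}


# Complete record of a finished job as it might sit in memory (made-up values).
full_record = {
    "job_id": 42,
    "job_name": "sim",
    "account": "physics",
    "work_dir": "/tmp/run",
    "script": "#!/bin/sh\nsrun ./sim\n",
}
target = extract_target_info(full_record)
```

Here `work_dir` and `script` are dropped, so only the smaller target record would be appended to the global linked list.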

As shown in fig. 3, a method for collecting job running information is provided, which is applied to a server cluster, and includes:

step 320, when each job's life cycle ends, obtaining the stored complete job running information of the job from memory;

step 340, extracting target job running information from the complete job running information of the job, and adding the target job running information to the global linked list;

step 360, creating a job running information acquisition thread through the preset plug-in;

and step 380, parsing the target job running information on the global linked list item by item through the job running information acquisition thread, and writing it into the job running information table in the TDengine database.
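Steps 320 to 380 amount to a producer/consumer arrangement: job-end events append records to a shared list, and one thread polls that list, writes each record out and removes it. The following Python sketch illustrates the pattern only. A `collections.deque` stands in for the global linked list, and `JobInfoCollector`, `on_job_end` and `write_row` are hypothetical names; a real Slurm plug-in would be written in C against the TDengine client library.

```python
import threading
from collections import deque


class JobInfoCollector:
    """Sketch of a polling thread that drains a shared list of finished-job records."""

    def __init__(self, write_row, poll_interval=0.01):
        self._pending = deque()            # stands in for the global linked list
        self._lock = threading.Lock()
        self._write_row = write_row        # e.g. inserts one row into the job table
        self._poll_interval = poll_interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._poll, daemon=True)

    def start(self):
        self._thread.start()

    def on_job_end(self, target_info):
        """Called when a job's life cycle ends: append its target record."""
        with self._lock:
            self._pending.append(target_info)

    def _drain(self):
        # Remove records one at a time so the lock is not held during the write.
        while True:
            with self._lock:
                if not self._pending:
                    return
                record = self._pending.popleft()
            self._write_row(record)

    def _poll(self):
        while not self._stop.is_set():
            self._drain()
            self._stop.wait(self._poll_interval)
        self._drain()  # flush records added just before stop()

    def stop(self):
        self._stop.set()
        self._thread.join()


# Usage: collect written rows in a list instead of a database table.
written = []
collector = JobInfoCollector(written.append)
collector.start()
collector.on_job_end({"job_id": 1})
collector.on_job_end({"job_id": 2})
collector.stop()
```

Removing each record as it is written mirrors the clearing of entries from the global linked list once the thread has read them.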

In one embodiment, calling the preset plug-in to store the collected cluster node information in the TDengine time-series database includes:

writing the cluster node information into a cluster node information table in the TDengine database at a preset period through a pre-created cluster node information acquisition thread.

Specifically, the master server in the server cluster collects cluster node information through the Slurm job scheduling system and temporarily stores it in memory. Because the preset plug-in has been configured for the TDengine time-series database in advance, the connection between the preset plug-in and the TDengine time-series database can be established through the preset plug-in once it has been initialized. Establishing this connection is in effect the process of calling the database through the plug-in. The master server then creates a cluster node information acquisition thread through the preset plug-in; this thread reads the cluster node information from memory at the preset period and writes it into the cluster node information table in the TDengine database.

As shown in fig. 4, a cluster node information collecting method is provided, which is applied to a server cluster, and includes:

step 420, creating a cluster node information acquisition thread through the preset plug-in;

and step 440, periodically reading the cluster node information from memory through the thread, and writing the cluster node information into the cluster node information table in the TDengine database.
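Steps 420 and 440 describe a thread that wakes at a fixed period, reads the current node information from memory and writes it out. A minimal Python sketch of that loop follows, with assumptions stated up front: `start_node_info_thread`, `read_snapshot` and `write_row` are hypothetical names, the in-memory data is a plain dictionary, and the example stops itself after three writes so that it terminates.

```python
import threading


def start_node_info_thread(read_snapshot, write_row, period_s, stop_event):
    """Start a thread that writes one snapshot per period until stop_event is set."""
    def loop():
        while not stop_event.is_set():
            write_row(read_snapshot())
            stop_event.wait(period_s)  # sleeps, but returns early once stopped

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


# Usage with made-up node data held "in memory".
node_memory = {"node01": {"watts": 310.0}}
rows = []
stop = threading.Event()


def write_row(row):
    rows.append(row)       # a real plug-in would insert into the node table
    if len(rows) == 3:
        stop.set()         # end the demonstration after three periods


thread = start_node_info_thread(lambda: dict(node_memory), write_row, 0.01, stop)
thread.join(timeout=5)
```

Waiting on the event rather than sleeping lets the thread stop promptly when the service shuts down.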

In the embodiment of the application, the cluster node information differs from the job running information: the server cluster collects and updates cluster node information in real time, whereas job running information can be collected only when a job's life cycle ends. That is why the job running information is collected by means of a global linked list, while the cluster node information is read periodically by a thread.

In one embodiment, the cluster node information table includes an energy consumption information table, a network information table and a file system information table. As shown in FIG. 5, writing the cluster node information into the cluster node information table in the TDengine database at a preset period through the pre-created cluster node information acquisition thread includes:

step 520, writing the collected energy consumption information into the energy consumption information table in the TDengine database at a preset first time period through a pre-created energy consumption information acquisition thread;

step 540, writing the collected network information into the network information table in the TDengine database at a preset second time period through a pre-created network information acquisition thread;

and step 560, writing the collected file system information into the file system information table in the TDengine database at a preset third time period through a pre-created file system information acquisition thread.

Specifically, the cluster node information refers to information related to each node (server) in the server cluster, and includes any one or more of energy consumption information, network information, file system information, and the like, which is not limited in this application. Therefore, the server cluster creates a corresponding information table in the TDengine database in advance according to different categories of cluster node information through the plug-in. For example, an energy consumption information table is created in a TDengine database for energy consumption information; establishing a network information table in a TDengine database aiming at the network information; and creating a file system information table in a TDengine database aiming at the file system information.
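As an illustration of creating one information table per category, the following Python sketch builds the corresponding `CREATE TABLE` statements. The table names, column names, and column types are assumptions made for this example; the application does not specify a schema. The only fixed point is that a TDengine table takes a timestamp as its first column.

```python
# Hypothetical schema: one table per category of cluster node information.
TABLE_COLUMNS = {
    "energy_info": [
        ("temperature", "FLOAT"),
        ("fan_rpm", "INT"),
        ("voltage", "FLOAT"),
        ("power_w", "FLOAT"),
    ],
    "network_info": [
        ("packets_sent", "BIGINT"),
        ("packets_received", "BIGINT"),
        ("bytes_sent", "BIGINT"),
        ("bytes_received", "BIGINT"),
    ],
    "filesystem_info": [
        ("packets_sent", "BIGINT"),
        ("packets_received", "BIGINT"),
        ("bytes_read", "BIGINT"),
        ("bytes_written", "BIGINT"),
    ],
}


def create_table_sql(table, columns):
    """Build a CREATE TABLE statement with the timestamp as first column."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"(ts TIMESTAMP, node BINARY(64), {cols})"
    )


def all_create_statements():
    """Return one statement per category, in declaration order."""
    return [create_table_sql(t, c) for t, c in TABLE_COLUMNS.items()]
```

The statements are returned as strings so that the plug-in can run them through whichever database client it uses; no client call is shown here.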

The energy consumption information includes, but is not limited to, temperature, fan, voltage, and power consumption. The network information mainly consists of IB (InfiniBand) network data, including but not limited to the number of packets sent and received and the size of the data sent and received when nodes communicate with each other while a job is running. The file system information includes the number of packets sent and received, the size of the data sent and received, and the like, when files are read from and written to the file system.

Then, different collection threads are respectively created according to different cluster node information. Specifically, an energy consumption information acquisition thread, a network information acquisition thread, a file system information acquisition thread and other threads are created. And finally, respectively acquiring different information at regular intervals through different acquisition threads, and respectively writing the different information into corresponding information tables in a TDengine database.

Specifically, the acquired energy consumption information is written into the energy consumption information table in the TDengine database at a preset first time period through the pre-created energy consumption information acquisition thread. The acquired network information is written into the network information table in the TDengine database at a preset second time period through the pre-created network information acquisition thread. The acquired file system information is written into the file system information table in the TDengine database at a preset third time period through the pre-created file system information acquisition thread. The first, second, and third time periods may all be the same, may be partially the same, or may all be different, which is not limited in this application.
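The per-category threads and periods can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the plug-in itself: `CollectorSpec` and `run_collectors` are hypothetical names, and a dictionary of lists stands in for the three information tables in the TDengine database.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CollectorSpec:
    table: str                    # destination information table
    period_s: float               # first, second, or third time period
    collect: Callable[[], dict]   # returns one sample for this category


def run_collectors(
    specs: List[CollectorSpec],
    tables: Dict[str, list],
    lock: threading.Lock,
    stop_event: threading.Event,
) -> List[threading.Thread]:
    """Start one acquisition thread per category, each with its own period."""

    def loop(spec: CollectorSpec) -> None:
        while not stop_event.is_set():
            row = spec.collect()
            with lock:
                tables.setdefault(spec.table, []).append(row)
            stop_event.wait(spec.period_s)

    threads = [
        threading.Thread(target=loop, args=(spec,), daemon=True) for spec in specs
    ]
    for thread in threads:
        thread.start()
    return threads
```

Because each specification carries its own `period_s`, the three periods may be equal or different without any change to the loop.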

In one embodiment, a Slurm-based information acquisition method is provided, and the method further includes:

The TDengine time sequence database is an open-source and high-performance time sequence database. And a preset plug-in is configured for the TDengine time sequence database in advance, and after the preset plug-in is initialized, the TDengine time sequence database is initialized through the preset plug-in. And then, a connection relation between the preset plug-in and the TDengine time sequence database can be established through the preset plug-in.

In one embodiment, before initializing the TDengine timing database through the preset plug-in, the method includes:

under the condition that Slurm main management service is started, whether a preset plug-in is initialized or not is checked;

if not, initializing a preset plug-in, and creating an information table in a TDengine time sequence database through the preset plug-in; the information table includes at least one of a job operation information table and a cluster node information table.

Specifically, a preset plug-in is configured for the TDengine time sequence database in advance, and after the preset plug-in is initialized, a connection relationship between the preset plug-in and the TDengine time sequence database can be established through the preset plug-in. A main server in the server cluster then calls the preset plug-in to store the collected job running information and the cluster node information into the TDengine time sequence database respectively. Therefore, when the Slurm main management service is started, it is necessary to check whether the preset plug-in has been initialized, so that the connection relationship between the preset plug-in and the TDengine time sequence database can be established through the preset plug-in and the subsequent step of writing information into the TDengine time sequence database can be carried out.

When the Slurm main management service is started, whether the preset plug-in is initialized is checked. If the preset plug-in is already initialized, the connection relationship between the preset plug-in and the TDengine time sequence database can be established through the preset plug-in. If the preset plug-in is not initialized, the preset plug-in is initialized, and after initialization an information table is created in the TDengine time sequence database through the preset plug-in, wherein the information table comprises at least one of a job running information table and a cluster node information table. The job running information and the cluster node information are then stored in the job running information table and the cluster node information table in the TDengine time sequence database respectively.
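The start-up check can be sketched as follows. This Python sketch only mirrors the order of operations described above; `FakeTDengine`, `PresetPlugin`, and `on_controller_start` are hypothetical names, and the in-memory class stands in for a real TDengine time sequence database.

```python
class FakeTDengine:
    """Hypothetical in-memory stand-in for the TDengine time sequence database."""

    def __init__(self):
        self.tables = set()
        self.create_calls = 0

    def create_table(self, name):
        self.create_calls += 1
        self.tables.add(name)


class PresetPlugin:
    # Hypothetical table names for the two kinds of information.
    TABLES = ("job_run_info", "cluster_node_info")

    def __init__(self):
        self.initialized = False
        self.connection = None

    def init(self, db):
        """Initialize the plug-in and create the information tables."""
        for name in self.TABLES:
            db.create_table(name)
        self.initialized = True

    def connect(self, db):
        """Establish the connection relationship with the database."""
        self.connection = db


def on_controller_start(plugin, db):
    """Run when the Slurm main management service starts."""
    if not plugin.initialized:
        plugin.init(db)       # only on the first start
    plugin.connect(db)        # always (re)establish the connection
    return plugin
```

Calling `on_controller_start` a second time with the same plug-in skips initialization and table creation and only re-establishes the connection.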

In one embodiment, as shown in fig. 6, there is provided a Slurm-based information collection apparatus 600, comprising:

the information acquisition module 620 is used for collecting job running information and cluster node information through the Slurm job scheduling system;

the information storage module 640 is configured to call the preset plug-in to store the collected job operation information and the collected cluster node information in the TDengine time sequence database respectively according to a connection relationship between the preset plug-in and the TDengine time sequence database.

In one embodiment, as shown in FIG. 7, information storage module 640 includes:

the global linked list unit 642 is used for adding the collected operation information into the global linked list through a preset plug-in;

and the acquisition thread polling unit 644 is configured to call a pre-created job running information acquisition thread to poll the global linked list, and write the job running information in the global linked list into a job running information table in the TDengine database.

In one embodiment, the global linked list unit 642 is further configured to extract target job running information from the collected job running information through a preset plug-in at the end of the life cycle of each job; and adding the target operation running information into the global linked list.
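The cooperation between the global linked list unit and the acquisition thread polling unit can be sketched as below. This is a Python illustration under stated assumptions: a lock-protected `deque` stands in for the global linked list, a plain list stands in for the job running information table, and `TARGET_FIELDS`, `on_job_end`, `poll_once`, and `polling_loop` are hypothetical names chosen for the example.

```python
import threading
from collections import deque

# Hypothetical selection of "target" fields kept from the full job record.
TARGET_FIELDS = ("job_id", "user", "elapsed_s", "exit_code")

_job_list = deque()                 # stand-in for the global linked list
_job_list_lock = threading.Lock()


def on_job_end(raw_record):
    """At the end of a job's life cycle, extract the target fields and enqueue."""
    target = {k: raw_record[k] for k in TARGET_FIELDS if k in raw_record}
    with _job_list_lock:
        _job_list.append(target)


def poll_once(job_table):
    """Drain the list into the job table; return how many records moved."""
    with _job_list_lock:
        pending = list(_job_list)
        _job_list.clear()
    job_table.extend(pending)
    return len(pending)


def polling_loop(job_table, period_s, stop_event):
    """Body of the job running information acquisition thread."""
    while not stop_event.is_set():
        poll_once(job_table)
        stop_event.wait(period_s)
    poll_once(job_table)            # final drain so no record is left behind
```

Copying the pending records out under the lock and writing them afterwards keeps the lock held only briefly, so a job that finishes during a write is not blocked.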

In an embodiment, the information storage module 640 is further configured to write cluster node information into a cluster node information table in the TDengine database according to a preset cycle through a pre-created cluster node information acquisition thread.

In one embodiment, the cluster node information table comprises an energy consumption information table, a network information table and a file system information table; the information storage module 640 is further configured to write the acquired energy consumption information into an energy consumption information table in the TDengine database according to a preset first time period through a pre-created energy consumption information acquisition thread; writing the acquired network information into a network information table in a TDengine database according to a preset second time period through a pre-established network information acquisition thread; and writing the acquired file system information into a file system information table in a TDengine database according to a preset third time period through a pre-established file system information acquisition thread.

In one embodiment, there is provided a Slurm-based information acquisition apparatus 600, further comprising:

the database initialization module is used for initializing the TDengine time sequence database through the preset plug-in so as to establish a connection relationship between the preset plug-in and the TDengine time sequence database;

the preset plug-in initialization module is used for checking whether the preset plug-in is initialized when the Slurm main management service is started;

the information table creating module is used for, if the preset plug-in is not initialized, initializing the preset plug-in and creating an information table in the TDengine time sequence database through the preset plug-in; the information table includes at least one of a job running information table and a cluster node information table.

The division of each module in the information acquisition device based on the Slurm is merely used for illustration, and in other embodiments, the information acquisition device based on the Slurm may be divided into different modules as needed to complete all or part of the functions of the information acquisition device based on the Slurm.

Fig. 8 is a schematic diagram of an internal configuration of a server in one embodiment. As shown in fig. 8, the server includes a processor and a memory connected by a system bus. The processor provides calculation and control capability and supports the operation of the whole server. The memory may include a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The computer program can be executed by the processor to implement the Slurm-based information acquisition method provided in the embodiments of the application. The internal memory provides a cached execution environment for the operating system and the computer program in the non-volatile storage medium. The server may be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or the like.

Each module in the Slurm-based information acquisition device provided in the embodiment of the present application may be implemented in the form of a computer program. The computer program may be run on a terminal or a server. The program modules constituted by the computer program may be stored in the memory of the terminal or the server. When executed by a processor, the computer program performs the steps of the method described in the embodiments of the present application.

The embodiment of the application also provides a computer-readable storage medium, namely one or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the Slurm-based information acquisition method.

The embodiment of the application also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the Slurm-based information acquisition method.

Any reference to memory, storage, database, or other medium used by embodiments of the present application may include non-volatile and/or volatile memory. Suitable non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).

The above examples express only several embodiments of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. An information acquisition method based on Slurm is characterized by comprising the following steps:

calling a preset plug-in to respectively store the collected operation information and the collected cluster node information into a TDengine time sequence database through a connection relation between the preset plug-in and the TDengine time sequence database; when the life cycle of each job is finished, extracting target job running information from the collected job running information corresponding to the job through the preset plug-in, and adding the target job running information into a global linked list; and calling a pre-created job operation information acquisition thread to poll the global linked list, and writing the job operation information in the global linked list into a job operation information table in a TDengine database.

2. The method according to claim 1, wherein the invoking the preset plug-in to store the collected cluster node information in a TDengine timing database includes:

3. The method of claim 2, wherein the cluster node information table comprises an energy consumption information table, a network information table, and a file system information table; writing the cluster node information into a cluster node information table in a TDengine database according to a preset period through a pre-established cluster node information acquisition thread, wherein the cluster node information acquisition thread comprises the following steps:

4. The method of claim 1, further comprising:

5. The method according to claim 4, wherein before the initializing the TDengine timing database by the preset plug-in, the method comprises:

6. An information acquisition device based on Slurm, characterized by comprising:

the information storage module is used for calling the preset plug-in to respectively store the collected job running information and the collected cluster node information into the TDengine time sequence database through the connection relation between the preset plug-in and the TDengine time sequence database; the information storage module is specifically used for extracting target job running information from the collected job running information corresponding to the job through the preset plug-in when the life cycle of each job is finished, adding the target job running information into a global linked list, calling a pre-created job running information acquisition thread to poll the global linked list, and writing the job running information in the global linked list into a job running information table in a TDengine database.

7. The apparatus according to claim 6, wherein the information storage module is further configured to write the cluster node information into a cluster node information table in a TDengine database according to a preset cycle through a pre-created cluster node information collection thread.

8. The apparatus according to claim 6, further comprising a database initialization module configured to initialize the TDengine timing database through the preset plugin, so as to establish a connection relationship between the preset plugin and the TDengine timing database.

9. A server comprising a memory and a processor, the memory having stored thereon a computer program, wherein the computer program, when executed by the processor, causes the processor to perform the steps of the Slurm-based information collection method of any of claims 1 to 5.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for information acquisition based on Slurm as claimed in any one of claims 1 to 5.