CN110177020A - High-performance computing cluster management method based on Slurm - Google Patents

Publication number
CN110177020A
CN110177020A CN201910524257.XA CN201910524257A CN110177020A CN 110177020 A CN110177020 A CN 110177020A CN 201910524257 A CN201910524257 A CN 201910524257A CN 110177020 A CN110177020 A CN 110177020A
Authority
CN
China
Prior art keywords
node
slurm
cluster
calculate
management method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910524257.XA
Other languages
Chinese (zh)
Inventor
赵博颖
郭申
王宇耕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Computer Technology and Applications filed Critical Beijing Institute of Computer Technology and Applications
Priority to CN201910524257.XA priority Critical patent/CN110177020A/en
Publication of CN110177020A publication Critical patent/CN110177020A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0893 Assignment of logical groups to network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/50 Network service management, e.g. ensuring proper service fulfilment according to agreements
    • H04L41/5041 Network service management, e.g. ensuring proper service fulfilment according to agreements characterised by the time relationship between creation and deployment of a service
    • H04L41/5054 Automatic deployment of services triggered by the service manager, e.g. service implementation by automatic configuration of network components
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/2866 Architectures; Arrangements
    • H04L67/30 Profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hardware Redundancy (AREA)

Abstract

The present invention relates to a Slurm-based management method for a high-performance computing cluster, comprising: selecting an arbitrary machine as the control node and using the other machines as compute nodes; obtaining the host names or IP information of all compute nodes in the cluster, and copying the cluster installation package and installation script to each compute node; logging in from the control node to each compute node through the SSH service, and completing the setup and deployment of the cluster environment on the node through the installation script; deploying a control receiving process on the control node, which monitors computing resources and receives the information sent by the compute nodes; running a daemon on each compute node to manage the compute nodes in the cluster, the daemon periodically collecting the node state and on-node information and sending them to the control node through the SSH service; performing coordinated management of the compute nodes and redundant backup nodes; and, based on the Slurm job management mechanism and a node state monitoring process, deploying, monitoring, and allocating the jobs in the queue according to the current state of the cluster system resources.

Description

A Slurm-based high-performance computing cluster management method
Technical field
The present invention relates to data analysis and management technology, and in particular to a management method for a high-performance computing cluster based on the Slurm framework.
Background
In today's information society, the analysis and processing of massive data pose a great challenge to the computing capability of systems. However, the limitations of existing computing platforms prevent computing services from being integrated dynamically, which causes bottlenecks in data storage, data computation, and analysis. Interconnecting multiple computing resources over a network to form a cluster system has proven to be an effective way to address this problem. As cluster scale grows, however, the resulting structural complexity of the cluster topology, the high heterogeneity of the computing environment, and the diversification of applications' non-functional requirements such as performance greatly increase the difficulty of installing, configuring, and managing a cluster system. Cluster systems urgently need to replace the current static, passive management approach with a self-management model that reduces management cost and adapts to changes in the computing environment and to the individualized non-functional requirements of applications.
Summary of the invention
The purpose of the present invention is to provide a Slurm-based high-performance computing cluster management method that solves the above problems of the prior art.
A Slurm-based high-performance computing cluster management method according to the present invention comprises: selecting an arbitrary machine as the control node and using the other machines as compute nodes; obtaining the host names or IP information of all compute nodes in the cluster, and copying the cluster installation package and installation script to each compute node; logging in from the control node to each compute node through the SSH service, and completing the setup and deployment of the cluster environment on the node through the installation script; deploying a control receiving process on the control node, which monitors computing resources and receives the information sent by the compute nodes; running a daemon on each compute node to manage the compute nodes in the cluster, the daemon periodically collecting the node state and on-node information and sending them to the control node through the SSH service; performing coordinated management of the compute nodes and redundant backup nodes; and, based on the Slurm job management mechanism and a node state monitoring process, deploying, monitoring, and allocating the jobs in the queue according to the current state of the cluster system resources.
According to one embodiment of the Slurm-based high-performance computing cluster management method of the present invention, the node fault-tolerance backup method comprises: deploying the Slurmctld daemon on the control node, which periodically reads the Slurm configuration file and the previously saved state information and periodically writes the full controller state to disk; if the control node fails, the incremental change information is written to disk immediately, and the system then switches control nodes according to the backup control node set in advance in the configuration file slurm.conf; and deploying the Slurmd daemon on the compute nodes to monitor the physical compute node and its operating system, the Slurmd process reporting the compute node state through a heartbeat monitoring mechanism.
According to one embodiment of the Slurm-based high-performance computing cluster management method of the present invention, a monitoring daemon is started and maintained on the control node, which interacts with Slurm through an API to start, suspend, and cancel jobs; on the compute nodes, heartbeat messages are sent periodically, according to an agreed heartbeat message protocol, to the health monitoring daemon on the control node to report error information.
According to one embodiment of the Slurm-based high-performance computing cluster management method of the present invention, if an application fails or a job times out, the job is restarted or migrated to a spare compute node; if a physical node goes down, the node is closed and all jobs on it are migrated to a backup compute node.
According to one embodiment of the Slurm-based high-performance computing cluster management method of the present invention, the host names or IP information of all compute nodes in the cluster are obtained from the cluster configuration file slurm.conf.
According to one embodiment of the Slurm-based high-performance computing cluster management method of the present invention, the cluster installation package and installation script are copied to each compute node with the scp command.
According to one embodiment of the Slurm-based high-performance computing cluster management method of the present invention, passwordless SSH login is set up between the cluster control node and each compute node.
According to one embodiment of the Slurm-based high-performance computing cluster management method of the present invention, the domestically produced Galaxy Kylin operating system is installed on all nodes.
According to one embodiment of the Slurm-based high-performance computing cluster management method of the present invention, deploying, monitoring, and allocating the jobs in the queue according to the current state of the cluster system resources, based on the Slurm job management mechanism and the node state monitoring process, comprises the following. Job deployment: the node state monitoring process NodeMonitor on the control node determines the number of available computing resources in the cluster, after which the information about the job to be run is submitted. Job control: jobs are controlled through the scancel command in Slurm; if a higher-priority job arrives, the current job is suspended until the higher-priority job finishes running. Result extraction: after the job finishes executing, its outcome is checked; if it succeeded, the final output is displayed, and if it failed, the cause of the failure is returned.
The present invention proposes a script-based installation approach that deploys a heterogeneous cluster built from nodes of different architectures, greatly simplifying the complex operations of installing and configuring a heterogeneous cluster. On this basis, the present invention provides a management method for heterogeneous clusters with modules for resource management and monitoring, fault-tolerance support, and job management which, through optimized and rational configuration, together provide users with a stable and efficient heterogeneous cluster environment.
Brief description of the drawings
Fig. 1 shows the basic structure of the heterogeneous cluster;
Fig. 2 shows the flow chart of the one-click cluster deployment script;
Fig. 3 shows the interaction diagram of the resource management and monitoring module;
Fig. 4 shows a schematic diagram of the node fault-tolerance management mechanism;
Fig. 5 shows the job deployment flow chart;
Fig. 6 shows the job control flow chart;
Fig. 7 shows the result extraction flow chart.
Detailed description of the embodiments
To make the purpose, content, and advantages of the present invention clearer, specific embodiments of the invention are described in further detail below with reference to the accompanying drawings and examples.
A Slurm-based high-performance computing cluster management method comprises:
1. Multiple computer nodes are selected to build the high-performance cluster system; the hardware architecture of the nodes is not restricted. The domestically produced Galaxy Kylin operating system is installed on all nodes. Fig. 1 shows the basic structure of the heterogeneous cluster.
2. Building and deploying a high-performance cluster system is a complicated process, so a script-based installation approach built on Slurm is provided. Slurm is a highly scalable, fault-tolerant cluster manager and job scheduling system that can be used for clusters with large numbers of compute nodes. Fig. 2 shows the flow chart of the one-click cluster deployment script.
(1) An arbitrary machine is selected as the control node and the other machines serve as compute nodes; passwordless SSH login is set up between the cluster control node and each compute node;
(2) The host names or IP information of all compute nodes in the cluster are obtained from the cluster configuration file slurm.conf, and the cluster installation package and installation script are copied to each compute node with the scp command;
(3) The control node logs in to each compute node through the SSH service, and the installation script completes the setup and deployment of the cluster environment on the node.
This script-based installation approach largely eliminates the tedious operations of cluster installation: the user can configure and install all nodes in the cluster from the control node.
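The three deployment steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patent's actual script: the function names, the `/tmp` destination, and the package and script file names are hypothetical, and the host expansion handles only simple `name[01-03]` ranges rather than the full Slurm hostlist syntax. The sketch builds the `scp` and `ssh` argument lists without running them.

```python
import posixpath
import re


def expand_hostlist(expr):
    """Expand a simple host expression such as 'node[01-03]'; other forms are returned as-is."""
    m = re.fullmatch(r"([^\[\]]*)\[(\d+)-(\d+)\]", expr)
    if not m:
        return [expr]
    prefix, lo, hi = m.group(1), m.group(2), m.group(3)
    width = len(lo)  # preserve zero padding, e.g. 01, 02, 03
    return [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]


def compute_nodes(slurm_conf_text):
    """Collect compute node host names from the NodeName= lines of slurm.conf."""
    hosts = []
    for line in slurm_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        m = re.match(r"NodeName=(\S+)", line, flags=re.IGNORECASE)
        if m and m.group(1).upper() != "DEFAULT":
            for part in m.group(1).split(","):
                hosts.extend(expand_hostlist(part))
    return hosts


def deploy_commands(host, package, script, dest="/tmp"):
    """Return the scp and ssh argument lists for one node (passwordless SSH assumed)."""
    remote_script = posixpath.join(dest, posixpath.basename(script))
    return [
        ["scp", package, script, f"{host}:{dest}/"],  # step (2): copy package and script
        ["ssh", host, "bash", remote_script],         # step (3): run the installer remotely
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` on the control node; keeping command construction separate from execution makes the plan easy to inspect before anything is changed on the nodes.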
3. The high-performance computing cluster management method proposed by the present invention is implemented mainly through a resource management and monitoring module, a fault-tolerance support module, a job management module, and so on.
(1) Resource management and monitoring module
Fig. 3 shows the interaction diagram of the resource management and monitoring module. As shown in Fig. 3, the resource management and monitoring module consolidates the entire cluster system and its network resources into a shared, configurable resource pool, and provides flexible monitoring and management, high-availability guarantees, and automated operations and maintenance for all computing resources in the system.
The present invention obtains the status of the key components of each node in the cluster in real time. A control receiving process is deployed on the control node to monitor computing resources and receive the information sent by the compute nodes. A daemon named monitor runs on each compute node to manage the compute nodes in the cluster; it periodically collects the node state and information about key on-node components such as CPU, memory, network, and disk, and sends it to the control node through the SSH service.
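A minimal sketch of what one collection cycle of such a node-side daemon might look like, assuming a Linux compute node. The report fields, function names, and the use of JSON are assumptions made for illustration; the patent does not specify a message format, and network statistics are omitted here for brevity.

```python
import json
import os
import shutil
import socket
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {'MemTotal': kB, 'MemAvailable': kB, ...}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def build_report(hostname, load1, meminfo, disk_total, disk_free, now=None):
    """Assemble one status report (hypothetical format) for the control node."""
    total = meminfo.get("MemTotal", 0)
    avail = meminfo.get("MemAvailable", 0)
    return {
        "host": hostname,
        "time": time.time() if now is None else now,
        "load1": load1,
        "mem_used_pct": round(100.0 * (total - avail) / total, 1) if total else None,
        "disk_used_pct": round(100.0 * (disk_total - disk_free) / disk_total, 1)
        if disk_total else None,
    }


def collect_local_report():
    """Gather CPU load, memory, and root-disk usage on the local Linux node."""
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    usage = shutil.disk_usage("/")
    return build_report(socket.gethostname(), os.getloadavg()[0], mem,
                        usage.total, usage.free)
```

A daemon loop could call `collect_local_report()` at a fixed interval, serialize the result with `json.dumps`, and hand it to an SSH invocation that delivers it to the receiving process on the control node; that transport step is left out because its details are not described in the text.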
(2) Fault-tolerance support module
Fig. 4 shows a schematic diagram of the node fault-tolerance management mechanism. As shown in Fig. 4, the fault-tolerance support module is mainly responsible for the coordinated management of compute nodes and redundant backup nodes, and resolves system failures caused by the failure of an individual compute or control unit.
The present invention combines the Slurm node management mechanism with an intelligent application management mechanism to propose a new node fault-tolerance backup method.
Step 1: The Slurmctld daemon is deployed on the control node. It periodically reads the Slurm configuration file and the previously saved state information and periodically writes the full controller state to disk. If the control node fails, the incremental change information is written to disk immediately, and the system then switches control nodes automatically according to the backup control node set in advance in the configuration file slurm.conf.
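To make the backup-controller setting concrete, the sketch below embeds a small slurm.conf fragment and derives the failover order from it. It assumes the `SlurmctldHost` syntax of recent Slurm releases, where the first entry is the primary controller and later entries are backups (older configurations used `ControlMachine` and `BackupController` instead). The host names `ctl-a` and `ctl-b`, the state directory path, and the helper function names are hypothetical.

```python
import re

# Hypothetical slurm.conf fragment; host names and paths are placeholders.
EXAMPLE_CONF = """\
# The first SlurmctldHost entry is the primary controller; later entries are backups.
SlurmctldHost=ctl-a
SlurmctldHost=ctl-b
StateSaveLocation=/shared/slurm/state   # must be reachable from both controllers
SlurmctldTimeout=120                    # seconds a backup waits before taking over
"""


def controller_hosts(conf_text):
    """Return controller host names in failover order (primary first)."""
    hosts = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        # The value may be 'host' or 'host(address)'; keep only the host name.
        m = re.match(r"SlurmctldHost=([^\s(]+)", line, flags=re.IGNORECASE)
        if m:
            hosts.append(m.group(1))
    return hosts


def active_controller(hosts, failed):
    """Pick the first controller that is not in the set of failed hosts."""
    for host in hosts:
        if host not in failed:
            return host
    return None
```

Slurm performs the takeover itself; the helper only mirrors the ordering rule so that external tooling, such as a monitoring dashboard, can tell which controller is expected to be active.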
Step 2: The Slurmd daemon is deployed on the compute nodes to monitor the physical compute node and its operating system. The Slurmd process reports the compute node state through a heartbeat monitoring mechanism, covering conditions such as physical downtime or an operating system crash.
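The node states that Slurm derives from this reporting can be read back with `sinfo`. The sketch below parses the text produced by `sinfo -N -h -o "%N %T"` (one node name and state per line) and flags nodes that need attention. The function names and the choice of which states count as unhealthy are assumptions for illustration.

```python
def parse_sinfo(output):
    """Map node name -> state from `sinfo -N -h -o "%N %T"` output."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            states[parts[0]] = parts[1]  # a node listed in several partitions repeats
    return states


def unhealthy_nodes(states):
    """Return nodes that are down, drained, failing, or not responding ('*' suffix)."""
    bad = []
    for node, state in sorted(states.items()):
        not_responding = state.endswith("*")
        if not_responding or state.lower().startswith(("down", "drain", "fail")):
            bad.append(node)
    return bad
```

On a control node the input text could come from `subprocess.run(["sinfo", "-N", "-h", "-o", "%N %T"], capture_output=True, text=True).stdout`.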
Step 3: If the Slurm software itself fails or the upper-layer parallel computing application hangs, the failure cannot be detected by the mechanisms above. Therefore, an intelligent application management platform is built to monitor applications. On the control node, a monitoring daemon is started and maintained, which interacts with Slurm through an API to start, suspend, and cancel jobs. On the compute nodes, the intelligent application manager periodically sends heartbeat messages, according to an agreed heartbeat message protocol, to the health monitoring daemon on the control node, reporting errors such as application failures or job timeouts.
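The control-node side of such a heartbeat scheme might be sketched as below. The patent does not define the message protocol, so the message fields (`job_id`, `node`, `error`), the class name, and the timeout value are all assumptions; timestamps are passed in explicitly so the logic stays easy to test.

```python
class HeartbeatMonitor:
    """Track application heartbeats and report errors or missed heartbeats."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds without a heartbeat before a job is flagged
        self.last_seen = {}     # job_id -> time of the most recent heartbeat
        self.errors = {}        # job_id -> error text reported by the application

    def receive(self, message, now):
        """Record one heartbeat message (hypothetical dict format)."""
        job_id = message["job_id"]
        self.last_seen[job_id] = now
        if message.get("error"):
            self.errors[job_id] = message["error"]

    def check(self, now):
        """Return {job_id: problem} for reported errors and timed-out jobs."""
        problems = dict(self.errors)
        for job_id, seen in self.last_seen.items():
            if now - seen > self.timeout and job_id not in problems:
                problems[job_id] = "heartbeat timeout"
        return problems
```

For example, with a 30-second timeout, a job whose last heartbeat arrived at time 0 is reported as timed out when `check` is called at time 40, while a job that sent an explicit error is reported as soon as the message is received.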
Two different countermeasures are used for the failures that occur: if an application fails or a job times out, the intelligent application management platform restarts the job or migrates it to a spare compute node; if a physical node goes down, the management platform closes the node and migrates all jobs on it to a backup compute node.
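The two countermeasures can be expressed as a small decision function. This is a sketch only: the failure record format, the action names, and the rule of restarting first and migrating once the restart limit is reached are assumptions, since the text does not say how the platform chooses between restarting and migrating.

```python
def recovery_plan(failure, spare_nodes, jobs_on_node=()):
    """Return a list of abstract recovery actions for one detected failure."""
    kind = failure["kind"]
    if kind in ("app_error", "job_timeout"):
        # Assumed policy: restart first, migrate once the restart limit is reached.
        if failure.get("restarts", 0) < failure.get("max_restarts", 1):
            return [("restart_job", failure["job_id"])]
        if not spare_nodes:
            raise RuntimeError("no spare compute node available")
        return [("migrate_job", failure["job_id"], spare_nodes[0])]
    if kind == "node_down":
        if jobs_on_node and not spare_nodes:
            raise RuntimeError("no backup compute node available")
        # Close the failed node, then move every job it was running.
        plan = [("close_node", failure["node"])]
        plan += [("migrate_job", job_id, spare_nodes[0]) for job_id in jobs_on_node]
        return plan
    raise ValueError(f"unknown failure kind: {kind!r}")
```

In a stock Slurm installation, `restart_job` might map to `scontrol requeue <job_id>` and `close_node` to `scontrol update NodeName=<node> State=DOWN Reason=<text>`; how the patent's platform actually carries out each action is not specified.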
(3) Job management module
Fig. 5 shows the job deployment flow chart. As shown in Fig. 5, the job management module deploys, monitors, and allocates the jobs in the queue according to the current state of the cluster system resources. It is implemented on top of the Slurm job management mechanism and the node state monitoring process, and provides users with job deployment, job control, and result extraction functions.
1) Job deployment
Fig. 6 shows the job control flow chart. As shown in Fig. 6, job deployment mainly implements operations such as submitting, modifying, and deleting jobs. The node state monitoring process NodeMonitor on the control node first determines the number of available computing resources in the cluster; the information about the job to be run is then submitted, chiefly the job name, the amount of computing resources required, the job number, and so on.
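A sketch of the submission step, assuming the count of available nodes has already been obtained from the monitoring process. The function names are hypothetical; `--job-name` and `--nodes` are standard `sbatch` options, and the job number is read from the "Submitted batch job" line that `sbatch` prints on success.

```python
import re


def build_sbatch_command(job_name, nodes_needed, script, idle_nodes):
    """Check resources first, then build the sbatch argument list for the job."""
    if nodes_needed > idle_nodes:
        raise RuntimeError(
            f"job needs {nodes_needed} nodes but only {idle_nodes} are available")
    return ["sbatch", f"--job-name={job_name}", f"--nodes={nodes_needed}", script]


def parse_job_id(sbatch_stdout):
    """Extract the job number from sbatch output, or None if it is absent."""
    m = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    return int(m.group(1)) if m else None
```

Slurm itself would queue a job that cannot start immediately; the explicit check here only mirrors the flow described in the text, where resources are verified before submission.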
2) Job control
Job control mainly implements control operations such as suspending and releasing jobs. Jobs are controlled flexibly through Slurm commands such as scancel; if a higher-priority job arrives, the system suspends the current job until the higher-priority job finishes running. The job control process is as follows:
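As a rough illustration of the suspend-and-resume behaviour described above, the following sketch decides which commands to issue when a job arrives and when the higher-priority job ends. The text names `scancel`; in stock Slurm the documented commands for suspending and resuming a job are `scontrol suspend` and `scontrol resume`, so the sketch uses those as an assumption. The job record format and function names are hypothetical.

```python
def on_job_arrival(running, incoming):
    """Return the commands to issue when `incoming` arrives while `running` executes."""
    if running is not None and incoming["priority"] > running["priority"]:
        # Suspend the lower-priority job so the new job can run.
        return [["scontrol", "suspend", str(running["job_id"])]]
    return []


def on_high_priority_finished(suspended):
    """Return the commands to issue once the high-priority job has completed."""
    if suspended is not None:
        return [["scontrol", "resume", str(suspended["job_id"])]]
    return []
```

For instance, if job 101 with priority 10 is running and job 202 with priority 50 arrives, the first function returns a suspend command for job 101, and the second returns the matching resume command once job 202 has finished.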
3) Result extraction
Fig. 7 shows the result extraction flow chart. As shown in Fig. 7, result extraction mainly outputs the final computation result to the user. After the job finishes executing, its outcome is checked: if it succeeded, the final output is displayed; if it failed, the cause of the failure is returned, so that the user obtains the result more directly.
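The success check could, for example, be based on Slurm accounting data. The sketch below interprets one line of `sacct -j <job_id> -X -n -P -o State,ExitCode` output, which has the form `STATE|code:signal`. The function name and the returned dictionary layout are assumptions; the output text would typically be read from the job's output file, which `sbatch` names `slurm-<job_id>.out` by default.

```python
def summarize_result(sacct_line, output_text=""):
    """Turn one 'State|ExitCode' line into a success or failure summary."""
    state, _, exit_code = sacct_line.strip().partition("|")
    if state == "COMPLETED" and exit_code == "0:0":
        return {"ok": True, "output": output_text}
    # Anything else (FAILED, TIMEOUT, CANCELLED, NODE_FAIL, ...) is reported as a failure.
    return {"ok": False, "reason": f"state={state}, exit_code={exit_code}"}
```

A caller would show `output` to the user on success and `reason` on failure, matching the two branches of the flow described above.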
To address the difficulty of deploying and managing current heterogeneous high-performance clusters, the present invention proposes a management method for high-performance clusters built on Slurm (Simple Linux Utility for Resource Management). First, the hardware architecture of each node used to build the high-performance cluster is determined; second, the cluster is built through the script-based installation approach; finally, modules for resource management and monitoring, fault-tolerance support, and job management are provided for dynamic management of the cluster.
Configuring and managing current high-performance clusters is a tedious process. The present invention proposes a script-based installation approach that deploys a heterogeneous cluster built from nodes of different architectures, greatly simplifying the complex operations of installing and configuring a heterogeneous cluster. On this basis, the present invention provides a management method for heterogeneous clusters with modules for resource management and monitoring, fault-tolerance support, and job management which, through optimized and rational configuration, together provide users with a stable and efficient heterogeneous cluster environment.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the technical principles of the invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (9)

1. A Slurm-based high-performance computing cluster management method, comprising:
selecting an arbitrary machine as the control node, with the other machines serving as compute nodes;
obtaining the host names or IP information of all compute nodes in the cluster, and copying the cluster installation package and installation script to each compute node;
logging in from the control node to each compute node through the SSH service, and completing the setup and deployment of the cluster environment on the node through the installation script;
deploying a control receiving process on the control node to monitor computing resources and receive the information sent by the compute nodes; running a daemon on each compute node to manage the compute nodes in the cluster, the daemon periodically collecting the node state and on-node information and sending them to the control node through the SSH service;
performing coordinated management of the compute nodes and redundant backup nodes; and
based on the Slurm job management mechanism and a node state monitoring process, deploying, monitoring, and allocating the jobs in the queue according to the current state of the cluster system resources.
2. The Slurm-based high-performance computing cluster management method according to claim 1, characterized in that the node fault-tolerance backup method comprises:
deploying the Slurmctld daemon on the control node, which periodically reads the Slurm configuration file and the previously saved state information and periodically writes the full controller state to disk; if the control node fails, the incremental change information is written to disk immediately, and the system then switches control nodes according to the backup control node set in advance in the configuration file slurm.conf;
deploying the Slurmd daemon on the compute nodes to monitor the physical compute node and its operating system, the Slurmd process reporting the compute node state through a heartbeat monitoring mechanism.
3. The Slurm-based high-performance computing cluster management method according to claim 2, characterized in that a monitoring daemon is started and maintained on the control node, which interacts with Slurm through an API to start, suspend, and cancel jobs; and on the compute nodes, heartbeat messages are sent periodically, according to an agreed heartbeat message protocol, to the health monitoring daemon on the control node to report error information.
4. The Slurm-based high-performance computing cluster management method according to claim 3, characterized in that if an application fails or a job times out, the job is restarted or migrated to a spare compute node; and if a physical node goes down, the node is closed and all jobs on it are migrated to a backup compute node.
5. The Slurm-based high-performance computing cluster management method according to claim 1, characterized in that the host names or IP information of all compute nodes in the cluster are obtained from the cluster configuration file slurm.conf.
6. The Slurm-based high-performance computing cluster management method according to claim 1, characterized in that the cluster installation package and installation script are copied to each compute node with the scp command.
7. The Slurm-based high-performance computing cluster management method according to claim 1, characterized in that passwordless SSH login is set up between the cluster control node and each compute node.
8. The Slurm-based high-performance computing cluster management method according to claim 1, characterized in that the domestically produced Galaxy Kylin operating system is installed on all nodes.
9. The Slurm-based high-performance computing cluster management method according to claim 1, characterized in that deploying, monitoring, and allocating the jobs in the queue according to the current state of the cluster system resources, based on the Slurm job management mechanism and the node state monitoring process, comprises:
job deployment, comprising: determining the number of available computing resources in the cluster through the node state monitoring process NodeMonitor on the control node, and then submitting the information about the job to be run;
job control, comprising: controlling jobs through the scancel command in Slurm, and, if a higher-priority job arrives, suspending the current job until the higher-priority job finishes running;
result extraction, comprising: checking the outcome after the job finishes executing, displaying the final output if it succeeded, and returning the cause of the failure if it failed.
CN201910524257.XA 2019-06-18 2019-06-18 A kind of High-Performance Computing Cluster management method based on Slurm Pending CN110177020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910524257.XA CN110177020A (en) 2019-06-18 2019-06-18 A kind of High-Performance Computing Cluster management method based on Slurm

Publications (1)

Publication Number Publication Date
CN110177020A true CN110177020A (en) 2019-08-27

Family

ID=67697432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910524257.XA Pending CN110177020A (en) 2019-06-18 2019-06-18 A kind of High-Performance Computing Cluster management method based on Slurm

Country Status (1)

Country Link
CN (1) CN110177020A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103167041A (en) * 2013-03-28 2013-06-19 广州中国科学院软件应用技术研究所 System and method for supporting cloud environment application cluster automation deployment
CN105100267A (en) * 2015-08-24 2015-11-25 用友网络科技股份有限公司 Deployment apparatus and deployment method for large enterprise private cloud
CN106844021A (en) * 2016-12-06 2017-06-13 中国电子科技集团公司第三十二研究所 Computing environment resource management system and management method thereof
CN108319514A (en) * 2018-01-26 2018-07-24 山东超越数控电子股份有限公司 A kind of visual scheduling system based on Slurm job managements

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
任金龙: "基于Ansible 的云平台自动化部署的研究与实现", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
刘杨兵: "基于集群环境的作业管理中间件的研究与实现", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380086A (en) * 2019-09-29 2021-02-19 北京城建设计发展集团股份有限公司 Intelligent sensing control system and method for distributed micro-service architecture data center
CN111541591B (en) * 2020-07-09 2020-09-15 武汉绿色网络信息服务有限责任公司 SSH-based server detection method and device
CN111541591A (en) * 2020-07-09 2020-08-14 武汉绿色网络信息服务有限责任公司 SSH-based server detection method and device
CN111949389A (en) * 2020-08-11 2020-11-17 曙光信息产业(北京)有限公司 Slurm-based information acquisition method and device, server and computer-readable storage medium
CN112445595B (en) * 2020-11-26 2022-10-25 深圳晶泰科技有限公司 Multitask submission system based on slurm computing platform
CN112445595A (en) * 2020-11-26 2021-03-05 深圳晶泰科技有限公司 Multitask submission system based on slurm computing platform
WO2022109932A1 (en) * 2020-11-26 2022-06-02 深圳晶泰科技有限公司 Multi-task submission system based on slurm computing platform
CN112737934A (en) * 2020-12-28 2021-04-30 常州森普信息科技有限公司 Cluster type Internet of things edge gateway device and method
CN114584455A (en) * 2022-03-04 2022-06-03 吉林大学 Small and medium-sized high-performance cluster monitoring system based on enterprise WeChat
CN114584455B (en) * 2022-03-04 2023-06-30 吉林大学 Small and medium-sized high-performance cluster monitoring system based on enterprise WeChat
CN114745385A (en) * 2022-04-12 2022-07-12 吉林大学 Method for constructing slurm scheduling parallel computing cluster
CN114745385B (en) * 2022-04-12 2023-05-30 吉林大学 Method for constructing slurm scheduling parallel computing cluster
CN115202992B (en) * 2022-09-15 2022-11-22 中国空气动力研究与发展中心计算空气动力研究所 CFD operation convergence monitoring method for slurm scheduling system
CN115202992A (en) * 2022-09-15 2022-10-18 中国空气动力研究与发展中心计算空气动力研究所 CFD operation convergence monitoring method for slurm scheduling system
CN115834594A (en) * 2022-11-16 2023-03-21 贵州电网有限责任公司 Data collection method for improving high-performance computing application
CN115834594B (en) * 2022-11-16 2024-04-19 贵州电网有限责任公司 Data collection method for improving high-performance computing application

Similar Documents

Publication Publication Date Title
CN110177020A (en) A kind of High-Performance Computing Cluster management method based on Slurm
CN107544839B (en) Virtual machine migration system, method and device
US8364460B2 (en) Systems and methods for analyzing performance of virtual environments
CN108270726B (en) Application instance deployment method and device
US20130013766A1 (en) Computer cluster and method for providing a disaster recovery functionality for a computer cluster
WO2017067484A1 (en) Virtualization data center scheduling system and method
JP5305040B2 (en) Server computer switching method, management computer and program
CN108347339B (en) Service recovery method and device
CN110134518A (en) A kind of method and system improving big data cluster multinode high application availability
KR20200078328A (en) Systems and methods of monitoring software application processes
CN110580198B (en) Method and device for adaptively switching OpenStack computing node into control node
CN110445662A (en) OpenStack control node is adaptively switched to the method and device of calculate node
CN110958311A (en) YARN-based shared cluster elastic expansion system and method
CN113626280B (en) Cluster state control method and device, electronic equipment and readable storage medium
CN112948063A (en) Cloud platform creation method and device, cloud platform and cloud platform implementation system
CN112199178A (en) Cloud service dynamic scheduling method and system based on lightweight container
CN112612635B (en) Multi-level protection method for application program
CN110046064B (en) Cloud server disaster tolerance implementation method based on fault drift
CN106959885A (en) A kind of virtual machine High Availability realizes system and its implementation
Moghaddam et al. Self-healing redundancy for openstack applications through fault-tolerant multi-agent task scheduling
US10768996B2 (en) Anticipating future resource consumption based on user sessions
CN114598591B (en) Embedded platform node fault recovery system and method
Fu et al. Proactive resource management for failure resilient high performance computing clusters
Kitamura et al. Development of a Server Management System Incorporating a Peer-to-Peer Method for Constructing a High-availability Server System
Azaiez et al. Hybrid fault tolerance model for cloud dependability

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2019-08-27