CN115766714A - Public computing platform based on super computing

Public computing platform based on super computing

Info

Publication number
CN115766714A (application CN202211321995.2A)
Authority
CN
China
Prior art keywords
management, platform, computing, user, node
Prior art date
2022-10-27
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211321995.2A
Other languages
Chinese (zh)
Inventors
陈莉琳, 涂乔逵, 陈聪, 石维钗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Research Institute Fujian Information Industry Development Co ltd
Fujian Digital Fujian Cloud Computing Operation Co ltd
Original Assignee
Digital Research Institute Fujian Information Industry Development Co ltd
Fujian Digital Fujian Cloud Computing Operation Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-10-27
Filing date
2022-10-27
Publication date
2023-03-07
Application filed by Digital Research Institute Fujian Information Industry Development Co ltd, Fujian Digital Fujian Cloud Computing Operation Co ltd filed Critical Digital Research Institute Fujian Information Industry Development Co ltd
Priority to CN202211321995.2A
Publication of CN115766714A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a public computing platform based on super computing, comprising a management login node, an LDAP authentication server connected to the management login node, a storage system, and a computing system. The management login node hosts a management platform comprising a user management unit, which manages the identities and accounts of platform users, and a charging unit, which implements quota control and real-time charging for users. The management login node comprises a main management cluster and a slave management cluster; the user management unit and the charging unit are in communication connection with the main management cluster, which accesses the computing system through a Slurm task scheduling platform. Based on user quotas, the Slurm task scheduling platform dispatches nodes of the corresponding type in the computing system to complete computing tasks. The computing system is connected to a Singularity container platform that provides ready-to-use container images. The invention provides all-round service spanning computing power, algorithms, and data.

Description

Public computing platform based on super computing
Technical Field
The invention relates to the field of cloud computing, in particular to a public computing platform based on super computing.
Background
With the rapid growth of data volume, new technologies such as big data, artificial intelligence, the Internet of Things, and the mobile Internet place ever stronger demands on computing; beyond raw capacity, there is an urgent need for unified support and services covering a wider range of computing workloads. Traditional supercomputing mainly serves scientific computing for advanced research, and its architecture and requirements differ markedly from those of today's popular cloud computing. With the recent rise of the Internet of Things, big data, artificial intelligence, and VR/AR, more and more new technologies and applications need software and hardware platforms that supply strong computing power and data processing capability. At the same time, supporting these new technologies and architectures requires the traditional supercomputing architecture to fully absorb cloud computing's advantage of more flexible resource allocation. How a traditional supercomputing center can exploit its computing and data processing strengths while better serving highly complex computation and high-throughput data processing is a question the next generation of platform systems must answer. Moreover, more and more new technologies cannot be realized by IT teams alone; they demand the combined capability of traditional disciplinary research and information technology.
Disclosure of Invention
It is an object of the present invention to provide a public computing platform based on super computing.
The technical scheme adopted by the invention is as follows:
The public computing platform based on super computing comprises a management login node, an LDAP authentication server connected to the management login node, a storage system, and a computing system; the LDAP authentication server stores user data and uses it to authenticate and authorize users. The management login node is connected to the extranet and the intranet respectively through switch equipment and hosts a management platform comprising a user management unit and a charging unit; the user management unit handles platform users' identity and account management, permission control, quota information management, user group and membership management, and fee recharging, while the charging unit implements quota control and real-time charging for users. The management login node comprises at least two management clusters, one serving as the main management cluster and another as the slave management cluster; the user management unit and the charging unit are in communication connection with the main management cluster, the main management cluster is connected with a Slurm task scheduling platform, and the Slurm task scheduling platform accesses the computing system and, based on each user's quota control, calls nodes of the corresponding type in the computing system to complete computing tasks. The computing system is connected with a Singularity container platform that provides packaged, ready-to-use container images, cooperating with the Slurm task scheduling platform for high-performance computing;
the computing system comprises CPU computing nodes, memory computing nodes (fat nodes), and GPU computing nodes. The storage system adopts a parallel storage architecture and hosts the platform database. The management platform is also in communication connection with the slave management cluster, which is connected with a Module software management platform; the Module software management platform provides various compilation environments and realizes the management, rapid loading, and switching of software environment variables. The CPU, memory, and GPU computing nodes are connected with NFS file nodes, which are mounted on the computing system.
Further, user information is written into the platform database while a backup copy is simultaneously stored in the LDAP authentication server.
Further, the user management unit and the charging unit are in communication connection with the Slurm task scheduling platform through the standard LDAP protocol.
Further, identity and account management includes addition, deletion, modification, and querying of accounts.
Furthermore, the management platform further comprises a visualization unit that makes the platform usable through a visual interface; the visualization unit integrates one-stop services of Web Shell, file management, and job submission, providing online submission and management of job tasks on a Web interface as well as command-line access to the server through the same Web interface.
Furthermore, each management cluster comprises at least two management nodes; one management node of the main management cluster serves as the primary node and the other as a standby node to ensure normal system operation, while the two management nodes of the slave management cluster work as peers with no primary/secondary division.
Further, the public computing platform further comprises an NTP server and a DNS server, wherein the NTP server is used for time synchronization management of the computer, and the DNS server provides a network access address query service.
Further, remote operation and maintenance accesses the main management cluster through the SSH protocol to perform operation and maintenance of the management platform.
Furthermore, a firewall is deployed at the port where the switch equipment communicates with the external network.
By adopting the above technical scheme, which combines supercomputing containerization and virtualization technologies, the invention extends the traditional service models (IaaS, PaaS, and SaaS). It matches the user characteristics, service requirements, and application scenarios of a supercomputing artificial intelligence platform: it not only offers scientific research and computing users the strong computing power of a traditional computing center, but also provides a user-defined, dynamically switchable computing environment tailored to actual application requirements, delivering all-round services spanning computing power, algorithms, and data.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and the detailed description.
FIG. 1 is a schematic diagram of the public computing platform based on super computing according to the present invention;
FIG. 2 is a schematic diagram of the architecture of the public computing platform based on super computing according to the present invention;
FIG. 3 is a functional architecture diagram of the management platform according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
As shown in FIGS. 1 to 3, the present invention discloses a public computing platform based on super computing, which includes a management login node together with an LDAP authentication server, a storage system, and a computing system connected to the management login node; the LDAP authentication server stores user data and uses it to authenticate and authorize users. The management login node is connected to the extranet and the intranet respectively through switch equipment and hosts a management platform comprising a user management unit and a charging unit; the user management unit handles platform users' identity and account management, permission control, quota information management, user group and membership management, and fee recharging, while the charging unit implements quota control and real-time charging for users. The management login node comprises at least two management clusters, one serving as the main management cluster and another as the slave management cluster; the user management unit and the charging unit are in communication connection with the main management cluster, the main management cluster is connected with a Slurm task scheduling platform, and the Slurm task scheduling platform accesses the computing system and, based on each user's quota control, calls nodes of the corresponding type in the computing system to complete computing tasks. The computing system is connected with a Singularity container platform that provides packaged, ready-to-use container images, cooperating with the Slurm task scheduling platform for high-performance computing;
the computing system comprises CPU computing nodes, memory computing nodes (fat nodes), and GPU computing nodes. The storage system adopts a parallel storage architecture and hosts the platform database. The management platform is also in communication connection with the slave management cluster, which is connected with a Module software management platform; the Module software management platform provides various compilation environments and realizes the management, rapid loading, and switching of software environment variables. The CPU, memory, and GPU computing nodes are connected with NFS file nodes, which are mounted on the computing system.
CPU computing nodes, i.e., parallel nodes, form the main body of the high-performance computing cluster. A cluster system built from parallel processing nodes offers high cost-performance, mature technology, and rich application support, making it the first choice of computing architecture; the CPU computing nodes therefore adopt high-density compute nodes. Memory computing nodes (fat nodes) meet special requirements such as large memory, many cores per machine, and high local IO; compared with a two-socket cluster system, multi-socket fat nodes offer lower cost-performance and lower density, so the fat nodes employ multiple high-performance compute nodes. GPU and TPU nodes support graphics rendering, deep learning, AI, and similar applications: artificial intelligence and deep learning algorithms and computations involve hundreds of millions of parameters, and GPU and TPU nodes supply strong computing capacity and memory bandwidth. The GPU nodes adopt highly integrated GPU compute nodes, with several GPU accelerator cards configured in a single machine.
The invention mainly provides hardware resources such as servers, fat-node servers, GPU servers, and storage, drawing on the owner's existing resources. The computing system's cluster queues are divided into three types, namely CPU, GPU, and fat node, and each type is charged in either a shared or an exclusive mode. The platform uses CentOS 7 or later as the operating system, uses Slurm for supercomputing cluster job scheduling, and, based on Singularity, provides supercomputing users with different companion software environments and system dependencies. The management platform offers users a unified, browser-accessible Web interface implementing user management, charging management, high-performance computing task management, cloud platform virtual machine management, operation and maintenance management, and related functions.
Slurm (Simple Linux Utility for Resource Management) is an open-source, fault-tolerant, highly scalable cluster management and job scheduling system for large and small Linux clusters. The platform's job scheduling system is developed and deployed on SLURM: it provides highly scalable, fault-tolerant cluster management and job scheduling for the high-performance computing node cluster, allocates resources to the task queue appropriately, and monitors jobs until they complete.
The SLURM architecture is divided into two parts: controller daemons and user commands. SLURM was developed cooperatively by institutions including Lawrence Livermore National Laboratory, Bull, and Hewlett-Packard in the United States. It is highly scalable, has a sound fault-tolerance mechanism, is compatible with various UNIX systems without modifying the operating system kernel, and can manage resource allocation and job scheduling for many kinds of clusters and large-scale parallel processing systems. SLURM is deployed on numerous supercomputers; the cluster resource management systems of the Tianhe-1 and Tianhe-2 supercomputers, independently developed by the National University of Defense Technology, also use SLURM.
In addition, high-performance computing applications are numerous, and each requires a different companion software environment and a large number of system dependencies, such as particular versions of operating systems, libraries, and compilers. A computing cluster built on a container framework effectively isolates different HPC applications, resolves the dependency problems of installing and upgrading traditional HPC application programs and their packages, and makes system deployment, operation, and maintenance more efficient. Moreover, a container preserves the computing environment, data, and code, improving experimental reproducibility. Containers are portable, reproducible, and flexible; they start quickly and incur low physical performance loss compared with other virtualization technologies, giving them clear advantages in resource isolation, flexible scheduling, and avoiding extra virtualization load. Job scheduling management is also built into the container framework: the job scheduling management module allows compute nodes, queues, scheduling policies, and reserved resources to be configured through a Web page, making cluster administration convenient; it supports various scheduling policies and controls user (group) permissions, disk quotas, and resource quotas, preventing memory overflow and excessive consumption of system resources.
Container technology provides multiple system environment options beyond the host's own, meeting the needs of running different software. Once software and its dependencies have been packaged into a container, even a complex software environment runs seamlessly on many platforms without repeated configuration, greatly reducing operation and maintenance workload. An HPC cluster built on a container architecture effectively isolates different HPC applications, resolves the installation, upgrade, and packaging dependency problems of traditional HPC application programs, and makes system deployment, operation, and maintenance more efficient.
The supercomputing platform containerization of the invention adopts Singularity to meet the high-performance computing needs of scientific and applied data centers. Singularity is an application container designed for the software-incompatibility problems of the HPC environment.
A key feature of Singularity is that an ordinary user can conveniently use packaged container images in the cluster; combined with the job scheduling system, using a container image becomes as convenient as using any other application software. Another great advantage is that Singularity can directly use Docker images, which greatly improves its usability.
Advantages of Singularity: it has most of the general advantages of container technology, such as fast startup, low resource overhead, and easy migration and scaling. It also has some unique advantages over Docker's container technology. (1) Easier environment packaging and migration: everything Singularity depends on lives inside the image file, so migrating an environment means simply copying the image, with no separate export/import step; there is no complex caching mechanism, and the image is already compressed, occupying very little disk space. (2) Seamless integration with the existing system: system user permissions, networking, and the like directly inherit the host configuration; there is no need to enter an image before executing commands, and a command inside the image can be invoked from outside just like a locally installed command. (3) No daemon process: Singularity provides a complete runtime environment that needs no separate process when idle and occupies no resources; because execution is by direct instruction rather than through a daemon, resource-limitation and permission problems are also avoided. (4) Singularity supports multiple image and container file formats and can even use Docker-provided images directly, as simply as pulling an image from Docker Hub. Singularity integrates easily with existing HPC systems, forming a lightweight container cloud with little additional development.
Operational tests show that a container performs essentially the same as a traditional physical machine: container virtualization consumes very few hardware resources, the overhead of container technology is below 3%, and Singularity's performance is slightly better.
Further, user information is written into the platform database while a backup copy is simultaneously stored in the LDAP authentication server. The user management unit and the charging unit are in communication connection with the Slurm task scheduling platform through the standard LDAP protocol. Identity and account management includes the addition, deletion, modification, and querying of accounts.
Specifically, the user management unit mainly implements management of platform users' identities and accounts (including addition, deletion, modification, and querying), permission control, quota information management, user group and membership management, fee recharging, and the like. User information can be written into the platform database and also stored in LDAP, and it can be supplied through the standard LDAP protocol to systems such as the Slurm job scheduling system and the visualization platform. The design can be extended on the basis of LDAP to realize single sign-on across multiple systems and platforms through the CAS protocol.
The user management unit mainly comprises the following functional modules:
(I) New user management module: (1) system users are divided into three role types: team administrators, ordinary team users, and system administrators (see the role sketch after this list); (2) after logging in, a system administrator can add ordinary-user or team-administrator information on the user management interface, including user number, user name, password, permissions, contact details, mailbox, and account information; (3) the system administrator has the authority to assign supercomputing partitions to a new user and to specify the partition types, namely CPU, GPU, and fat node, each charged in a shared or an exclusive mode; (4) the system administrator has the authority to modify user information, including password, permissions, contact details, mailbox, and account information; (5) the system administrator has the authority to delete users from the platform; (6) after logging in, a team administrator can add ordinary team members on the user management interface, including user number, user name, password, permissions, contact details, mailbox, and the like.
(II) Account information management module: (1) the system administrator can add account information for a team or an individual, and the team administrator has the authority to maintain basic account information and apply the corresponding security settings; (2) the system administrator has the right to add new individual or group accounts; (3) the system administrator and team administrators have the authority to maintain their own basic account information, including modifying personal account information, checking account usage, and checking team members' account usage; (4) team administrators and individuals can perform security settings such as password management in this module, including password modification, binding a mobile phone number and a mailbox to the account, and setting account security questions; (5) a team administrator and ordinary team users can share the same account.
(III) User permission management module: (1) it includes a project group management function, with which a team administrator can maintain member information and query account cost information; (2) the team administrator has project group management authority and, after logging into the account, can maintain project group members: adding, querying, modifying, and deleting them; (3) the team administrator has the authority to add new project group members to the team account and to query the usage of all project group members.
(IV) Alarm management module: (1) individual users and team-administrator users can set an alarm threshold for the account, view alarm information, maintain it, and so on; (2) individual users and team-administrator users have the authority to set a user-defined alarm threshold for the account in this module, and when the account balance reaches the threshold the system sends a prompt message (illustrated in the sketch after this list); (3) they have the authority to query the expense alarm list and view the details of received alarm messages; (4) they have the authority to maintain alarm information, including deleting alarm records.
(V) User resource management module: (1) it includes HPC resource usage details, job information management, and related functions; (2) team administrators and individual users have the authority to query account resource usage details; (3) team administrators and individual users have the authority to query job information, including job start time, end time, and duration.
(VI) System management module: (1) it includes operation log management, personnel management, and team management; (2) the system administrator has the authority to maintain and query operation log information, including searching the logs; (3) the system administrator has the authority to maintain personnel account information and personal information, including adding, modifying, and deleting personnel and deleting accounts; (4) the system administrator has the authority to manage and maintain teams, including adding and deleting team accounts; (5) the operations of administrators and ordinary users are all logged.
The charging unit implements quota control and real-time billing for users. The HPC platform charges according to the specific number of CPUs, GPUs, and fat-node (memory) resources a user consumes, billing resources as they are used; each type is charged in either a shared or an exclusive mode, and itemized billing details are provided.
The charging unit mainly comprises the following modules:
(I) User account information management module: (1) it includes account balance detail queries and related functions; (2) the system administrator has the authority to recharge the accounts of users in the system (ordinary users and team administrators); (3) a team administrator has the authority to query the account balance details of the team and its members, and an individual user has the authority to query his or her own.
(II) Information acquisition module: (1) it includes resource information collection and statistics and job information collection; (2) the system collects resource usage information and job information as the basis for charging calculations.
(III) Charging policy management module: (1) it includes charging policy configuration, resource billing, and related functions; (2) the system administrator has the authority to set the charging rules of each resource partition in this module; (3) the system computes charges from each resource partition's charging policy together with the collected resource usage duration, usage amount, and similar data, and debits the account accordingly (see the sketch after this list).
(IV) Bill management module: (1) team administrators and individual users have the authority to query an account's product bills, historical bills, and similar records; (2) team administrators and individual users have the authority to view the account's fees and usage amounts, including querying itemized usage details.
Furthermore, the management platform further comprises a visualization unit that makes the platform usable through a visual interface; the visualization unit integrates one-stop services of Web Shell, file management, and job submission, providing online submission and management of job tasks as well as online command-line access to the server.
Specifically, the visualization unit provides the following functions: (1) cluster management: viewing and managing cluster and node state information; (2) file management: managing files under the job directory online, including creating, viewing, deleting, renaming, moving, downloading, and copying/pasting files, entering other directories, creating directories, uploading files, and displaying file information; (3) job management: viewing job states and details in the queue, and creating, editing, and submitting jobs; (4) an integrated online terminal.
Furthermore, each management cluster comprises at least two management nodes; one management node of the main management cluster serves as the primary node and the other as a standby node to ensure normal system operation, while the two management nodes of the slave management cluster work as peers with no primary/secondary division.
Further, the public computing platform further comprises an NTP server and a DNS server, wherein the NTP server is used for time synchronization management of the computer, and the DNS server provides a network access address query service.
Further, remote operation and maintenance accesses the main management cluster through the SSH protocol to perform operation and maintenance of the management platform.
Furthermore, a firewall is deployed at the port where the switch equipment communicates with the external network.
The invention adopts a construction scheme combining supercomputing containerization and virtualization technologies, extending the traditional service models (IaaS, PaaS, and SaaS). It matches the user characteristics, service requirements, and application scenarios of a supercomputing artificial intelligence platform: it not only offers scientific research and computing users the strong computing power of a traditional computing center, but also provides a user-defined, dynamically switchable computing environment tailored to actual application requirements, delivering all-round services spanning computing power, algorithms, and data.
It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. The embodiments and features of the embodiments in the present application may be combined with each other without conflict. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Claims (9)

1. A public computing platform based on super computing, characterized in that: it comprises a management login node, an LDAP authentication server connected to the management login node, a storage system, and a computing system; the LDAP authentication server stores user data and uses it to authenticate and authorize users; the management login node is connected to the extranet and the intranet respectively through switch equipment and hosts a management platform comprising a user management unit and a charging unit, the user management unit being used for platform users' identity and account management, permission control, quota information management, user group and membership management, and fee recharging, and the charging unit being used to implement quota control and real-time charging for users; the management login node comprises at least two management clusters, one serving as the main management cluster and another as the slave management cluster; the user management unit and the charging unit are in communication connection with the main management cluster, the main management cluster is connected with a Slurm task scheduling platform, the Slurm task scheduling platform accesses the computing system, and the Slurm task scheduling platform, based on users' quota control, calls nodes of the corresponding type in the computing system to complete computing tasks; the computing system is connected with a Singularity container platform, and the Singularity container platform provides packaged container images convenient for direct use so as to cooperate with the Slurm task scheduling platform for high-performance computing;
the computing system comprises CPU computing nodes, memory computing nodes, and GPU computing nodes; the storage system adopts a parallel storage architecture; the management platform is in communication connection with the slave management cluster, the slave management cluster is connected with a Module software management platform, the Module software management platform provides various compilation environments, and management, rapid loading, and switching of software environment variables are achieved through the Module software management platform; the CPU computing nodes, memory computing nodes, and GPU computing nodes are connected with NFS file nodes, and the NFS file nodes are mounted on the computing system.
2. The supercomputing-based public computing platform of claim 1, wherein: user information is written into the platform database while a backup copy is simultaneously stored in the LDAP authentication server.
3. The supercomputing-based public computing platform of claim 1, wherein: the user management unit and the charging unit are in communication connection with the Slurm task scheduling platform through the standard LDAP protocol.
4. The supercomputing-based public computing platform of claim 1, wherein: identity and account management includes addition, deletion, modification, and querying of accounts.
5. The supercomputing-based public computing platform of claim 1, wherein: the management platform further comprises a visualization unit that makes the platform usable through a visual interface; the visualization unit integrates one-stop services of Web Shell, file management, and job submission, and provides functions of submitting and managing job tasks on an online Web interface and accessing the server through a command line via the Web interface.
6. The supercomputing-based public computing platform of claim 1, wherein: each management cluster comprises at least two management nodes; one management node of the main management cluster serves as the primary node and the other as a standby node to ensure normal system operation; the two management nodes of the slave management cluster work as peers with no primary/secondary division.
7. The supercomputing-based public computing platform of claim 1, wherein: the public computing platform further comprises an NTP server and a DNS server, wherein the NTP server is used for time synchronization management of the computers, and the DNS server provides network access address query service.
8. The supercomputing-based public computing platform of claim 1, wherein: remote operation and maintenance accesses the main management cluster through the SSH protocol to perform operation and maintenance of the management platform.
9. The supercomputing-based public computing platform of claim 1, wherein: a firewall is deployed at the port where the switch equipment communicates with the external network.
CN202211321995.2A · Filed 2022-10-27 · Public computing platform based on super computing · Pending · CN115766714A (en)

Priority Applications (1)

Application Number: CN202211321995.2A · Priority/Filing Date: 2022-10-27 · Title: Public computing platform based on super computing

Applications Claiming Priority (1)

Application Number: CN202211321995.2A · Priority/Filing Date: 2022-10-27 · Title: Public computing platform based on super computing

Publications (1)

Publication Number: CN115766714A · Publication Date: 2023-03-07

Family

ID=85353472

Family Applications (1)

Application Number: CN202211321995.2A · Status: Pending · Filed: 2022-10-27 · Title: Public computing platform based on super computing

Country Status (1)

CN: CN115766714A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629382A (en) * 2023-05-29 2023-08-22 上海和今信息科技有限公司 Method for docking HPC cluster by machine learning platform based on Kubernetes, and corresponding device and system
CN116629382B (en) * 2023-05-29 2024-01-02 上海和今信息科技有限公司 Method, device and system for docking HPC cluster by machine learning platform based on Kubernetes
CN117056061A (en) * 2023-10-13 2023-11-14 浙江远算科技有限公司 Cross-supercomputer task scheduling method and system based on container distribution mechanism
CN117056061B (en) * 2023-10-13 2024-01-09 浙江远算科技有限公司 Cross-supercomputer task scheduling method and system based on container distribution mechanism
CN117473798A (en) * 2023-12-26 2024-01-30 国家超级计算天津中心 Simulation project management method, device, equipment and storage medium
CN117473798B (en) * 2023-12-26 2024-05-14 国家超级计算天津中心 Simulation project management method, device, equipment and storage medium
CN117972670A (en) * 2024-03-28 2024-05-03 北京大学 Cloud container mirror image building method and device

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination