CN115766714A - Public computing platform based on super computing

Public computing platform based on super computing

Info

Publication number
CN115766714A (application CN202211321995.2A)
Authority
CN
China
Prior art keywords
management, platform, computing, user, node
Prior art date
2022-10-27
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211321995.2A
Other languages
Chinese (zh)
Inventors
陈莉琳, 涂乔逵, 陈聪, 石维钗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Research Institute Fujian Information Industry Development Co ltd
Fujian Digital Fujian Cloud Computing Operation Co ltd
Original Assignee
Digital Research Institute Fujian Information Industry Development Co ltd
Fujian Digital Fujian Cloud Computing Operation Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-10-27
Filing date
2022-10-27
Publication date
2023-03-07
Application filed by Digital Research Institute Fujian Information Industry Development Co ltd, Fujian Digital Fujian Cloud Computing Operation Co ltd filed Critical Digital Research Institute Fujian Information Industry Development Co ltd
Priority to CN202211321995.2A
Publication of CN115766714A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a public computing platform based on super computing, comprising a management login node, an LDAP authentication server connected to the management login node, a storage system, and a computing system. The management login node hosts a management platform comprising a user management unit, which manages the identities and accounts of platform users, and a charging unit, which implements quota control and real-time charging for users. The management login node comprises a main management cluster and a slave management cluster; the user management unit and the charging unit are in communication connection with the main management cluster, which accesses the computing system through a Slurm task scheduling platform. Based on user quotas, the Slurm task scheduling platform dispatches nodes of the corresponding type in the computing system to complete computing tasks. The computing system is connected to a Singularity container platform that provides ready-to-use container images. The invention provides all-round service spanning computing power, algorithms, and data.

Description

Public computing platform based on super computing
Technical Field
The invention relates to the field of cloud computing, in particular to a public computing platform based on super computing.
Background
With the rapid growth of data volume, new technologies such as big data, artificial intelligence, the Internet of Things, and the mobile Internet place ever stronger demands on computing; beyond raw capacity, there is an urgent need for unified support and services covering a wider range of computing workloads. Traditional supercomputing mainly serves scientific computing for advanced research, and its architecture and requirements differ markedly from those of today's popular cloud computing. With the recent rise of the Internet of Things, big data, artificial intelligence, and VR/AR, more and more new technologies and applications need software and hardware platforms that supply strong computing power and data processing capability. At the same time, supporting these new technologies and architectures requires the traditional supercomputing architecture to fully absorb cloud computing's advantage of more flexible resource allocation. How a traditional supercomputing center can exploit its computing and data processing strengths while better serving highly complex computation and high-throughput data processing is a question the next generation of platform systems must answer. Moreover, more and more new technologies cannot be realized by IT teams alone; they demand the combined capability of traditional disciplinary research and information technology.
Disclosure of Invention
It is an object of the present invention to provide a public computing platform based on super computing.
The technical scheme adopted by the invention is as follows:
The public computing platform based on super computing comprises a management login node, an LDAP authentication server connected to the management login node, a storage system, and a computing system; the LDAP authentication server stores user data and uses it to authenticate and authorize users. The management login node is connected to the extranet and the intranet respectively through switch equipment and hosts a management platform comprising a user management unit and a charging unit; the user management unit handles platform users' identity and account management, permission control, quota information management, user group and membership management, and fee recharging, while the charging unit implements quota control and real-time charging for users. The management login node comprises at least two management clusters, one serving as the main management cluster and another as the slave management cluster; the user management unit and the charging unit are in communication connection with the main management cluster, the main management cluster is connected with a Slurm task scheduling platform, and the Slurm task scheduling platform accesses the computing system and, based on each user's quota control, calls nodes of the corresponding type in the computing system to complete computing tasks. The computing system is connected with a Singularity container platform that provides packaged, ready-to-use container images, cooperating with the Slurm task scheduling platform for high-performance computing;
the computing system comprises CPU computing nodes, memory computing nodes (fat nodes), and GPU computing nodes. The storage system adopts a parallel storage architecture and hosts the platform database. The management platform is also in communication connection with the slave management cluster, which is connected with a Module software management platform; the Module software management platform provides various compilation environments and realizes the management, rapid loading, and switching of software environment variables. The CPU, memory, and GPU computing nodes are connected with NFS file nodes, which are mounted on the computing system.
Further, user information is written into the platform database while a backup copy is simultaneously stored in the LDAP authentication server.
Further, the user management unit and the charging unit are in communication connection with the Slurm task scheduling platform through the standard LDAP protocol.
Further, identity and account management includes addition, deletion, modification, and querying of accounts.
Furthermore, the management platform further comprises a visualization unit that makes the platform usable through a visual interface; the visualization unit integrates one-stop services of Web Shell, file management, and job submission, providing online submission and management of job tasks on a Web interface as well as command-line access to the server through the same Web interface.
Furthermore, each management cluster comprises at least two management nodes; one management node of the main management cluster serves as the primary node and the other as a standby node to ensure normal system operation, while the two management nodes of the slave management cluster work as peers with no primary/secondary division.
Further, the public computing platform further comprises an NTP server and a DNS server, wherein the NTP server is used for time synchronization management of the computer, and the DNS server provides a network access address query service.
Further, remote operation and maintenance accesses the main management cluster through the SSH protocol to perform operation and maintenance of the management platform.
Furthermore, a firewall is deployed at the port where the switch equipment communicates with the external network.
By adopting the above technical scheme, which combines supercomputing containerization and virtualization technologies, the invention extends the traditional service models (IaaS, PaaS, and SaaS). It matches the user characteristics, service requirements, and application scenarios of a supercomputing artificial intelligence platform: it not only offers scientific research and computing users the strong computing power of a traditional computing center, but also provides a user-defined, dynamically switchable computing environment tailored to actual application requirements, delivering all-round services spanning computing power, algorithms, and data.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and the detailed description.
FIG. 1 is a schematic diagram of the public computing platform based on super computing according to the present invention;
FIG. 2 is a schematic diagram of the architecture of the public computing platform based on super computing according to the present invention;
FIG. 3 is a functional architecture diagram of the management platform according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
As shown in FIGS. 1 to 3, the present invention discloses a public computing platform based on super computing, which includes a management login node together with an LDAP authentication server, a storage system, and a computing system connected to the management login node; the LDAP authentication server stores user data and uses it to authenticate and authorize users. The management login node is connected to the extranet and the intranet respectively through switch equipment and hosts a management platform comprising a user management unit and a charging unit; the user management unit handles platform users' identity and account management, permission control, quota information management, user group and membership management, and fee recharging, while the charging unit implements quota control and real-time charging for users. The management login node comprises at least two management clusters, one serving as the main management cluster and another as the slave management cluster; the user management unit and the charging unit are in communication connection with the main management cluster, the main management cluster is connected with a Slurm task scheduling platform, and the Slurm task scheduling platform accesses the computing system and, based on each user's quota control, calls nodes of the corresponding type in the computing system to complete computing tasks. The computing system is connected with a Singularity container platform that provides packaged, ready-to-use container images, cooperating with the Slurm task scheduling platform for high-performance computing;
the computing system comprises CPU computing nodes, memory computing nodes (fat nodes), and GPU computing nodes. The storage system adopts a parallel storage architecture and hosts the platform database. The management platform is also in communication connection with the slave management cluster, which is connected with a Module software management platform; the Module software management platform provides various compilation environments and realizes the management, rapid loading, and switching of software environment variables. The CPU, memory, and GPU computing nodes are connected with NFS file nodes, which are mounted on the computing system.
CPU computing nodes, i.e., parallel nodes, form the main body of the high-performance computing cluster. A cluster system built from parallel processing nodes offers high cost-performance, mature technology, and rich application support, making it the first choice of computing architecture; the CPU computing nodes therefore adopt high-density compute nodes. Memory computing nodes (fat nodes) meet special requirements such as large memory, many cores per machine, and high local IO; compared with a two-socket cluster system, multi-socket fat nodes offer lower cost-performance and lower density, so the fat nodes employ multiple high-performance compute nodes. GPU and TPU nodes support graphics rendering, deep learning, AI, and similar applications: artificial intelligence and deep learning algorithms and computations involve hundreds of millions of parameters, and GPU and TPU nodes supply strong computing capacity and memory bandwidth. The GPU nodes adopt highly integrated GPU compute nodes, with several GPU accelerator cards configured in a single machine.
The invention mainly provides hardware resources such as servers, fat-node servers, GPU servers, and storage, drawing on the owner's existing resources. The computing system's cluster queues are divided into three types, namely CPU, GPU, and fat node, and each type is charged in either a shared or an exclusive mode. The platform uses CentOS 7 or later as the operating system, uses Slurm for supercomputing cluster job scheduling, and, based on Singularity, provides supercomputing users with different companion software environments and system dependencies. The management platform offers users a unified, browser-accessible Web interface implementing user management, charging management, high-performance computing task management, cloud platform virtual machine management, operation and maintenance management, and related functions.
Slurm (Simple Linux Utility for Resource Management) is an open-source, fault-tolerant, highly scalable cluster management and job scheduling system for large and small Linux clusters. The platform's job scheduling system is developed and deployed on SLURM: it provides highly scalable, fault-tolerant cluster management and job scheduling for the high-performance computing node cluster, allocates resources to the task queue appropriately, and monitors jobs until they complete.
The SLURM architecture is divided into two parts: controller daemons and user commands. SLURM was developed cooperatively by institutions including Lawrence Livermore National Laboratory, Bull, and Hewlett-Packard in the United States. It is highly scalable, has a sound fault-tolerance mechanism, is compatible with various UNIX systems without modifying the operating system kernel, and can manage resource allocation and job scheduling for many kinds of clusters and large-scale parallel processing systems. SLURM is deployed on numerous supercomputers; the cluster resource management systems of the Tianhe-1 and Tianhe-2 supercomputers, independently developed by the National University of Defense Technology, also use SLURM.
In addition, high-performance computing applications are numerous, and each requires a different companion software environment and a large number of system dependencies, such as particular versions of operating systems, libraries, and compilers. A computing cluster built on a container framework effectively isolates different HPC applications, resolves the dependency problems of installing and upgrading traditional HPC application programs and their packages, and makes system deployment, operation, and maintenance more efficient. Moreover, a container preserves the computing environment, data, and code, improving experimental reproducibility. Containers are portable, reproducible, and flexible; they start quickly and incur low physical performance loss compared with other virtualization technologies, giving them clear advantages in resource isolation, flexible scheduling, and avoiding extra virtualization load. Job scheduling management is also built into the container framework: the job scheduling management module allows compute nodes, queues, scheduling policies, and reserved resources to be configured through a Web page, making cluster administration convenient; it supports various scheduling policies and controls user (group) permissions, disk quotas, and resource quotas, preventing memory overflow and excessive consumption of system resources.
Container technology provides multiple system environment options beyond the host's own, meeting the needs of running different software. Once software and its dependencies have been packaged into a container, even a complex software environment runs seamlessly on many platforms without repeated configuration, greatly reducing operation and maintenance workload. An HPC cluster built on a container architecture effectively isolates different HPC applications, resolves the installation, upgrade, and packaging dependency problems of traditional HPC application programs, and makes system deployment, operation, and maintenance more efficient.
The supercomputing platform containerization of the invention adopts Singularity to meet the high-performance computing needs of scientific and applied data centers. Singularity is an application container designed for the software-incompatibility problems of the HPC environment.
A key feature of Singularity is that an ordinary user can conveniently use packaged container images in the cluster; combined with the job scheduling system, using a container image becomes as convenient as using any other application software. Another great advantage is that Singularity can directly use Docker images, which greatly improves its usability.
Advantages of Singularity: it has most of the general advantages of container technology, such as fast startup, low resource overhead, and easy migration and scaling. It also has some unique advantages over Docker's container technology. (1) Easier environment packaging and migration: everything Singularity depends on lives inside the image file, so migrating an environment means simply copying the image, with no separate export/import step; there is no complex caching mechanism, and the image is already compressed, occupying very little disk space. (2) Seamless integration with the existing system: system user permissions, networking, and the like directly inherit the host configuration; there is no need to enter an image before executing commands, and a command inside the image can be invoked from outside just like a locally installed command. (3) No daemon process: Singularity provides a complete runtime environment that needs no separate process when idle and occupies no resources; because execution is by direct instruction rather than through a daemon, resource-limitation and permission problems are also avoided. (4) Singularity supports multiple image and container file formats and can even use Docker-provided images directly, as simply as pulling an image from Docker Hub. Singularity integrates easily with existing HPC systems, forming a lightweight container cloud with little additional development.
Operational tests show that a container performs essentially the same as a traditional physical machine: container virtualization consumes very few hardware resources, the overhead of container technology is below 3%, and Singularity's performance is slightly better.
Further, user information is written into the platform database while a backup copy is simultaneously stored in the LDAP authentication server. The user management unit and the charging unit are in communication connection with the Slurm task scheduling platform through the standard LDAP protocol. Identity and account management includes the addition, deletion, modification, and querying of accounts.
Specifically, the user management unit mainly implements management of platform users' identities and accounts (including addition, deletion, modification, and querying), permission control, quota information management, user group and membership management, fee recharging, and the like. User information can be written into the platform database and also stored in LDAP, and it can be supplied through the standard LDAP protocol to systems such as the Slurm job scheduling system and the visualization platform. The design can be extended on the basis of LDAP to realize single sign-on across multiple systems and platforms through the CAS protocol.
The user management unit mainly comprises the following functional modules:
(I) New user management module: (1) system users are divided into three role types: team administrators, ordinary team users, and system administrators (see the role sketch after this list); (2) after logging in, a system administrator can add ordinary-user or team-administrator information on the user management interface, including user number, user name, password, permissions, contact details, mailbox, and account information; (3) the system administrator has the authority to assign supercomputing partitions to a new user and to specify the partition types, namely CPU, GPU, and fat node, each charged in a shared or an exclusive mode; (4) the system administrator has the authority to modify user information, including password, permissions, contact details, mailbox, and account information; (5) the system administrator has the authority to delete users from the platform; (6) after logging in, a team administrator can add ordinary team members on the user management interface, including user number, user name, password, permissions, contact details, mailbox, and the like.
(II) Account information management module: (1) the system administrator can add account information for a team or an individual, and the team administrator has the authority to maintain basic account information and apply the corresponding security settings; (2) the system administrator has the right to add new individual or group accounts; (3) the system administrator and team administrators have the authority to maintain their own basic account information, including modifying personal account information, checking account usage, and checking team members' account usage; (4) team administrators and individuals can perform security settings such as password management in this module, including password modification, binding a mobile phone number and a mailbox to the account, and setting account security questions; (5) a team administrator and ordinary team users can share the same account.
(III) User permission management module: (1) it includes a project group management function, with which a team administrator can maintain member information and query account cost information; (2) the team administrator has project group management authority and, after logging into the account, can maintain project group members: adding, querying, modifying, and deleting them; (3) the team administrator has the authority to add new project group members to the team account and to query the usage of all project group members.
(IV) Alarm management module: (1) individual users and team-administrator users can set an alarm threshold for the account, view alarm information, maintain it, and so on; (2) individual users and team-administrator users have the authority to set a user-defined alarm threshold for the account in this module, and when the account balance reaches the threshold the system sends a prompt message (illustrated in the sketch after this list); (3) they have the authority to query the expense alarm list and view the details of received alarm messages; (4) they have the authority to maintain alarm information, including deleting alarm records.
(V) User resource management module: (1) it includes HPC resource usage details, job information management, and related functions; (2) team administrators and individual users have the authority to query account resource usage details; (3) team administrators and individual users have the authority to query job information, including job start time, end time, and duration.
(VI) System management module: (1) it includes operation log management, personnel management, and team management; (2) the system administrator has the authority to maintain and query operation log information, including searching the logs; (3) the system administrator has the authority to maintain personnel account information and personal information, including adding, modifying, and deleting personnel and deleting accounts; (4) the system administrator has the authority to manage and maintain teams, including adding and deleting team accounts; (5) the operations of administrators and ordinary users are all logged.
The charging unit implements quota control and real-time billing for users. The HPC platform charges according to the specific number of CPUs, GPUs, and fat-node (memory) resources a user consumes, billing resources as they are used; each type is charged in either a shared or an exclusive mode, and itemized billing details are provided.
The charging unit mainly comprises the following modules:
(I) User account information management module: (1) it includes account balance detail queries and related functions; (2) the system administrator has the authority to recharge the accounts of users in the system (ordinary users and team administrators); (3) a team administrator has the authority to query the account balance details of the team and its members, and an individual user has the authority to query his or her own.
(II) Information acquisition module: (1) it includes resource information collection and statistics and job information collection; (2) the system collects resource usage information and job information as the basis for charging calculations.
(III) Charging policy management module: (1) it includes charging policy configuration, resource billing, and related functions; (2) the system administrator has the authority to set the charging rules of each resource partition in this module; (3) the system computes charges from each resource partition's charging policy together with the collected resource usage duration, usage amount, and similar data, and debits the account accordingly (see the sketch after this list).
(IV) Bill management module: (1) team administrators and individual users have the authority to query an account's product bills, historical bills, and similar records; (2) team administrators and individual users have the authority to view the account's fees and usage amounts, including querying itemized usage details.
Furthermore, the management platform further comprises a visualization unit that makes the platform usable through a visual interface; the visualization unit integrates one-stop services of Web Shell, file management, and job submission, providing online submission and management of job tasks as well as online command-line access to the server.
Specifically, the visualization unit provides the following functions: (1) cluster management: viewing and managing cluster and node state information; (2) file management: managing files under the job directory online, including creating, viewing, deleting, renaming, moving, downloading, and copying/pasting files, entering other directories, creating directories, uploading files, and displaying file information; (3) job management: viewing job states and details in the queue, and creating, editing, and submitting jobs; (4) an integrated online terminal.
Furthermore, each management cluster comprises at least two management nodes; one management node of the main management cluster serves as the primary node and the other as a standby node to ensure normal system operation, while the two management nodes of the slave management cluster work as peers with no primary/secondary division.
Further, the public computing platform further comprises an NTP server and a DNS server, wherein the NTP server is used for time synchronization management of the computer, and the DNS server provides a network access address query service.
Further, remote operation and maintenance accesses the main management cluster through the SSH protocol to perform operation and maintenance of the management platform.
Furthermore, a firewall is deployed at the port where the switch equipment communicates with the external network.
The invention adopts a construction scheme combining supercomputing containerization and virtualization technologies, extending the traditional service models (IaaS, PaaS, and SaaS). It matches the user characteristics, service requirements, and application scenarios of a supercomputing artificial intelligence platform: it not only offers scientific research and computing users the strong computing power of a traditional computing center, but also provides a user-defined, dynamically switchable computing environment tailored to actual application requirements, delivering all-round services spanning computing power, algorithms, and data.
It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. The embodiments and features of the embodiments in the present application may be combined with each other without conflict. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Claims (9)

1. A public computing platform based on super computing, characterized in that: it comprises a management login node, an LDAP authentication server connected to the management login node, a storage system, and a computing system; the LDAP authentication server stores user data and uses it to authenticate and authorize users; the management login node is connected to the extranet and the intranet respectively through switch equipment and hosts a management platform comprising a user management unit and a charging unit, the user management unit being used for platform users' identity and account management, permission control, quota information management, user group and membership management, and fee recharging, and the charging unit being used to implement quota control and real-time charging for users; the management login node comprises at least two management clusters, one serving as the main management cluster and another as the slave management cluster; the user management unit and the charging unit are in communication connection with the main management cluster, the main management cluster is connected with a Slurm task scheduling platform, the Slurm task scheduling platform accesses the computing system, and the Slurm task scheduling platform, based on users' quota control, calls nodes of the corresponding type in the computing system to complete computing tasks; the computing system is connected with a Singularity container platform, and the Singularity container platform provides packaged container images convenient for direct use so as to cooperate with the Slurm task scheduling platform for high-performance computing;
the computing system comprises CPU computing nodes, memory computing nodes, and GPU computing nodes; the storage system adopts a parallel storage architecture; the management platform is in communication connection with the slave management cluster, the slave management cluster is connected with a Module software management platform, the Module software management platform provides various compilation environments, and management, rapid loading, and switching of software environment variables are achieved through the Module software management platform; the CPU computing nodes, memory computing nodes, and GPU computing nodes are connected with NFS file nodes, and the NFS file nodes are mounted on the computing system.
2. The supercomputing-based public computing platform of claim 1, wherein: user information is written into the platform database while a backup copy is simultaneously stored in the LDAP authentication server.
3. The supercomputing-based public computing platform of claim 1, wherein: the user management unit and the charging unit are in communication connection with the Slurm task scheduling platform through the standard LDAP protocol.
4. The supercomputing-based public computing platform of claim 1, wherein: identity and account management includes addition, deletion, modification, and querying of accounts.
5. The supercomputing-based public computing platform of claim 1, wherein: the management platform further comprises a visualization unit that makes the platform usable through a visual interface; the visualization unit integrates one-stop services of Web Shell, file management, and job submission, and provides functions of submitting and managing job tasks on an online Web interface and accessing the server through a command line via the Web interface.
6. The supercomputing-based public computing platform of claim 1, wherein: each management cluster comprises at least two management nodes; one management node of the main management cluster serves as the primary node and the other as a standby node to ensure normal system operation; the two management nodes of the slave management cluster work as peers with no primary/secondary division.
7. The supercomputing-based public computing platform of claim 1, wherein: the public computing platform further comprises an NTP server and a DNS server, wherein the NTP server is used for time synchronization management of the computers, and the DNS server provides network access address query service.
8. The supercomputing-based public computing platform of claim 1, wherein: remote operation and maintenance accesses the main management cluster through the SSH protocol to perform operation and maintenance of the management platform.
9. The supercomputing-based public computing platform of claim 1, wherein: a firewall is deployed at the port where the switch equipment communicates with the external network.
CN202211321995.2A · Filed 2022-10-27 · Public computing platform based on super computing · Pending · CN115766714A (en)

Priority Applications (1)

Application Number: CN202211321995.2A · Priority/Filing Date: 2022-10-27 · Title: Public computing platform based on super computing

Applications Claiming Priority (1)

Application Number: CN202211321995.2A · Priority/Filing Date: 2022-10-27 · Title: Public computing platform based on super computing

Publications (1)

Publication Number: CN115766714A · Publication Date: 2023-03-07

Family

ID=85353472

Family Applications (1)

Application Number: CN202211321995.2A · Status: Pending · Filed: 2022-10-27 · Title: Public computing platform based on super computing

Country Status (1)

CN: CN115766714A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629382A (en) * 2023-05-29 2023-08-22 上海和今信息科技有限公司 Method for docking HPC cluster by machine learning platform based on Kubernetes, and corresponding device and system
CN116629382B (en) * 2023-05-29 2024-01-02 上海和今信息科技有限公司 Method, device and system for docking HPC cluster by machine learning platform based on Kubernetes
CN117056061A (en) * 2023-10-13 2023-11-14 浙江远算科技有限公司 Cross-supercomputer task scheduling method and system based on container distribution mechanism
CN117056061B (en) * 2023-10-13 2024-01-09 浙江远算科技有限公司 Cross-supercomputer task scheduling method and system based on container distribution mechanism
CN117473798A (en) * 2023-12-26 2024-01-30 国家超级计算天津中心 Simulation project management method, device, equipment and storage medium
CN117473798B (en) * 2023-12-26 2024-05-14 国家超级计算天津中心 Simulation project management method, device, equipment and storage medium
CN117972670A (en) * 2024-03-28 2024-05-03 北京大学 Cloud container mirror image building method and device

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination