WO2021155667A1 - Model training method and apparatus, and clustering system - Google Patents

Model training method and apparatus, and clustering system Download PDF

Info

Publication number
WO2021155667A1
WO2021155667A1 · PCT/CN2020/117723
Authority
WO
WIPO (PCT)
Prior art keywords
target
node
request
cluster
computing
Prior art date
Application number
PCT/CN2020/117723
Other languages
French (fr)
Chinese (zh)
Inventor
骆宝童
丁瑞全
张恒华
胡在斌
黄凯文
李志�
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司 filed Critical 北京百度网讯科技有限公司
Publication of WO2021155667A1 publication Critical patent/WO2021155667A1/en

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • This application relates to the technical field of Artificial Intelligence (AI), and in particular to a model training method, device and cluster system.
  • the overall structure of HPC can be divided into the following main parts: external network, master node, compute node, storage, computation network, and management network.
  • the computing resources of the computing nodes include single-core central processing units (central processing unit, CPU), multi-core CPUs, or multi-CPUs.
  • the computing resources of a single computing node are mainly CPU-based, and the hardware capabilities are limited.
  • the above-mentioned HPC uses deep learning to train AI models with low efficiency.
  • the embodiments of the present application provide a model training method, device, and cluster system, which use computing nodes with GPU cards to improve the hardware capabilities of the cluster system, thereby improving the efficiency of model training.
  • an embodiment of the present application provides a cluster system, including: a control node, at least one computing node, and a storage node. The control node establishes a connection with each of the at least one computing node and is configured to allocate computing resources for the task of training the target model; each computing node includes at least one central processing unit (CPU) and at least one graphics processing unit (GPU) and is configured to use the computing resources to train the target model; the storage node establishes a network connection with each of the at least one computing node and is configured to store the data required for training the target model.
  • any two computing nodes in the at least one computing node establish a network connection based on the InfiniBand (infinite bandwidth) interconnection technology; the CPU and GPU inside a computing node are connected through PCIe (high-speed Peripheral Component Interconnect Express), and the GPUs inside a computing node are connected through NVLink.
  • an embodiment of the present application provides a model training method, which is applicable to a cluster system including a control node, at least one computing node, and a storage node.
  • the method includes: the control node receives a first request sent by the application program interface (API) server, where the first request is obtained by the API server according to the resource information required for training the target model, sent by the first user through the client on the first terminal; the control node allocates a target resource to the target model according to the resource information; and the control node sends a second request to the target computing node, so that the target computing node uses the target resource to train the target model.
  • the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied by a target computing node when training the target model, and the number of CPUs occupied by a target computing node when training the target model.
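The resource information above can be pictured as a small structured payload. The following is an illustrative sketch, not from the patent, of how a client might assemble it before submission to the API server; the field names (`num_target_nodes`, etc.) are assumptions:

```python
import json

def build_resource_info(num_nodes, gpus_per_node, cpus_per_node):
    """Assemble the resource information carried in the first request.
    Field names are hypothetical; the patent only lists the three quantities."""
    return {
        "num_target_nodes": num_nodes,   # number of target computing nodes
        "gpus_per_node": gpus_per_node,  # GPUs occupied per node during training
        "cpus_per_node": cpus_per_node,  # CPUs occupied per node during training
    }

# A training job asking for 2 nodes, each using 8 GPUs and 3 CPUs.
request_body = json.dumps(build_resource_info(2, 8, 3))
print(request_body)
```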
  • the above method further includes: the control node receives a management request sent by the second terminal device, where the management request is used to request management of the computing nodes in the cluster system, and the control node manages the computing nodes in the cluster system according to the management request.
  • the control node managing the computing nodes in the cluster system according to the management request includes: the control node calls the cluster open application program interface (Open API) to authenticate the second user; if the second user passes the authentication, the control node manages the computing nodes in the cluster system according to the management request.
  • the management request carries the access key identifier of the second user and a first key, where the first key is generated by the second terminal device using a preset authentication mechanism.
  • the control node calling the cluster Open API to authenticate the second user includes: the control node calls the cluster Open API and uses the preset authentication mechanism to generate a second key; if the first key is the same as the second key, the control node determines the management authority of the second user and sends authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
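The key-comparison flow above can be sketched as follows. The patent does not name the preset authentication mechanism; this illustration assumes an HMAC-SHA256 signature over the request content, with the control node looking up the shared secret via the access key identifier:

```python
import hashlib
import hmac

# Assumed mechanism (not specified in the patent): HMAC-SHA256.
def sign_request(secret_key: bytes, request_content: bytes) -> str:
    """Terminal side: generate the first key from the request content."""
    return hmac.new(secret_key, request_content, hashlib.sha256).hexdigest()

def authenticate(secret_key: bytes, request_content: bytes, first_key: str) -> bool:
    """Control node side: regenerate a second key with the same mechanism
    and compare it with the first key carried in the management request."""
    second_key = hmac.new(secret_key, request_content, hashlib.sha256).hexdigest()
    return hmac.compare_digest(first_key, second_key)

secret = b"per-user-secret"             # looked up via the access key identifier
content = b"DELETE /cluster/42"         # the management request being signed
first_key = sign_request(secret, content)
print(authenticate(secret, content, first_key))  # matching keys: authenticated
```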
  • the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster; or, the cluster Open API includes a machine management API, and the management request is used to request that any one of the at least one computing node perform any one of the following operations: going online, going offline, restarting, reinstalling, repairing, and shielding.
  • an embodiment of the present application provides a model training method, which is applicable to a cluster system including a control node, at least one computing node, and a storage node.
  • the method includes: a target computing node receives a second request sent by the control node, where the second request is sent after the control node receives the first request sent by the application program interface (API) server and allocates target resources to the target model.
  • the first request is obtained by the API server according to the resource information required for training the target model, sent by the first user through the client on the first terminal.
  • the target computing node is included in the at least one computing node; the target computing node uses the target resources to train the target model and sends the trained target model to the storage node.
  • the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied by a target computing node when training the target model, and the number of CPUs occupied by a target computing node when training the target model.
  • the above method further includes: the target computing node receives a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resources on the target computing node when training the target model;
  • the target computing node sends a query response to the first terminal device, where the query response carries the usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
  • an embodiment of the present application provides a model training device, including:
  • a receiving unit, configured to receive a first request sent by the application program interface (API) server, where the first request carries the resource information required for training the target model and is obtained by the API server according to the resource information sent by the first user through the client on the first terminal;
  • a processing unit, configured to allocate target resources to the target model according to the resource information; and
  • a sending unit, configured to send a second request to a target computing node, so that the target computing node uses the target resources to train the target model.
  • the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied by a target computing node when training the target model, and the number of CPUs occupied by a target computing node when training the target model.
  • the receiving unit is further configured to receive a management request sent by a second terminal device, where the management request is used to request management of computing nodes in the cluster system;
  • the processing unit is further configured to manage the computing nodes in the cluster system according to the management request.
  • when managing the computing nodes in the cluster system according to the management request, the processing unit calls the cluster open application program interface (Open API) to authenticate the second user; if the second user passes the authentication, the processing unit manages the computing nodes in the cluster system according to the management request.
  • the management request carries the access key identifier of the second user and the first key, and the first key is generated by the second terminal device using a preset authentication mechanism
  • the processing unit is configured to call the cluster Open API and use the preset authentication mechanism to generate a second key, and if the first key and the second key are the same, determine the management authority of the second user; the sending unit is further configured to send authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
  • the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster; or, the cluster Open API includes a machine management API, and the management request is used to request that any one of the at least one computing node perform any one of the following operations: going online, going offline, restarting, reinstalling, repairing, and shielding.
  • an embodiment of the present application provides a model training device, including:
  • a receiving unit, configured to receive a second request sent by the control node.
  • the second request is sent after the control node receives the first request sent by the application program interface (API) server and allocates target resources to the target model.
  • the first request is obtained by the API server according to the resource information required for training the target model, sent by the first user through the client on the first terminal, and the target node is included in the at least one computing node;
  • a processing unit, configured to use the target resources to train the target model; and
  • a sending unit, configured to send the trained target model to the storage node.
  • the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied by a target computing node when training the target model, and the number of CPUs occupied by a target computing node when training the target model.
  • the receiving unit is further configured to receive a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resources on the target computing node when training the target model;
  • the sending unit is further configured to send a query response to the first terminal device, where the query response carries the usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
  • an embodiment of the present application provides an electronic device, including:
  • at least one processor; and
  • a memory communicatively connected with the at least one processor; wherein,
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method in the second aspect or any possible implementation of the second aspect.
  • an embodiment of the present application provides an electronic device, including:
  • at least one processor; and
  • a memory communicatively connected with the at least one processor; wherein,
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method in the third aspect or any possible implementation of the third aspect.
  • the embodiments of the present application provide a computer program product containing instructions that, when run on an electronic device, cause the electronic device to execute the method in the above second aspect or in the various possible implementations of the second aspect.
  • the embodiments of the present application provide a computer program product containing instructions that, when run on an electronic device, cause the electronic device to execute the method in the above third aspect or in the various possible implementations of the third aspect.
  • the embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions.
  • the instructions, when run on an electronic device, cause the electronic device to perform the method in the foregoing second aspect or the various possible implementations of the second aspect.
  • the embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions.
  • the instructions, when run on an electronic device, cause the electronic device to execute the method in the foregoing third aspect or the various possible implementations of the third aspect.
  • an embodiment of the present application provides a cluster system, including: a control node and at least one computing node, wherein the control node establishes a network connection with each of the at least one computing node based on the Transmission Control Protocol (TCP), and the computing resources of each computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
  • an embodiment in the above application has the following advantage or beneficial effect: by interconnecting the control node and at least one computing node through a network and introducing GPUs as computing resources in the computing nodes, the hardware capabilities of the cluster system are greatly improved, thereby increasing the efficiency of model training.
  • using the HDFS file system to temporarily store the user's execution environment and to store the final running results avoids the disadvantage that the data set used to train the model would occupy too much storage space if stored on the computing nodes, and also avoids the insecurity of placing the trained model on the computing nodes.
  • Figure 1 is a schematic structural diagram of a cluster system provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of the underlying framework of a cluster system provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of network optimization of a cluster system provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of system-level performance constraint analysis of a cluster system provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of memory monitoring of computing nodes of a cluster system provided by an embodiment of the present application
  • Fig. 6 is a flowchart of a model training method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the system architecture of HGCP in the model training method provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the process of submitting tasks in the model training method provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the slurm OPEN API in the model training method provided by the embodiment of the present application.
  • FIG. 10 is a schematic diagram of an authentication process in the model training method provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of the deployment of api server in the model training method provided by the embodiment of the present application.
  • FIG. 12 is a working schematic diagram of the super management platform in the model training method provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a model training device provided by an embodiment of the application.
  • FIG. 14 is a schematic structural diagram of a model training device provided by an embodiment of the application.
  • Fig. 15 is a block diagram of an electronic device used to implement the model training method of an embodiment of the present application.
  • the overall structure of high-performance computing (HPC) can be divided into the following main parts: the external network, the master node, the compute nodes, storage, the computation network, and the management network.
  • a high-performance computing cluster is a branch of computer science that aims to solve complex parallel or numerical calculations. It is a loosely coupled collection of computing nodes (servers) that provides users with services such as high-performance computing, network request response, or professional applications (including parallel computing, databases, and the web). However, how to manage the computing nodes of a large-scale computing cluster and how to schedule training tasks is a thorny issue.
  • the embodiments of the present application provide a model training method, device, and cluster system.
  • in terms of hardware, the hardware capabilities of the cluster system are greatly improved, thereby improving the efficiency of model training; in terms of software, the slurm framework is optimized and clients, a super management platform, etc. are introduced, making the cluster system more convenient to use.
  • the embodiments of the present application will be described in detail from two aspects of hardware capability improvement and software capability improvement.
  • Fig. 1 is a schematic structural diagram of a cluster system provided by an embodiment of the present application.
  • the cluster system provided by the embodiment of the present application includes: a control node, at least one computing node, and a storage node; wherein, the control node establishes a connection with each of the at least one computing node, such as based on Transmission Control Protocol (TCP) network connection, etc.; the computing resources of the computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU);
  • the storage node establishes a network connection with each computing node in the at least one computing node to store data required for training the target model.
  • the storage node is, for example, a distributed file system (Hadoop Distributed File System, HDFS).
  • the data required by the target model includes the client, sample data set, etc.
  • the target model is also stored in the storage node.
  • the client is used to submit resource information, etc. to the API server, so that the API server integrates the resource information, obtains the first request, and submits it to the control node.
  • the API server is not shown in the figure; in actual implementation, the API server and the control node can be integrated or set independently. R&D personnel can log in to the cluster system through the first terminal device and submit the first request for requesting model training, etc.; the administrator can log in to the cluster system through the second terminal device to create clusters, delete clusters, and bring machines online, take machines offline, or shield machines, where a machine is a computing node.
  • first terminal device and the second terminal device may be the same device or different terminal devices, which is not limited in the embodiment of the present application.
  • the computing resources of each computing node include CPU and GPU.
  • a computing node is, for example, an all-in-one machine for AI model training with 3 CPUs and 8 GPUs, where the numbers of CPUs and GPUs can be flexibly configured.
  • the computing resources included in the computing node may also be a Field-Programmable Gate Array (FPGA), etc., which is not limited in the embodiment of the present application.
  • the HDFS file system is used to temporarily store the user's execution environment and to store the final running results. It avoids the disadvantage that the data set used to train the model would occupy too much storage space if stored on the computing nodes, and also avoids the insecurity of placing the trained model on the computing nodes.
  • the number of control nodes in the embodiment of the present application is not limited to one.
  • for example, the embodiment of the present application may set one master control node and one standby control node; when the master control node fails, the standby control node can be started.
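As a rough illustration of the master/standby arrangement (all names here are hypothetical; the patent does not specify the failover mechanism or health check):

```python
class ControlNode:
    def __init__(self, name):
        self.name = name
        self.active = False

    def start(self):
        self.active = True

def elect_active(master, standby, is_alive):
    """Use the master normally; start the standby only when the master fails.
    `is_alive` is a hypothetical health check, e.g. a heartbeat timeout."""
    if is_alive(master):
        master.start()
        return master
    standby.start()
    return standby

master, standby = ControlNode("master"), ControlNode("standby")
# Simulate a master failure: the health check reports it as down.
active = elect_active(master, standby, is_alive=lambda node: False)
print(active.name)
```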
  • control node and at least one computing node are interconnected through a network, and GPUs are introduced as computing resources in the computing nodes, thereby greatly improving the hardware capabilities of the cluster system and thereby improving the efficiency of model training.
  • using the HDFS file system to temporarily store the user's execution environment and to store the final running results avoids the disadvantage that the data set used to train the model would occupy too much storage space if stored on the computing nodes, and also avoids the insecurity of placing the trained model on the computing nodes.
  • the existing cluster system is referred to as a high-performance computing (HPC) system
  • the cluster system provided in the embodiments of the present application is referred to as a High Performance GPU Cluster Platform (HGCP).
  • FIG. 2 is a schematic diagram of the underlying framework of the cluster system provided by an embodiment of the present application.
  • the cluster system provided by the embodiment of the present application includes six layers of chip, system design, performance optimization, cluster, framework, and application from bottom to top.
  • the chip layer includes various computing resources, such as the CPU, GPU, FPGA, application-specific integrated circuit (ASIC), and other AI chips.
  • the system design layer includes cloud and edge AI all-in-one machines, high-performance storage pools, high-speed interconnection architecture, etc.
  • the performance optimization layer includes computation optimization, input/output (I/O) optimization, and communication optimization.
  • the cluster layer includes K8S (Kubernetes) cloud native, intelligent scheduling, automatic expansion and contraction, etc.
  • the framework layer includes deep learning frameworks such as PaddlePaddle, TF (TensorFlow), and Torch.
  • the application layer includes video, image, natural language understanding, search, recommendation or advertisement, etc.
  • the cluster system provided by this application is based on the slurm open source Linux cluster resource management system, which has good scalability and high fault tolerance.
  • the HGCP provided by the embodiments of the present application also has complete training task life process management, machine management, and fault monitoring capabilities, with a very high degree of automation.
  • the inherent functions of slurm include resource management functions and rich job scheduling functions, such as simple first-in-first-out (FIFO) scheduling, job priority calculation, and resource preemption.
  • the cluster system provided by the embodiment of the present application also supports the allocation of general computing resources such as GPU, network bandwidth and even memory.
  • in order to enable the high-speed circulation of AI training tasks in the cluster system, HGCP has built an efficient task scheduling system in the upper layer. It takes full account of the amount of high-quality resources in each business and the training tasks actually running and pending in the cluster, pools all resources, sets a high-quality logical quota for each business, and specifies the GPU usage ratio for single-node tasks and multi-node tasks, reducing the impact of resource fragmentation, effectively reducing cluster resource idleness, improving the efficiency of GPU cluster resource usage, and reducing operating costs.
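The quota logic described above might be sketched as follows (a minimal illustration; the function, the field names, and the numbers are invented, not from the patent):

```python
def can_schedule(task_gpus, business, quotas, usage):
    """Admit a task only if the business's remaining logical quota
    in the pooled resources covers the task's GPU demand."""
    remaining = quotas[business] - usage.get(business, 0)
    return task_gpus <= remaining

quotas = {"search": 64, "ads": 32}   # logical quota (GPUs) per business
usage = {"search": 60, "ads": 8}     # GPUs each business currently occupies

print(can_schedule(4, "search", quotas, usage))   # fits: 4 <= 64 - 60
print(can_schedule(8, "search", quotas, usage))   # would exceed the quota
```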
  • any two computing nodes in the at least one computing node establish a network connection based on the InfiniBand (infinite bandwidth) interconnection technology; the CPU and GPU inside a computing node are connected through PCIe (Peripheral Component Interconnect Express), and the GPUs inside a computing node are connected through NVLink.
  • Fig. 3 is a schematic diagram of network optimization of a cluster system provided by an embodiment of the present application.
  • Each computing node includes a CPU node (node) and a GPU box (BOX).
  • the CPU node includes CPU1 and CPU2.
  • the GPU BOX contains three Non-Volatile Memory Express (NVMe) drives, referred to as hard disks; in addition, the GPU BOX also includes 8 GPUs, shown as 0-8 in the figure, a network interface controller (NIC), PCIE SW, etc.
  • the solid arrow in the figure shows the PCIE connection, and the dashed arrow shows the NVlink connection.
  • the GPU part of the first computing node only indicates the PCIE connections, and the GPU part of the second computing node only indicates the NVlink connections; in practice, the GPU part of each computing node includes both PCIE connections and NVlink connections.
  • the cluster system uses InfiniBand (IB), a new I/O bus technology based on full-duplex, switched serial transmission, which replaces the MPI (Message Passing Interface) communication method commonly used in existing cluster systems and simplifies and improves the connection speed between computing nodes.
  • the CPU and GPU in a computing node are connected by PCIE, and the GPU and GPU are interconnected by NVlink, which greatly improves the communication between the GPU cards in the computing node.
  • the bandwidths and delays of PCIE, NVlink, and Ethernet/Remote Direct Memory Access (RDMA) networks vary widely, and the optimal resource combination needs to be allocated.
  • the HGCP provided in the embodiments of this application adopts topology-aware scheduling to optimize communication bandwidth.
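A minimal sketch of what topology-aware placement could look like, assuming made-up relative bandwidth ranks for NVLink, PCIe, and the network (the patent gives no figures or algorithm; everything below is illustrative):

```python
# Relative ranks only, not measured bandwidths: NVLink > PCIe > network.
LINK_BANDWIDTH = {"nvlink": 3, "pcie": 2, "network": 1}

def best_gpu_pair(free_gpus, links):
    """For a 2-GPU task, pick the free pair joined by the fastest link."""
    pairs = [(a, b) for i, a in enumerate(free_gpus) for b in free_gpus[i + 1:]]
    return max(pairs, key=lambda p: LINK_BANDWIDTH[links[frozenset(p)]])

# Hypothetical topology: GPUs 0-1 share NVLink, 0-2 share PCIe,
# 1-2 only reach each other over the network.
links = {
    frozenset((0, 1)): "nvlink",
    frozenset((0, 2)): "pcie",
    frozenset((1, 2)): "network",
}
print(best_gpu_pair([0, 1, 2], links))  # the NVLink-connected pair wins
```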
  • cluster utilization is the core evaluation indicator; increasing utilization is equivalent to reducing the cost of use. At the same time, it helps business training programs to perform data profiling and achieve good results in performance optimization.
  • the existing HPC has no system-level fine-grained performance analysis tools. To perform performance analysis, the usual method is to analyze a single node after consulting with the business; human intervention is required from startup to collection to data analysis, training must be started in coordination with the business, and only specific problems can be analyzed case by case, so the efficiency is low and the approach is not suitable for large-scale promotion.
  • the HGCP uses a deep learning system performance profiler (Dperf) to perform performance analysis on HGCP.
  • Dperf is a general-purpose, system-level, one-stop performance analysis and bottleneck locating system for deep learning training.
  • this tool uniformly captures the traffic information of key data paths on computing nodes, such as NET, IO, H2D, and P2P, together with the utilization information of key computing resources such as the CPU, Double Data Rate memory (DDR), and Graphics Double Data Rate memory (GDDR), and displays them on the same axis, which makes it convenient for the business to locate program bottlenecks and perform targeted optimization.
  • the Dperf training tool is combined with the cluster task scheduling to automatically monitor the tasks of the GPU training cluster.
  • the Dperf provided in the embodiments of the present application has the advantages of low overhead, multiple dimensions, easy scalability, fine granularity, and visualization; for example, refer to FIG. 4.
  • Fig. 4 is a schematic diagram of system-level performance constraint analysis of a cluster system provided by an embodiment of the present application.
  • the entire process of deep learning training involves environment preparation, data reading, data preprocessing, forward training, backward training, and parameter update.
  • data storage is constrained by the CPU, main memory, and hard disk I/O, while the training process is affected by factors such as the uplink/downlink links, video memory, and so on.
  • the Dperf system-level performance analysis tool can analyze which hardware constrains the program. For example, if data reading and preprocessing take a long time and the system has spare CPU and disk resources, more data processing processes can be opened to increase the data processing speed; if the training program waits a long time for training data, data processing and training can be executed asynchronously to reduce the waiting time.
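The asynchronous data processing suggested above can be illustrated with a bounded producer/consumer queue (a generic sketch, not the patent's implementation; `preprocess` and the training step are stand-ins):

```python
import queue
import threading

def preprocess(batch):
    return [x * 2 for x in batch]   # stand-in for real preprocessing

def producer(batches, q):
    """Preprocess batches ahead of time, concurrently with training."""
    for batch in batches:
        q.put(preprocess(batch))
    q.put(None)                     # sentinel: no more data

def train(q):
    """Consume preprocessed batches without waiting on raw data loading."""
    seen = []
    while (batch := q.get()) is not None:
        seen.append(sum(batch))     # stand-in for a training step
    return seen

q = queue.Queue(maxsize=4)          # bounded buffer limits memory use
t = threading.Thread(target=producer, args=([[1, 2], [3, 4]], q))
t.start()
losses = train(q)
t.join()
print(losses)
```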
• the current HPC computing nodes are limited by the number of GPU cards, communication, power consumption, heat dissipation and other issues, and their computing power density is low, so they cannot meet the needs of model training tasks.
• the HGCP provided in the embodiments of the present application utilizes computing nodes with GPUs, has high computing density and high heat-dissipation efficiency, and supports systemized hardware modules, standardized interconnection interfaces and flexible interconnection topologies. It leads the hardware development direction of AI computing and the development of AI hardware platforms, and effectively supports cluster AI training tasks.
• the current HPC lacks real-time fine-grained monitoring of each computing node and each computing task, so the utilization information of key resources such as the CPU, DDR, GPU and GDDR cannot be uniformly captured and displayed on the same axis. Users and administrators can only log in to a physical node to view the running status of the machine, or are passively informed of fault information by the business side, which greatly affects the operating efficiency of the cluster.
  • a monitoring platform and a hardware monitoring plug-in are deployed in the HGCP cluster to monitor and collect data in real time.
• the key performance data, such as the CPU, GPU, memory, network and storage of functional components such as the control nodes and computing nodes of the HGCP cluster, can be visually displayed in a graphical manner, so as to understand the operating status of the hardware environment, discover in time failures that may be hidden in HGCP, and provide solutions to the failures at the first opportunity.
• FIG. 5 is a schematic diagram of memory monitoring of a computing node of a cluster system provided in an embodiment of the present application. Please refer to FIG. 5. From 14:40 to 15:40, the memory occupation of a computing node is shown as the waveform in the figure.
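• a sliding-window sampler like the one behind such a memory waveform might be sketched as follows (illustrative only; the class name and one-hour window are assumptions, and a real plug-in would read the values from the node's hardware counters):

```python
import time
from collections import deque

class MemoryMonitor:
    """Keeps a sliding one-hour window of (timestamp, value) samples,
    like the 14:40-15:40 waveform in FIG. 5."""
    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.samples = deque()

    def record(self, value, now=None):
        now = time.time() if now is None else now
        self.samples.append((now, value))
        # Drop samples that have fallen out of the window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()

    def peak(self):
        return max(v for _, v in self.samples) if self.samples else None

mon = MemoryMonitor()
for ts, used_gb in [(0, 10.2), (600, 11.5), (1200, 9.8), (4000, 12.1)]:
    mon.record(used_gb, now=ts)
print(mon.peak())  # the sample at ts=0 has aged out of the window
```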
• the HGCP provided in the embodiments of this application did not have a smooth operation and maintenance process at the beginning of its construction, and needs to realize the proceduralization of operations, the standardization of procedures and the automation of standards. At the same time, operation and maintenance automation cannot solve all problems, and automation should not be pursued for its own sake: 20% of tasks are repetitive and consume 80% of time and energy, so concentrating on automating that 20% of repetitive tasks basically achieves a good state.
• the cluster automated operation and maintenance tool is designed to manage a large number of computing nodes and provides a unified graphical user interface.
  • the HGCP cluster provided by the embodiment of the present application performs machine management through a super-management platform system.
  • Fig. 6 is a flowchart of a model training method provided by an embodiment of the present application. This embodiment explains in detail the model training method described in the embodiment of the present application from the perspective of the interaction between the control node and the computing node.
  • the present embodiment includes:
  • the client on the first terminal sends resource information required for training the target model to the API server.
  • the control node receives the first request sent by the application program interface API server.
  • the first request is obtained by the API server according to the resource information required by the training target model sent by the first user through the client on the first terminal.
• the current HPC does not have a user client, so it is cumbersome for the first user to use the HPC to train a model.
• the first user usually refers to a person who trains a model, such as a researcher, and the model may be any of various AI models, such as a face recognition model or a face detection model, which is not limited in the embodiments of the present application.
  • the model training method provided by the embodiment of the present application encapsulates the HGCP in advance to obtain a client, which is stored on HDFS for download by the first user.
  • the first user downloads and installs the client on the first terminal device, and the client is used to submit training tasks to HGCP.
  • the control node allocates target resources to the target model according to the resource information.
• the current HPC training task management is extensive. Although it can serve multiple tenants, that is, it can be used by multiple first users at the same time, with different first users training different target models, different first users' usage requirements have peaks and troughs. Most existing slurm-based HPCs use the FIFO queuing mechanism by default, with no priority control and no support for oversubscription, so some first users' resources sit idle while other first users have no resources available.
  • the computing resources of HGCP include CPU, GPU, memory, FPGA, etc.
• the configuration interface is displayed on the display interface of the first terminal device for the first user to configure the number of computing nodes required for training the target model and, for each computing node, the CPUs, GPUs and so on to be used.
• the first terminal device generates the resource information required for training the target model according to the configuration input by the user and sends the resource information to the API server; the API server integrates the resource information and the like, generates a first request and sends it to the control node.
  • the control node allocates computing resources for the target model according to the first request.
• the resource information carried in the first request is 4 computing nodes and 16 GPUs, and the control node allocates 4 computing nodes to the target model. Assuming that there are 8 GPUs on each computing node, the 4 computing nodes may each provide 4 GPUs for the target model, or may provide unequal numbers of GPUs that still total 16, for example 6, 4, 4 and 2 GPUs in sequence.
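• the allocation rule in this example, namely that the per-node GPU counts must add up to the requested total, can be sketched as follows (illustrative Python; the function and parameter names are assumptions, not part of the embodiment):

```python
def split_gpus(total_gpus, num_nodes, per_node=None):
    """Return a per-node GPU allocation for a request such as
    '4 nodes, 16 GPUs'. If per_node is omitted, split evenly."""
    if per_node is None:
        if total_gpus % num_nodes:
            raise ValueError("uneven request needs an explicit split")
        per_node = [total_gpus // num_nodes] * num_nodes
    if len(per_node) != num_nodes or sum(per_node) != total_gpus:
        raise ValueError("split does not match the request")
    return per_node

print(split_gpus(16, 4))                 # even: 4 GPUs from each node
print(split_gpus(16, 4, [6, 4, 4, 2]))   # uneven, still 16 in total
```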
  • the control node sends a second request to the target computing node.
  • the target computing node is a computing node containing the target resource.
• after the control node configures the target resource for the target model, it sends a second request to the computing node containing the target resource to trigger the target computing node to train the target model.
  • the target computing node uses the target resource to train a target model.
  • the target computing node stores the trained target model in the storage node.
• continuing with the example in step 102 above, assuming that the target computing nodes that provide the 16 GPUs are computing node 1, computing node 2, computing node 3 and computing node 4, the four computing nodes serve as target computing nodes and train the target model in a distributed manner. After the training is completed, each node stores its trained part in the storage node, such as in HDFS.
• after receiving the first request sent by the API server, the control node allocates target resources to the target model according to the first request, and sends a second request to the target computing node containing the target resources to trigger the target computing node to perform model training and store the trained model in the HDFS system.
• software improvements generally include system-architecture improvements and slurm open application programming interface (Application Programming Interface, API) improvements. The two improvements are described in detail below.
  • FIG. 7 is a schematic diagram of the system architecture of HGCP in the model training method provided by an embodiment of the present application.
  • the HGCP system provided by the embodiment of the present application realizes complete isolation of users and resources.
• the first user downloads and installs the client from the HDFS system, and sends the resource information required for training the target model to the API server through the client, so that the API server integrates the resource information and the like to obtain the first request, and submits the first request for training the model to the control node.
• while the target task is running on the target node, the first user can send a query request to the target computing node through the first terminal device.
• the query request is used to request display of the usage status of the target resources on the target computing node when training the target model.
• after receiving the query request, the target computing node obtains the running status of the task of training the target model and the data generated during the running process. After that, the target computing node sends a query response to the first terminal device, where the query response carries the usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
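• the query exchange described above might take a shape like the following (a sketch; the field names such as task_id and gpu_utilization are illustrative, not defined by the embodiment):

```python
def handle_query(query, running_tasks):
    # The target computing node looks up the running task and returns
    # the usage status of the target resources in the query response.
    task = running_tasks.get(query["task_id"])
    if task is None:
        return {"status": "not_found"}
    return {
        "status": "ok",
        "gpu_utilization": task["gpu_utilization"],
        "cpu_utilization": task["cpu_utilization"],
        "memory_used_gb": task["memory_used_gb"],
    }

running_tasks = {
    "job-42": {"gpu_utilization": 0.87, "cpu_utilization": 0.35,
               "memory_used_gb": 21.4},
}
resp = handle_query({"task_id": "job-42"}, running_tasks)
print(resp["gpu_utilization"])
```

• the first terminal device would then render these fields, for example as the utilization curves shown in FIG. 5.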
• after the target model is trained, the target model is maintained on the HDFS system, and the first user or other users can download the final result from the HDFS system.
• each module in FIG. 7 will be described in detail below.
• based on the client stored on the HDFS system, the first user can download the client anywhere and send resource information to the API server through the client, so that the API server integrates the resource information to obtain a first request and sends the first request to the control node; one first request can be regarded as one task.
  • the resource information carried in the first request includes at least one of the following information: the number of target computing nodes, the number of GPUs that are occupied when the target computing node is used to train the target model, and the target computing node is used to train the The number of CPUs occupied in the target model, the path of the HDFS system, and the user name or password of the HDFS.
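• the resource information listed above could be carried in a payload such as the following (a sketch; the field names are assumptions for illustration):

```python
from dataclasses import dataclass, asdict

@dataclass
class FirstRequest:
    """Resource information listed in the description above.
    Field names are illustrative, not defined by the embodiment."""
    num_nodes: int       # number of target computing nodes
    gpus_per_node: int   # GPUs occupied on each node
    cpus_per_node: int   # CPUs occupied on each node
    hdfs_path: str       # path on the HDFS system
    hdfs_user: str       # HDFS user name

req = FirstRequest(num_nodes=4, gpus_per_node=4, cpus_per_node=8,
                   hdfs_path="hdfs://cluster/models/face-det",
                   hdfs_user="alice")
payload = asdict(req)   # what the API server would integrate and forward
print(payload["num_nodes"], payload["gpus_per_node"])
```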
• the background corresponding to the client uses the slurm OPEN API described in the embodiments of this application to perform operations such as submitting, viewing and terminating jobs and obtaining training data, and job submission adopts an asynchronous submission mode.
  • FIG. 8 is a schematic diagram of the process of submitting tasks in the model training method provided by the embodiment of the present application.
• the first user submits the job through the client on the first terminal device, the API service (server) performs request authentication, and the job is stored in the database after the authentication is passed.
  • the job manager running on the control node obtains the submitted job from the database and submits the job to HGCP.
• the job synchronization controller (Job SyncController) running on the computing node synchronizes the running status of the job to the monitor server (Monitor Server) and the slurm resource management system.
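• the asynchronous submission path, in which the API server only persists the job and the job manager later picks it up from the database, can be sketched as follows (illustrative Python using an in-memory SQLite table in place of the real database):

```python
import sqlite3

# Asynchronous submission: the API server only persists the job;
# a separate job manager polls the database and submits to the cluster.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, user TEXT, "
           "state TEXT)")

def api_submit(user):
    # API server path: authenticate (omitted) and store, then return
    # immediately without waiting for the cluster.
    cur = db.execute("INSERT INTO jobs (user, state) VALUES (?, 'pending')",
                     (user,))
    return cur.lastrowid

def job_manager_poll():
    # Job manager path: pick up pending jobs and hand them to the cluster.
    rows = db.execute("SELECT id FROM jobs WHERE state='pending'").fetchall()
    for (job_id,) in rows:
        db.execute("UPDATE jobs SET state='submitted' WHERE id=?", (job_id,))
    return [r[0] for r in rows]

job_id = api_submit("alice")
submitted = job_manager_poll()
print(submitted)  # the pending job was picked up asynchronously
```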
  • the HDFS system is a system used to temporarily store the user execution environment and store the final trained model, where the user execution environment is the aforementioned client.
• the embodiment of the present application does not limit the system to HDFS; in other feasible implementation manners, it may also be a file system private to the first user.
  • the resource scheduler is a module on the control node, which is used to allocate target resources to the target model according to the first request.
• the granularity of resource allocation is the GPU instead of the computing node. If a model training task of the first user cannot use up all the GPUs on the target computing node, the target computing node and its remaining GPUs can be allocated to other training tasks.
• the scheduler can support mixed scheduling of CPUs and GPUs at the same time. For example, when the first user submits a training task whose required resource is GPUs, if not all the GPUs are used up, other users can also submit training tasks using the remaining GPUs.
  • resources are divided at the granularity of computing nodes and GPUs, and one training task can be run on different GPUs of different computing nodes.
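• GPU-granularity allocation, where leftover GPUs on a node remain available to other tasks, can be sketched as follows (a simplified illustration, not slurm's actual scheduling logic):

```python
class GpuScheduler:
    """Allocates at GPU granularity: a task may span several nodes,
    and leftover GPUs on a node stay available for other tasks."""
    def __init__(self, gpus_per_node):
        self.free = dict(gpus_per_node)  # node -> free GPU count

    def allocate(self, n_gpus):
        grant = {}
        for node, free in sorted(self.free.items()):
            if n_gpus == 0:
                break
            take = min(free, n_gpus)
            if take:
                grant[node] = take
                self.free[node] -= take
                n_gpus -= take
        if n_gpus:  # not enough capacity: roll back the partial grant
            for node, take in grant.items():
                self.free[node] += take
            return None
        return grant

sched = GpuScheduler({"node1": 8, "node2": 8})
a = sched.allocate(10)   # spans two nodes
b = sched.allocate(6)    # fits in what node2 has left
print(a, b, sched.free)
```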
  • FIG. 9 is a schematic diagram of the slurm OPEN API in the model training method provided by the embodiment of the present application.
  • the architecture includes:
• Third-party platforms refer to some deep learning platforms, such as PaddleCloud and other platforms;
  • Cluster component refers to the slurm cluster client
  • API server refers to the unified entrance of slurm OPEN API, responsible for route analysis and request processing, etc.;
  • Authentication refers to the slurm cluster authentication service module
  • Database refers to the XDB data platform, which stores data such as user permissions, job information, and queue quota (quota);
• Job manager is used for job management control, responsible for job queuing and submission control;
  • job synchronization controller (job sync Controller) is responsible for synchronizing data such as job status, GPU utilization, GPU slot, node rank, and time;
• Queue synchronization controller is responsible for pushing queue update events to third-party platforms (new queue, queue quota update, etc.);
• Node monitoring service is deployed on each computing node, providing running data of the training jobs on that computing node.
• Open API interface authentication is mainly used to authenticate the identity of a request and judge the legitimacy of the current request. Common methods include token authentication and AK/SK authentication; for interface access security, the embodiments of this application use the AK/SK authentication method.
• specifically, the control node receives a management request sent by a second user using a second terminal device, where the management request carries the access key identifier of the second user and a first key, and the first key is generated by the second terminal device using a preset authentication mechanism.
• the control node then calls the cluster open application program interface (Open API) to authenticate the second user: the control node calls the cluster Open API to generate a second key using the same preset authentication mechanism.
• if the two keys are the same, the control node determines the management authority of the second user and, according to the management authority, sends to the second terminal device a data stream for updating the graphical interface of the management platform, so that the second terminal device updates and displays the graphical interface of the management platform, and the second user can manage the cluster system through the updated management platform graphical interface.
• the access key identifier (Access Key ID, AK) is used to identify the second user, and the first key is derived from a secret access key (Secret Access Key, SK) held by the second user.
• after receiving the management request sent by the second user, the control node uses the same preset authentication mechanism to generate an authentication string, referred to below as the second key. After that, the control node compares the first key in the management request with the generated second key; if the two keys are the same, it determines the management authority of the second user and performs the related operations. If the two keys are not the same, the control node ignores the operation and returns an error code to the second terminal device.
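• the AK/SK comparison described above can be sketched with an HMAC-based signature (a minimal illustration; the actual preset authentication mechanism is not specified by the embodiment, and HMAC-SHA256 is an assumption):

```python
import hashlib
import hmac

# Both sides derive a signature from the request with the shared SK;
# the control node compares its own result with the one the client sent.
def sign(secret_key, access_key_id, request_body):
    msg = (access_key_id + request_body).encode()
    return hmac.new(secret_key.encode(), msg, hashlib.sha256).hexdigest()

SK = "s3cr3t"          # shared secret, never sent on the wire
AK = "AKIDEXAMPLE"     # identifies the second user

first_key = sign(SK, AK, "DELETE /cluster/7")   # computed by the terminal
second_key = sign(SK, AK, "DELETE /cluster/7")  # recomputed by the control node

# Same mechanism, same inputs -> same key, so the request is served;
# any mismatch would be ignored and an error code returned.
authorized = hmac.compare_digest(first_key, second_key)
print(authorized)
```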
  • FIG. 10 is a schematic diagram of the authentication process in the model training method provided by the embodiment of the present application.
• the second user sends the AK/SK to the authentication service on the control node through the client on the second terminal device, and the authentication service returns a token to the second terminal device.
• after that, the client on the second terminal device sends a management request and the token to the API service on the control node, and the API service sends a management response to the second terminal device according to the management request and the token.
  • the second user is an administrator, which can be divided into multiple levels, such as cluster administrators, department administrators, ordinary users, etc., for example, see Table 2.
  • FIG. 11 is a schematic diagram of the deployment of api server in the model training method provided by the embodiment of the present application.
• the api server is deployed on three servers, server1, server2 and server3.
  • server1 deploys job_manager, job_sync_controller and 4 apiserver instances
• server2 and server3 each deploy 1 nginx instance
  • api server is bound to nginx
  • nginx is bound to BGW.
  • a super-management platform is set up for HGCP for machine management, cluster management, etc., mainly to provide the following main features for administrators and users:
• the HGCP super-management platform system runs on a Linux server and uses a MySQL database to store statistics, monitoring, configuration, log and other data.
• the back end is integrated into modules by general function and is developed with Hypertext Preprocessor (PHP), Python, Ansible and Shell, operating the database data and the computing nodes through the super-management platform API interface.
  • the front-end display page is for cluster administrators and ordinary users, simplifying operations as much as possible and improving efficiency;
• the HGCP provided in the embodiments of this application is equipped with multiple control nodes to ensure the disaster tolerance and service continuity of the management system. These control nodes use Ansible to remotely manage the cluster computing nodes to perform environment configuration, upgrade adjustment, system inspection and so on.
• the control node receives a management request sent by a second user using a second terminal device, where the management request is used to request management of the computing nodes in the cluster system and is obtained by the second terminal device according to the user's operation on the graphical interface of the management platform.
• the control node then calls the cluster open application program interface (Open API) to authenticate the second user.
• if the second user passes the authentication, the control node manages the computing nodes in the cluster system according to the management request.
• FIG. 12 is a working schematic diagram of the super-management platform in the model training method provided by the embodiment of the present application.
  • the cluster Open API includes cluster management API and machine management API.
  • the screen of the second terminal device displays the management platform graphical interface of the super-management platform.
• the cluster administrator performs cluster operations through the super-management platform graphical interface, which calls down to the cluster management API or the machine management API.
• when the cluster management API is called, the management request is used to request the creation or deletion of a cluster.
  • the cluster information in the database is configured, and the underlying cluster management module (cluster_manager) detects that there is a new operation task in the database and starts to perform related operations;
• when the machine management API is called, the management request is used to request any one of the at least one computing node to perform any one of the following operations: online, offline, restart, reinstall, repair and shield.
• the node information in the database is configured, and the node management module (node_manager) detects that there is a new operation task in the database and starts to perform the related operations.
  • Operations for clusters include:
• for cluster security, the administrator must first take all machines in the cluster offline before deleting the cluster. During the deletion process, first, the parameters are verified, including whether the cluster exists, whether there are still running machines in the cluster, whether the parameters are legal, and so on; then, the deletion task is written into the cluster operation task (cluster_task) table, the task operation (task_op) is set to uninstall, and the task status (task_status) is set to pending; finally, the cluster manager completes the real offline operation.
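• the delete-cluster flow above, verifying the parameters and then enqueuing a pending uninstall task for the cluster manager, can be sketched as follows (illustrative Python with an in-memory SQLite stand-in for the real tables; table and column names follow the description above, the cluster name is hypothetical):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cluster_info (name TEXT, running_nodes INTEGER)")
db.execute("CREATE TABLE cluster_task (name TEXT, task_op TEXT, "
           "task_status TEXT)")
db.execute("INSERT INTO cluster_info VALUES ('hgcp-a', 0)")

def delete_cluster(name):
    # First, verify the parameters: the cluster must exist and must
    # have no machines still running.
    row = db.execute("SELECT running_nodes FROM cluster_info WHERE name=?",
                     (name,)).fetchone()
    if row is None:
        return "no such cluster"
    if row[0] > 0:
        return "machines still running"  # must be taken offline first
    # Then, enqueue the task; the cluster manager does the real work later.
    db.execute("INSERT INTO cluster_task VALUES (?, 'uninstall', 'pending')",
               (name,))
    return "queued"

result = delete_cluster("hgcp-a")
task = db.execute("SELECT task_op, task_status FROM cluster_task").fetchone()
print(result, task)
```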
  • the basic information list of the cluster includes the cluster_info table and the cluster_task table.
• the cluster_info table contains the information of clusters that are already running online, and the cluster_task table contains the information of clusters whose operations are in process. If the two tables represent the same cluster and there is an offline operation, the status is based on the status in the cluster_task table.
  • the cluster_info table contains the online clusters, and the node_info table aggregates the required information.
  • the cluster machine list includes node_info table and node_task table.
  • the node_info table obtains the list of online machines
  • the node_task table obtains the list of machines in the process.
• bringing a machine online is an operation whose effect is to expand the capacity of the cluster system.
  • the parameters are verified, including verifying the existence of the cluster, and verifying the validity of the online parameters.
• then, the online task is written into the node operation task (node_task) table, the task operation (task_op) is set to install, and the task status (task_status) is set to pending; the information about the node to be brought online is written into the node information (node_info) table and marked as installing; finally, the node manager completes the actual online operation and updates the task and info tables.
• when a machine is taken offline, the machine is automatically marked as unschedulable first, and then the offline process is executed.
  • the parameters are first verified, including verifying the existence of the cluster, and verifying the validity of the offline parameters. After that, query the node information (node_info) table.
  • the parameters are first verified, including verifying whether the machine joins the cluster, and verifying the validity of the parameters. After that, query the cluster information (cluster_info) table to obtain the cluster apiserver address, and call the apiserver interface to complete the state shielding.
  • the parameters are first verified, including verifying whether the machine joins the cluster, and verifying the validity of the parameters. After that, query the cluster information (cluster_info) table to obtain the cluster apiserver address, and call the apiserver interface to complete the machine attribution label change.
  • FIG. 13 is a schematic structural diagram of a model training device provided by an embodiment of the application.
  • the device can be integrated in an electronic device or realized by an electronic device, and the electronic device can be a terminal device or a server.
  • the model training apparatus 100 may include:
• the receiving unit 11 is configured to receive a first request sent by an application program interface API server, where the first request is obtained by the API server according to the resource information required for training the target model sent by the first user through the client on the first terminal;
  • the processing unit 12 is configured to allocate target resources to the target model according to the resource information
  • the sending unit 13 is configured to send a second request to a target computing node, so that the target computing node uses the target resource to train a target model.
  • the resource information includes at least one of the following information: the number of target computing nodes, the number of GPUs occupied when the target computing node is used to train the target model, and the target computing node The number of CPUs occupied by the node when training the target model.
  • the receiving unit 11 is further configured to receive a management request sent by a second terminal device, and the management request is used to request management of computing nodes in the cluster system;
  • the processing unit 12 is further configured to manage the computing nodes in the cluster system according to the management request.
• when the processing unit 12 manages the computing nodes in the cluster system according to the management request, it calls the cluster open application program interface (Open API) to authenticate the second user, and after the second user passes the authentication, manages the computing nodes in the cluster system according to the management request.
• the management request carries the access key identifier of the second user and the first key, and the first key is generated by the second terminal device using a preset authentication mechanism;
  • the processing unit 12 is configured to call the cluster Open API to generate a second key using the preset authentication mechanism, and if the first key and the second key are the same, determine the second user Management authority;
  • the sending unit 13 is further configured to send authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
  • the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster;
  • the cluster Open API includes a machine management API, and the management request is used to request any one of the at least one computing node to perform any of the following operations: online, offline, restart, reinstall, maintenance, shield.
  • the device provided in the embodiment of the present application can be used in the method executed by the control node in the above embodiment, and its implementation principle and technical effect are similar, and will not be repeated here.
  • FIG. 14 is a schematic structural diagram of a model training device provided by an embodiment of the application.
  • the device can be integrated in an electronic device or realized by an electronic device, and the electronic device can be a terminal device or a server.
  • the model training device 200 may include:
  • the receiving unit 21 is configured to receive a second request sent by the control node.
  • the second request is sent after the control node receives the first request sent by the application program interface API server and allocates target resources to the target model.
  • the first request is obtained by the API server according to the resource information required for training the target model sent by the first user through the client on the first terminal, and the target node is included in the at least one computing node;
  • the processing unit 22 is configured to use the target resource to train the target model
  • the sending unit 23 is used to send the trained target model to the storage node.
  • the resource information includes at least one of the following information: the number of target computing nodes, the number of GPUs occupied when the target computing node is used to train the target model, and the target computing node The number of CPUs occupied by the node when training the target model.
• the receiving unit 21 is further configured to receive a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resources on the target computing node when training the target model;
  • the sending unit 23 is further configured to send a query response to the first terminal device, where the query response carries the usage status information of the target resource, so that the first terminal device displays the information according to the usage status information. State the usage status of the target resource.
  • the device provided in the embodiment of the present application can be used in the method executed by the target computing node in the above embodiment, and its implementation principles and technical effects are similar, and will not be repeated here.
  • Fig. 15 is a block diagram of an electronic device used to implement the model training method of an embodiment of the present application.
  • Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices can also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementation of the application described and/or required herein.
  • the electronic device includes: one or more processors 31, memory 32, and interfaces for connecting various components, including a high-speed interface and a low-speed interface.
  • the various components are connected to each other using different buses, and can be installed on a common motherboard or installed in other ways as needed.
  • the processor may process instructions executed in the electronic device, including instructions stored in or on the memory to display graphical information of the GUI on an external input/output device (such as a display device coupled to an interface).
  • an external input/output device such as a display device coupled to an interface.
• if necessary, multiple processors and/or multiple buses can be used with multiple memories.
  • multiple electronic devices can be connected, and each device provides part of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system).
  • One processor 31 is taken as an example in FIG. 15.
  • the memory 32 is a non-transitory computer-readable storage medium provided by this application.
  • the memory stores instructions executable by at least one processor, so that the at least one processor executes the model training method provided in this application.
  • the non-transitory computer-readable storage medium of the present application stores computer instructions, and the computer instructions are used to make a computer execute the model training method provided in the present application.
  • the memory 32 can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the model training method in the embodiment of the present application (for example, The receiving unit 11, the processing unit 12, and the sending unit 13 shown in FIG. 13, and the receiving unit 21, the processing unit 22, and the sending unit 23 shown in FIG. 14).
  • the processor 31 executes various functional applications and data processing of the server by running non-transitory software programs, instructions, and modules stored in the memory 32, that is, implementing the method of model training in the foregoing method embodiment.
  • the memory 32 may include a program storage area and a data storage area.
  • the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the model-training electronic device.
  • the memory 32 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • the memory 32 may optionally include memories remotely provided with respect to the processor 31, and these remote memories may be connected to an electronic device for model training via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the electronic equipment of the model training method may further include: an input device 33 and an output device 34.
  • the processor 31, the memory 32, the input device 33, and the output device 34 may be connected by a bus or in other ways. In FIG. 15, the connection by a bus is taken as an example.
  • the input device 33 can receive input digital or character information and generate key signal inputs related to the user settings and function control of the model-training electronic device, and may be, for example, a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, a joystick, or another input device.
  • the output device 34 may include a display device, an auxiliary lighting device (for example, LED), a tactile feedback device (for example, a vibration motor), and the like.
  • the display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some embodiments, the display device may be a touch screen.
  • Various implementations of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a dedicated or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and/or apparatus (for example, magnetic disks, optical disks, memory, or programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals.
  • "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
  • the systems and techniques described here can be implemented on a computer that has: a display device for displaying information to the user (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer.
  • Other types of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input, or tactile input).
  • the systems and technologies described herein can be implemented in a computing system that includes back-end components (for example, as a data server), or a computing system that includes middleware components (for example, an application server), or a computing system that includes front-end components (for example, a user computer with a graphical user interface or a web browser through which the user can interact with an implementation of the systems and technologies described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system can be connected to each other through any form or medium of digital data communication (for example, a communication network). Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.
  • the computer system can include clients and servers.
  • the client and server are generally remote from each other and usually interact through a communication network.
  • the relationship between client and server arises from computer programs that run on the respective computers and have a client-server relationship with each other.
  • An embodiment of the present application also provides a cluster system, including: a control node and at least one computing node, wherein the control node establishes a network connection with each computing node of the at least one computing node based on the transmission control protocol TCP;
  • the computing resources of the computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
  • the hardware capability of the cluster system is greatly improved, thereby improving the efficiency of model training; on the software side, the Slurm framework is optimized, and a client, a super management platform, and the like are introduced, making the cluster system more convenient to use.


Abstract

Disclosed are a model training method and apparatus, and a cluster system, relating to the technical field of artificial intelligence. In the specific implementation, on the hardware side, a control node and at least one compute node are interconnected over a network, and GPUs are introduced into the compute nodes as computing resources, so that the hardware capability of the cluster system is greatly improved and the model training efficiency is improved accordingly. On the software side, the cluster system is made more convenient to use by optimizing the Slurm framework and introducing a client, a super management platform, and the like.

Description

Model Training Method, Device, and Cluster System
This application claims priority to Chinese Patent Application No. 202010080825.4, filed with the Chinese Patent Office on February 5, 2020 and entitled "Model Training Method, Device and Cluster System", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of Artificial Intelligence (AI), and in particular to a model training method, device, and cluster system.
Background
With the continuous development of artificial intelligence, the demand for AI model training keeps growing. During AI model training, when the training data set is small, the results of deep learning are not ideal, and may even be inferior to those of relatively simple machine learning methods. However, as the data set grows, AI models trained by deep learning begin to outperform models trained by other machine learning methods.
In a common deep learning workflow, a large-scale data set is trained on a high performance computing (HPC) cluster to obtain an AI model. The overall structure of an HPC cluster can be divided into the following main parts: external network, master node, compute nodes, storage, computation network, and management network. The computing resources of a compute node include a single-core central processing unit (CPU), a multi-core CPU, or multiple CPUs.
In the HPC cluster described above, the computing resources of a single compute node are mainly CPUs, and the hardware capability is limited; as a result, such an HPC cluster trains AI models with deep learning inefficiently.
Summary
The embodiments of the present application provide a model training method, device, and cluster system, which use compute nodes equipped with GPU cards to improve the hardware capability of the cluster system, thereby improving the efficiency of model training.
In a first aspect, an embodiment of the present application provides a cluster system, including: a control node, at least one compute node, and a storage node. The control node establishes a connection with each of the at least one compute node and is configured to allocate computing resources for a task of training a target model. Each compute node includes at least one central processing unit (CPU) and at least one graphics processing unit (GPU), and is configured to train the target model using the computing resources. The storage node establishes a network connection with each of the at least one compute node and is configured to store the data required for training the target model.
In a feasible design, any two of the at least one compute node establish a network connection based on InfiniBand interconnect technology; within a compute node, the CPUs and GPUs are connected via Peripheral Component Interconnect Express (PCIe), and the GPUs are connected to each other via NVLink.
In a second aspect, an embodiment of the present application provides a model training method applicable to a cluster system of a control node, at least one compute node, and a storage node. The method includes: the control node receives a first request sent by an application program interface (API) server, where the first request is obtained by the API server according to the resource information, required for training a target model, that a first user sends through a client on a first terminal; the control node allocates a target resource to the target model according to the resource information; and the control node sends a second request to a target compute node, so that the target compute node trains the target model using the target resource.
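The flow in this aspect (client → API server → control node → target compute node) can be sketched minimally as follows; all message shapes, field names, and the first-fit allocation policy are illustrative assumptions, not details prescribed by this application:

```python
# Hedged sketch of the second-aspect flow: the control node receives the first
# request, allocates target resources, and builds the second request for the
# target compute node. Message shapes are hypothetical, not the patent's API.

def handle_first_request(first_request, cluster_nodes):
    """Control node: allocate target resources and build the second request."""
    resource_info = first_request["resource_info"]
    needed = resource_info["num_nodes"]
    # Pick compute nodes with enough free GPUs (simplified first-fit policy).
    targets = [n for n in cluster_nodes
               if n["free_gpus"] >= resource_info["gpus_per_node"]][:needed]
    if len(targets) < needed:
        raise RuntimeError("insufficient cluster resources")
    return {
        "model": first_request["model"],
        "nodes": [n["name"] for n in targets],
        "gpus_per_node": resource_info["gpus_per_node"],
        "cpus_per_node": resource_info["cpus_per_node"],
    }

nodes = [{"name": "cn0", "free_gpus": 8}, {"name": "cn1", "free_gpus": 2}]
req = {"model": "resnet", "resource_info":
       {"num_nodes": 1, "gpus_per_node": 4, "cpus_per_node": 16}}
print(handle_first_request(req, nodes)["nodes"])  # ['cn0']
```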
In a feasible design, the resource information includes at least one of the following: the number of target compute nodes, the number of GPUs occupied when the target compute nodes train the target model, and the number of CPUs occupied when the target compute nodes train the target model.
In a feasible design, the above method further includes: the control node receives a management request sent by a second terminal device, where the management request is used to request management of the compute nodes in the cluster system; and the control node manages the compute nodes in the cluster system according to the management request.
In a feasible design, the control node managing the compute nodes in the cluster system according to the management request includes: the control node calls a cluster open application program interface (Open API) to authenticate a second user; and if the second user passes the authentication, the control node manages the compute nodes in the cluster system according to the management request.
In a feasible design, the management request carries an access key identifier of the second user and a first key, where the first key is generated by the second terminal device using a preset authentication mechanism. The control node calling the cluster Open API to authenticate the second user includes: the control node calls the cluster Open API and generates a second key using the preset authentication mechanism; if the first key and the second key are the same, the control node determines the management authority of the second user, and sends authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority of the second user according to the authority information.
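The key comparison described in this design resembles a standard access-key/secret-key signature scheme. The sketch below assumes HMAC-SHA256 as the "preset authentication mechanism", which this application does not actually specify:

```python
import hashlib
import hmac

# Hedged sketch: AK/SK-style authentication. The "preset authentication
# mechanism" is unspecified in the application; HMAC-SHA256 is an assumption.
SECRETS = {"AK123": b"user-secret"}  # access key id -> secret key (control node)

def sign(secret, payload):
    """First key: computed by the terminal device over the request payload."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def authenticate(access_key_id, first_key, payload):
    """Control node: recompute the second key and compare it with the first."""
    secret = SECRETS.get(access_key_id)
    if secret is None:
        return False
    second_key = sign(secret, payload)
    return hmac.compare_digest(first_key, second_key)

payload = b"DELETE /cluster/node/cn7"
print(authenticate("AK123", sign(b"user-secret", payload), payload))  # True
```

Using a constant-time comparison (`hmac.compare_digest`) rather than `==` avoids leaking key material through timing differences.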
In a feasible design, the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster; or, the cluster Open API includes a machine management API, and the management request is used to request any one of the following operations on any one of the at least one compute node: bringing online, taking offline, restarting, reinstalling, repairing, and shielding.
In a third aspect, an embodiment of the present application provides a model training method applicable to a cluster system of a control node, at least one compute node, and a storage node. The method includes: a target compute node receives a second request sent by the control node, where the second request is sent after the control node receives a first request sent by an application program interface (API) server and allocates a target resource to a target model, the first request is obtained by the API server according to the resource information, required for training the target model, that a first user sends through a client on a first terminal, and the target compute node is included in the at least one compute node; the target compute node trains the target model using the target resource; and the target compute node sends the trained target model to a storage node.
In a feasible design, the resource information includes at least one of the following: the number of target compute nodes, the number of GPUs occupied when the target compute nodes train the target model, and the number of CPUs occupied when the target compute nodes train the target model.
In a feasible design, the above method further includes: the target compute node receives a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resource on the target compute node while the target model is being trained; and the target compute node sends a query response carrying the usage status information of the target resource to the first terminal device, so that the first terminal device displays the usage status of the target resource according to the usage status information.
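A minimal sketch of the query response in this design follows. The field names are hypothetical; in practice, per-GPU utilization on the target compute node might be read from a tool such as nvidia-smi:

```python
# Hedged sketch of the resource-usage query response. Field names are
# illustrative assumptions; the application does not fix a wire format.

def build_query_response(node_name, gpu_utils, cpu_util):
    """Target compute node: package the usage status of the target resource."""
    return {
        "node": node_name,
        "usage": {
            "gpu_percent": gpu_utils,                    # per-GPU utilization
            "gpu_avg": sum(gpu_utils) / len(gpu_utils),  # summary for display
            "cpu_percent": cpu_util,
        },
    }

resp = build_query_response("cn0", [91, 88, 95, 90], 37)
print(resp["usage"]["gpu_avg"])  # 91.0
```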
In a fourth aspect, an embodiment of the present application provides a model training device, including:
a receiving unit, configured to receive a first request sent by an application program interface (API) server, where the first request carries the resource information required for training a target model and is obtained by the API server according to the resource information that a first user sends through a client on a first terminal;
a processing unit, configured to allocate a target resource to the target model according to the resource information; and
a sending unit, configured to send a second request to a target compute node, so that the target compute node trains the target model using the target resource.
In a feasible design, the resource information includes at least one of the following: the number of target compute nodes, the number of GPUs occupied when the target compute nodes train the target model, and the number of CPUs occupied when the target compute nodes train the target model.
In a feasible design, the receiving unit is further configured to receive a management request sent by a second terminal device, where the management request is used to request management of the compute nodes in the cluster system; and
the processing unit is further configured to manage the compute nodes in the cluster system according to the management request.
In a feasible design, when managing the compute nodes in the cluster system according to the management request, the processing unit calls a cluster open application program interface (Open API) to authenticate a second user, and if the second user passes the authentication, manages the compute nodes in the cluster system according to the management request.
In a feasible design, the management request carries an access key identifier of the second user and a first key, where the first key is generated by the second terminal device using a preset authentication mechanism; the processing unit is configured to call the cluster Open API, generate a second key using the preset authentication mechanism, and, if the first key and the second key are the same, determine the management authority of the second user; and the sending unit is further configured to send authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority of the second user according to the authority information.
In a feasible design, the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster; or, the cluster Open API includes a machine management API, and the management request is used to request any one of the following operations on any one of the at least one compute node: bringing online, taking offline, restarting, reinstalling, repairing, and shielding.
In a fifth aspect, an embodiment of the present application provides a model training device, including:
a receiving unit, configured to receive a second request sent by a control node, where the second request is sent after the control node receives a first request sent by an application program interface (API) server and allocates a target resource to a target model, the first request is obtained by the API server according to the resource information required for training the target model that a first user sends through a client on a first terminal, and the target node is included in the at least one compute node;
a processing unit, configured to train the target model using the target resource; and
a sending unit, configured to send the trained target model to a storage node.
In a feasible design, the resource information includes at least one of the following: the number of target compute nodes, the number of GPUs occupied when the target compute nodes train the target model, and the number of CPUs occupied when the target compute nodes train the target model.
In a feasible design, the receiving unit is further configured to receive a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resource on the target compute node while the target model is being trained; and
the sending unit is further configured to send a query response carrying the usage status information of the target resource to the first terminal device, so that the first terminal device displays the usage status of the target resource according to the usage status information.
In a sixth aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method of the second aspect or of any possible implementation of the second aspect.
In a seventh aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method of the third aspect or of any possible implementation of the third aspect.
In an eighth aspect, an embodiment of the present application provides a computer program product containing instructions that, when run on an electronic device, cause the electronic device to execute the method of the second aspect or of the various possible implementations of the second aspect.
In a ninth aspect, an embodiment of the present application provides a computer program product containing instructions that, when run on an electronic device, cause the electronic device to execute the method of the third aspect or of the various possible implementations of the third aspect.
In a tenth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium storing computer instructions that, when run on an electronic device, cause the electronic device to execute the method of the second aspect or of the various possible implementations of the second aspect.
In an eleventh aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium storing computer instructions that, when run on an electronic device, cause the electronic device to execute the method of the third aspect or of the various possible implementations of the third aspect.
In a twelfth aspect, an embodiment of the present application provides a cluster system, including: a control node and at least one compute node, where the control node establishes a network connection with each of the at least one compute node based on the Transmission Control Protocol (TCP), and the computing resources of each compute node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
An embodiment of the above application has the following advantages or beneficial effects: by interconnecting the control node and at least one compute node over a network and introducing GPUs into the compute nodes as computing resources, the hardware capability of the cluster system is greatly improved, which in turn improves the efficiency of model training. In addition, using the HDFS file system to temporarily store the user execution environment and to store the final running results avoids the drawback of data sets used for model training occupying excessive storage space on the compute nodes, and also avoids the security drawback of placing trained models on the compute nodes.
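As context for the HDFS-based storage mentioned above, uploading a training result to HDFS is commonly done with the standard Hadoop CLI; the sketch below only builds the command, and the job/output path layout is a hypothetical example:

```python
# Sketch: staging a trained model into HDFS with the standard Hadoop
# `hdfs dfs -put` command (-f overwrites an existing file). The path layout
# shown is an illustrative assumption, not one defined by this application.

def hdfs_put_command(local_path, hdfs_dir):
    """Build the Hadoop CLI command that uploads a local file to HDFS."""
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir]

cmd = hdfs_put_command("model.ckpt", "/user/train/job-123/output/")
print(" ".join(cmd))  # hdfs dfs -put -f model.ckpt /user/train/job-123/output/
```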
Other effects of the above optional implementations will be described below in conjunction with specific embodiments.
Brief Description of the Drawings
The drawings are used to provide a better understanding of the solution and do not constitute a limitation on this application. In the drawings:
FIG. 1 is a schematic structural diagram of a cluster system provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of the underlying framework of a cluster system provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of network optimization of a cluster system provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of system-level performance constraint analysis of a cluster system provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of memory monitoring of a compute node of a cluster system provided by an embodiment of the present application;
FIG. 6 is a flowchart of a model training method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of the HGCP system architecture in the model training method provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of the task submission process in the model training method provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of the Slurm Open API in the model training method provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of the authentication process in the model training method provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of the deployment of the API server in the model training method provided by an embodiment of the present application;
FIG. 12 is a working schematic diagram of the super management platform in the model training method provided by an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a model training device provided by an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a model training device provided by an embodiment of the present application;
FIG. 15 is a block diagram of an electronic device used to implement the model training method of an embodiment of the present application.
具体实施方式Detailed ways
以下结合附图对本申请的示范性实施例做出说明,其中包括本申请实施例的各种细节以助于理解,应当将它们认为仅仅是示范性的。因此,本领域普通技术人员应当认识到,可以对这里描述的实施例做出各种改变和修改,而不会背离本申请的范围和精神。同样,为了清楚和简明,以下的描述中省略了对公知功能和结构的描述。The exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be regarded as merely exemplary. Therefore, those of ordinary skill in the art should realize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted in the following description.
Today, with the rapid development of artificial intelligence, heterogeneous computing platforms composed of both CPUs and GPUs are playing an increasingly important role. In the current era of big data, deep learning performs poorly when the training data set is small, which is one of the reasons deep learning long went unnoticed: a deep learning model trained on a small data set can be outperformed by relatively simple machine learning methods. When the data set is large, however, deep learning begins to outperform other machine learning approaches. High-performance computing (HPC) makes it possible to train models on these larger data sets, which has made HPC an important part of the development of artificial intelligence.
The overall structure of a typical HPC system used for model training can be divided into the following main parts: an external network, a master node, compute nodes, storage, a computation network, and a management network. The computing resources of a compute node include a single-core central processing unit (CPU), a multi-core CPU, or multiple CPUs.
In the HPC system described above, the computing resources of a single compute node are mainly CPUs, whose hardware capabilities are limited. As a result, such an HPC system trains AI models with deep learning at low efficiency.
Meanwhile, high-performance computing clusters form a branch of computer science aimed at solving complex scientific or numerical computations. Such a cluster is a loosely coupled collection of computing nodes (servers) that provides users with services such as high-performance computing, network request handling, and professional applications (including parallel computing, databases, and the web). However, how to manage the computing nodes of a large-scale cluster and how to schedule training tasks remain thorny problems. Although the industry currently uses the Simple Linux Utility for Resource Management (slurm) to manage cluster systems, existing work generally only optimizes how slurm's scheduling plug-ins are used and does not go beyond the slurm framework; that is, the slurm framework itself is not upgraded or optimized.
In view of this, the embodiments of the present application provide a model training method, apparatus, and cluster system. On the hardware side, introducing GPUs as computing resources greatly improves the hardware capability of the cluster system and thus the efficiency of model training. On the software side, optimizing the slurm framework and introducing a client, a super management platform, and the like makes the cluster system more convenient to use. The embodiments of the present application are described in detail below from the two aspects of hardware capability improvement and software capability improvement.
First, hardware capability improvement.
FIG. 1 is a schematic structural diagram of a cluster system provided by an embodiment of the present application. Referring to FIG. 1, the cluster system provided by the embodiment of the present application includes a control node, at least one computing node, and a storage node. The control node establishes a connection with each of the at least one computing node, for example a network connection based on the Transmission Control Protocol (TCP). The computing resources of a computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU). The storage node establishes a network connection with each of the at least one computing node and is used to store the data required for training a target model; the storage node is, for example, a Hadoop Distributed File System (HDFS). The data required for training the target model includes the client, sample data sets, and the like. In addition, after a computing node has trained the target model, the target model is also stored in the storage node. The client is used to submit resource information and the like to an API server, so that the API server integrates the resource information, obtains a first request, and submits it to the control node. The API server is not shown in the figure; in an actual implementation, the API server may be integrated with the control node or deployed independently. R&D personnel can log in to the cluster system through a first terminal device to submit the first request for model training and the like, and an administrator can log in to the cluster system through a second terminal device to perform operations such as creating a cluster, deleting a cluster, bringing machines online, taking machines offline, and blocking machines, where a machine is a computing node.
It should be noted that the first terminal device and the second terminal device may be the same device or different terminal devices, which is not limited in the embodiments of the present application.
In FIG. 1, the computing resources of each computing node include CPUs and GPUs. A computing node is, for example, an all-in-one machine for AI model training with 3 CPUs and 8 GPUs, where the numbers of CPUs and GPUs can be set flexibly. In addition, the computing resources of a computing node may also include a Field-Programmable Gate Array (FPGA) and the like, which is not limited in the embodiments of the present application.
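As a purely illustrative sketch (not part of the claimed embodiments), the cluster structure described above can be modeled as follows; all class names, node names, and resource counts are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class ComputeNode:
    """One compute node; counts mirror the example all-in-one machine above."""
    name: str
    cpus: int = 3
    gpus: int = 8

@dataclass
class Cluster:
    """Control node + storage node + a pool of compute nodes."""
    control_node: str
    storage_node: str
    compute_nodes: list = field(default_factory=list)

    def total_gpus(self) -> int:
        # the pooled GPU capacity the control node can allocate from
        return sum(n.gpus for n in self.compute_nodes)

cluster = Cluster("master", "hdfs://storage",
                  [ComputeNode(f"node{i}") for i in range(4)])
print(cluster.total_gpus())  # 32
```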
The HDFS file system temporarily stores the user's execution environment and stores the final running results. This avoids the drawback of data sets used for model training occupying excessive storage space on the computing nodes, and also avoids the security risk of leaving trained models on the computing nodes.
It should be noted that the number of control nodes in the embodiments of the present application is not limited to one. For example, to prevent the entire cluster system from going down after a control node failure, the embodiments of the present application may provide one master control node and one standby control node; when the master control node fails, the standby control node can be started.
In the cluster system provided by the embodiments of the present application, the control node and the at least one computing node are interconnected through a network, and GPUs are introduced into the computing nodes as computing resources, which greatly improves the hardware capability of the cluster system and thus the efficiency of model training. In addition, using the HDFS file system to temporarily store the user's execution environment and the final running results avoids the drawback of training data sets occupying excessive storage space on the computing nodes, and also avoids the security risk of leaving trained models on the computing nodes.
For clarity, the existing cluster system is hereinafter referred to as a high-performance computing (HPC) system, and the cluster system provided by the embodiments of the present application is referred to as a High-Performance GPU Cluster Platform (HGCP).
The hardware improvements are described in detail below in terms of the underlying framework, task scheduling, network optimization, performance profiling tools, computing nodes, real-time cluster monitoring, and cluster operation and maintenance management.
A. Underlying framework.
FIG. 2 is a schematic diagram of the underlying framework of the cluster system provided by an embodiment of the present application. Referring to FIG. 2, the cluster system provided by the embodiment of the present application includes, from bottom to top, six layers: chip, system design, performance optimization, cluster, framework, and application. The chip layer includes various computing resources, such as CPUs, GPUs, FPGAs, Application Specific Integrated Circuits (ASICs), and other AI chips. The system design layer includes cloud and edge AI all-in-one machines, high-performance storage pools, high-speed interconnect architectures, and the like. The performance optimization layer includes computation optimization, input/output (IO) optimization, communication optimization, and the like. The cluster layer includes K8S (Kubernetes) cloud native, intelligent scheduling, automatic scaling, and the like. The framework layer includes deep learning frameworks such as PaddlePaddle, TF, and Torch. The application layer includes video, images, natural language understanding, search, recommendation, advertising, and the like.
Referring to FIG. 2, the cluster system provided by the present application is based on the slurm open-source Linux cluster resource management system, which has good scalability and high fault tolerance. In addition to slurm's inherent functions, the HGCP provided by the embodiments of the present application also has complete training-task lifecycle management, machine management, and fault monitoring capabilities, with a very high degree of automation. Slurm's inherent functions include resource management and rich job scheduling functions, such as simple first-in-first-out (FIFO) queuing, job priority calculation, and resource preemption, and it provides good support for many different implementations of the Message Passing Interface (MPI). In addition, the cluster system provided by the embodiments of the present application also supports the allocation of general computing resources such as GPUs, network bandwidth, and even memory.
B. Task scheduling.
Existing HPC systems only use a few basic scheduling policies provided by slurm, such as FIFO. In the embodiments of the present application, to enable high-speed circulation of AI training tasks in the cluster system, the HGCP builds an efficient task scheduling system on top of slurm. It takes full account of the amount of high-priority resources owned by each business line and of the training tasks actually running and pending in the cluster, pools all resources, sets a high-priority logical quota for each business line, and specifies the GPU usage ratio between single-node tasks and multi-node tasks. This reduces the impact of resource fragmentation, effectively reduces idle cluster resources, improves the usage efficiency of GPU cluster resources, and lowers operating costs.
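A minimal sketch of the quota-aware scheduling idea described above, assuming a simple priority queue; the business names, quota values, and greedy policy are illustrative assumptions, not the claimed scheduler:

```python
import heapq

def schedule(tasks, free_gpus, quota):
    """tasks: list of (priority, business, gpus_needed).
    Higher priority runs first; no business may exceed its GPU quota."""
    used = {b: 0 for b in quota}
    heap = [(-p, i, b, g) for i, (p, b, g) in enumerate(tasks)]
    heapq.heapify(heap)
    launched = []
    while heap and free_gpus > 0:
        _, _, biz, need = heapq.heappop(heap)
        # launch only if the cluster and the business quota both allow it
        if need <= free_gpus and used[biz] + need <= quota[biz]:
            used[biz] += need
            free_gpus -= need
            launched.append((biz, need))
    return launched, free_gpus

launched, left = schedule(
    [(10, "vision", 8), (5, "nlp", 16), (8, "vision", 8)],
    free_gpus=24, quota={"vision": 8, "nlp": 16})
print(launched, left)  # [('vision', 8), ('nlp', 16)] 0
```

Here the second "vision" task is held back by its quota, so the lower-priority "nlp" task fills the remaining GPUs instead of leaving them idle.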
C. Network optimization.
Generally speaking, network communication is a major bottleneck in deep learning training. Deep learning computing tasks involve a large amount of computation and many intermediate results, which requires the cluster system to have an efficient message passing mechanism and the ability to store and access massive amounts of data; the efficiency of both depends largely on network speed. Most slurm-based HPC systems in the prior art use the Message Passing Interface (MPI) for message passing and parallel processing, which has two problems: slow message passing and high system CPU usage. At the same time, the network hardware of the computing nodes themselves also limits communication capability. To solve these problems, the HGCP provided by the embodiments of the present application optimizes the network: any two of the at least one computing node establish a network connection based on Infiniband interconnection; inside a computing node, the CPUs and GPUs are connected through Peripheral Component Interconnect Express (PCIE); and the GPUs inside a computing node are connected to each other through NVLink. For an example, see FIG. 3.
FIG. 3 is a schematic diagram of network optimization of the cluster system provided by an embodiment of the present application. Referring to FIG. 3, two computing nodes are shown, a first computing node and a second computing node. Each computing node includes a CPU node and a GPU box (BOX). The CPU node contains CPU1 and CPU2. The GPU BOX contains three Non-Volatile Memory Express (NVMe) drives, i.e., hard disks; in addition, the GPU BOX includes the eight GPUs shown in the figure, as well as a network interface controller (NIC), PCIE switches (PCIE SW), and the like. Solid arrows in the figure indicate PCIE connections, and dashed arrows indicate NVLink connections. It should be noted that, although for clarity the GPU part of the first computing node only shows PCIE connections and the GPU part of the second computing node only shows NVLink connections, in practice the GPU part of each computing node includes both PCIE connections and NVLink connections.
The cluster system provided by the embodiments of the present application uses Infiniband (IB), a new I/O bus technology based on full-duplex, switched serial transmission, to replace the MPI communication commonly used in existing cluster systems, which simplifies and speeds up the connections between computing nodes. At the same time, within a computing node the CPUs and GPUs are connected via PCIE, and the GPUs are interconnected via high-speed NVLink, which greatly improves communication between the GPU cards inside a computing node. Since PCIE, NVLink, and Ethernet/Remote Direct Memory Access (RDMA) networks differ widely in bandwidth and latency, the optimal combination of resources needs to be allocated; the HGCP provided by the embodiments of the present application adopts topology-aware scheduling to optimize communication bandwidth.
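The topology-aware choice among links of very different bandwidths can be sketched as follows; the bandwidth figures and the two-tier selection rule are assumptions for illustration, not measured values:

```python
# illustrative bandwidths in GB/s; real values depend on the hardware generation
LINKS = {"nvlink": 300, "pcie": 32, "infiniband": 25, "ethernet": 10}

def best_link(same_node: bool) -> str:
    """Pick the fastest interconnect available for a GPU-to-GPU transfer:
    NVLink/PCIE inside a node, Infiniband/Ethernet across nodes."""
    candidates = ("nvlink", "pcie") if same_node else ("infiniband", "ethernet")
    return max(candidates, key=LINKS.get)

print(best_link(True), best_link(False))  # nvlink infiniband
```

A topology-aware scheduler extends this idea by co-locating communication-heavy GPU groups on the same node so that the faster intra-node links carry most of the traffic.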
D. Performance profiling tools.
Cluster utilization is usually the core evaluation metric: increasing utilization is equivalent to reducing machine usage costs, while also helping business training programs with profiling and achieving good results in performance optimization. However, existing HPC systems have no system-level fine-grained performance analysis tools. To perform performance analysis, the usual approach is to analyze a single node after negotiating with the business line; human intervention is required from startup through data collection to data analysis, and the work must be coordinated with the business line's training launches. Problems can only be analyzed case by case, which is inefficient and unsuitable for large-scale adoption.
The HGCP provided by the embodiments of the present application uses a Deep learning system Performance profiler (Dperf) for performance analysis. Dperf is a system-level, one-stop performance profiling and bottleneck locating system for deep learning training. The tool uniformly captures and coaxially displays the traffic information of key computing nodes on data paths such as NET, IO, H2D, and P2P together with the utilization information of key computing resources such as the CPU, Double Data Rate (DDR) memory, and Graphics Double Data Rate (GDDR) memory, making it convenient for business lines to locate program bottlenecks and optimize accordingly. At the same time, the Dperf training tool is combined with cluster task scheduling to automatically monitor the tasks of the GPU training cluster in a census-like manner. On one hand, this helps cluster administrators understand the usage and bottlenecks of each business line and improve overall cluster utilization. On the other hand, it helps developers monitor resource utilization, guide parameter tuning, and enhance scalability, while also helping locate hardware constraints and tune hardware configurations. The Dperf provided by the embodiments of the present application has the advantages of low overhead, multiple dimensions, easy extensibility, fine granularity, and visualization. For an example, refer to FIG. 4.
FIG. 4 is a schematic diagram of system-level performance constraint analysis of the cluster system provided by an embodiment of the present application. Referring to FIG. 4, the entire deep learning training process involves environment preparation, data reading, data preprocessing, forward training, backward training, and parameter updating. Data storage is constrained by the CPU, main memory, and hard disk IO, while the training process is affected by factors such as uplink and downlink bandwidth and GPU memory. The Dperf system-level performance analysis tool is used to analyze which hardware constrains a program. For example, if data reading and preprocessing take a long time while the system has plenty of spare CPU and disk resources, more data processing processes can be started to increase data processing speed. If the training program spends a long time waiting for training data, data processing and training can be executed asynchronously to reduce the waiting time.
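The asynchronous overlap of data preprocessing and training suggested above can be sketched with a bounded producer/consumer queue; the simple arithmetic stands in for real preprocessing and training steps:

```python
import queue
import threading

def preprocess(batches, q):
    # producer: data reading + preprocessing runs ahead of training
    for b in batches:
        q.put(b * 2)           # stand-in for real preprocessing
    q.put(None)                # sentinel: no more data

def train(q, results):
    # consumer: a training step never waits longer than one queue get
    while (b := q.get()) is not None:
        results.append(b + 1)  # stand-in for a training step

q = queue.Queue(maxsize=4)     # bounded prefetch buffer
results = []
p = threading.Thread(target=preprocess, args=(range(5), q))
t = threading.Thread(target=train, args=(q, results))
p.start(); t.start(); p.join(); t.join()
print(results)  # [1, 3, 5, 7, 9]
```

The bounded queue keeps preprocessing from running arbitrarily far ahead (limiting memory use) while still hiding preprocessing latency behind training.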
E. Computing nodes.
The computing nodes of current HPC systems are limited by the number of GPU cards, communication, power consumption, heat dissipation, and other issues; their computing power density is low and cannot meet the demands of model training tasks. The HGCP provided by the embodiments of the present application uses GPU-equipped computing nodes with high computing density and high heat-dissipation efficiency, and supports systematized hardware modules, standardized interconnect interfaces, and flexible interconnect topologies. It leads the hardware development direction of AI computing, participates in and guides the development of the AI hardware platform, and effectively supports the cluster's AI training tasks.
F. Real-time cluster monitoring.
Current HPC systems lack real-time fine-grained monitoring of each computing node and computing task, and cannot uniformly capture and coaxially display the utilization information of key resources such as the CPU, DDR, GPU, and GDDR. Users and administrators can only log in to the physical nodes to check machine status, or passively learn of faults from the business lines, which greatly affects cluster operating efficiency. In the HGCP provided by the embodiments of the present application, in order to monitor and analyze the operation of the cluster system and collect parameters for system scheduling, a monitoring platform and hardware monitoring plug-ins (Hadoop Authentication Service, HAS) are deployed in the HGCP cluster. Key performance data, such as the CPU, GPU, memory, network, and storage of functional components such as the control node and computing nodes of the HGCP cluster, are collected through real-time monitoring and displayed intuitively in graphical form, so that the operating status of the hardware environment can be understood, potential faults hidden in the HGCP can be discovered in time, and solutions can be provided immediately. For an example, see FIG. 5, which is a schematic diagram of memory monitoring of a computing node of the cluster system provided by an embodiment of the present application. Referring to FIG. 5, the memory occupation of a computing node from 14:40 to 15:40 is shown by the waveform in the figure.
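A toy sketch of the kind of periodic metric collection described above; the values below are random placeholders, whereas a real deployment would read hardware counters on each node:

```python
import random

def sample_metrics(node):
    """Hypothetical collector: one snapshot of a node's key resources."""
    return {"node": node,
            "cpu_util": random.uniform(0, 100),   # percent
            "gpu_util": random.uniform(0, 100),   # percent
            "mem_used_gb": random.uniform(0, 256)}

def poll(nodes, rounds=2):
    """Collect one snapshot per node per round; the monitoring platform
    would timestamp these and render them as waveforms (cf. FIG. 5)."""
    history = []
    for _ in range(rounds):
        history.extend(sample_metrics(n) for n in nodes)
    return history

samples = poll(["node1", "node2"])
print(len(samples))  # 4
```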
G. Cluster operation and maintenance management.
At present, as the scale of HPC clusters continues to expand and computing nodes are continuously added, deploying the standard operating environment on computing nodes becomes a routine, time-consuming, and laborious task. Current HPC systems do not provide an efficient, standard operation and maintenance solution: fault discovery, locating, repair reporting, and restoration all require manual intervention, which is inefficient and wastes effort. Meanwhile, a faulty computing node can be regarded as an idle computing node, and idleness equals waste. Table 1 lists common operation and maintenance operations and their operating times.
Table 1

| Operation | Method | Average operating time |
| --- | --- | --- |
| New machine environment configuration | Automated script | 20 min |
| Bringing a machine online into the cluster | Manual operation | 30 min |
| Cluster resource queue adjustment | Manual operation | 10 min |
| Machine fault repair | Maintenance personnel access | 1 day |
| Fault information statistics | Manual operation | 1 hour |
| Cluster environment upgrade | Manual operation | 1 day |
For the HGCP provided by the embodiments of the present application, the operation and maintenance workflow is sorted out at the beginning of construction: work must be turned into processes, the processes standardized, and the standards automated. At the same time, operation and maintenance automation cannot solve all problems, and automation should not be pursued for its own sake. The 20% of work that is repetitive consumes 80% of the time and effort, so concentrating on doing that 20% well is basically enough to reach a good state. The cluster's automated operation and maintenance tools are designed to manage a large number of computing nodes through a single graphical user interface. The HGCP cluster provided by the embodiments of the present application performs machine management through the super management platform system.
Second, software capability improvement.
FIG. 6 is a flowchart of the model training method provided by an embodiment of the present application. This embodiment describes the model training method of the embodiments of the present application in detail from the perspective of the interaction between the control node and the computing nodes, and includes the following steps:
100. The client on the first terminal sends the resource information required for training the target model to the API server.
101. The control node receives a first request sent by the application programming interface (API) server.
The first request is obtained by the API server from the resource information required for training the target model, which is sent by the first user through the client on the first terminal.
Illustratively, current HPC systems have no user client, and training a model on HPC is cumbersome for the first user: configuring training scripts, accessing training data, and retrieving training results must all be done directly in the HPC system, which does not encapsulate its own functions well and greatly wastes the first user's time. The first user usually refers to a researcher, such as an R&D engineer, who trains models; the model may be any of various AI models, such as a face recognition model or a face detection model, which is not limited in the embodiments of the present application. In the model training method provided by the embodiments of the present application, the HGCP is encapsulated in advance to obtain a client, which is stored on HDFS for the first user to download. The first user downloads and installs the client on the first terminal device, and the client is used to submit training tasks to the HGCP.
102. The control node allocates target resources to the target model according to the resource information.
Training task management in current HPC systems is coarse. Although they support multiple tenants, i.e., can be used by multiple first users at the same time with different first users training different target models, the usage demands of different first users have peaks and troughs. Most existing slurm-based HPC systems use the FIFO queuing mechanism by default, with no priority limits and no support for oversubscription, so some first users' resources sit idle while other first users have no resources available. In the embodiments of the present application, the computing resources of the HGCP include CPUs, GPUs, memory, FPGAs, and the like. A configuration interface is displayed on the display interface of the first terminal device for the first user to configure the number of computing nodes required for training the target model and, for each computing node, which CPUs, GPUs, etc. of that node need to be occupied. The first terminal device generates the resource information required for training the target model according to the configuration input by the user and sends it to the API server; the API server integrates the resource information to generate the first request and sends it to the control node. After receiving the first request sent by the API server, the control node allocates computing resources to the target model according to the first request. For example, if the resource information carried in the first request is 4 computing nodes and 16 GPUs, the control node allocates 4 computing nodes to the target model. Assuming each computing node has 8 GPUs, the 4 computing nodes may each provide 4 GPUs for the target model, or the 4 computing nodes may provide 4, 6, 2, and 4 GPUs, respectively.
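The allocation example above (16 GPUs spread over 4 computing nodes with uneven free capacity) can be sketched with a simple greedy pass; the node names and the first-fit policy are illustrative assumptions, not the claimed allocation method:

```python
def allocate(request_gpus, nodes):
    """nodes maps node name -> free GPU count.
    Returns {node: gpus_taken} or None if the request cannot be satisfied."""
    plan, remaining = {}, request_gpus
    for name, free in nodes.items():
        if remaining == 0:
            break
        take = min(free, remaining)
        if take:
            plan[name] = take
            remaining -= take
    return plan if remaining == 0 else None

# 16 GPUs over 4 nodes whose free capacity happens to be uneven
print(allocate(16, {"node1": 4, "node2": 6, "node3": 2, "node4": 8}))
# {'node1': 4, 'node2': 6, 'node3': 2, 'node4': 4}
```

A real control node would also weigh topology (keeping the request on as few nodes as possible) rather than taking nodes in dictionary order.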
103. The control node sends a second request to a target computing node.

The target computing node is a computing node that contains the target resource.

Exemplarily, after configuring the target resource for the target model, the control node sends a second request to the computing node that contains the target resource, to trigger the target computing node to train the target model.
104. The target computing node uses the target resource to train the target model.

105. The target computing node stores the trained target model in the storage node.

Continuing with the example in step 102, in steps 103 to 105, assume that the target computing nodes providing the 16 GPUs are computing node 1, computing node 2, computing node 3, and computing node 4. These four computing nodes, as target computing nodes, train the target model in a distributed manner; after training completes, each node stores its trained part in the storage node, for example in HDFS.

In the model training method provided by the embodiments of the present application, after receiving the first request sent by the API server, the control node allocates a target resource to the target model according to the first request and sends a second request to the target computing node that contains the target resource, to trigger the target computing node to perform model training and store the trained model in the HDFS system. With this solution, the user submits training tasks through a pre-packaged client, without editing scripts on a command line; the process is simple and greatly improves the efficiency of model training.
In the embodiments of the present application, the software improvements broadly comprise improvements to the system architecture and improvements to the slurm open application programming interface (Open API). The two improvements are described in detail below.

First, the system architecture.
FIG. 7 is a schematic diagram of the system architecture of HGCP in the model training method provided by an embodiment of the present application. Referring to FIG. 7, the HGCP system provided by the embodiments of the present application achieves complete isolation of users and resources. The first user downloads and installs a client from the HDFS system and sends, through the client, the resource information required to train the target model to the API server, so that the API server integrates the resource information into a first request and submits the first request for training the model to the control node. While the target task is running on the target node, the first user can send a query request to the target computing node through the first terminal device; the query request is used to request display of the usage status of the target resource while the target resource on the target computing node trains the target model. After receiving the query request, the target computing node obtains the running status of the training task and collects the data generated during the run; the target computing node then sends a query response carrying the usage status information of the target resource to the first terminal device, so that the first terminal device displays the usage status of the target resource according to the usage status information.
After the target model is trained, the target model is saved to the HDFS system, and the first user or other users can download the final result from the HDFS system. Each module in FIG. 7 is described in detail below.
a. Client.

In the embodiments of the present application, the first user can, from anywhere, download and install the client stored on the HDFS system and send resource information to the API server through the client, so that the API server integrates the resource information into a first request and sends the first request to the control node; one first request can be regarded as one task. The resource information carried in the first request includes at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing nodes train the target model, the number of CPUs occupied when the target computing nodes train the target model, the path of the HDFS system, and the user name or password of the HDFS. The backend corresponding to the client performs operations such as task submission, viewing, termination, and retrieval of training data through the slurm Open API described in the embodiments of the present application; job submission uses an asynchronous submission mode. For an example, see FIG. 8.
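As a concrete illustration of the resource information a client might send, the backend could assemble a payload like the one below. The field names and the JSON encoding are assumptions for illustration only; the disclosure specifies which pieces of information the first request carries, not their wire format.

```python
import json

def build_first_request(nodes, gpus, cpus, hdfs_path, hdfs_user):
    """Assemble the resource information for one training task.

    Hypothetical field names; the embodiment only enumerates the kinds of
    information (node count, GPU/CPU counts, HDFS path and credentials).
    """
    return json.dumps({
        "target_nodes": nodes,        # number of target computing nodes
        "gpus": gpus,                 # GPUs occupied during training
        "cpus": cpus,                 # CPUs occupied during training
        "hdfs_path": hdfs_path,       # where results are stored
        "hdfs_user": hdfs_user,       # HDFS credential (user name)
    })

req = build_first_request(4, 16, 32, "hdfs://cluster/models/job-001", "alice")
print(req)
```

The API server would then integrate such a payload (together with authentication data) into the first request it submits to the control node.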
FIG. 8 is a schematic diagram of the task submission process in the model training method provided by an embodiment of the present application. Referring to FIG. 8, the first user submits a job to the upper layer through the client on the first terminal device; the API server authenticates the request, and the job is stored in the database after authentication passes. The job manager running on the control node then fetches jobs awaiting submission from the database and submits them to HGCP, and the job sync controller (Job SyncController) running on the computing node synchronizes the job running status to the monitor server and the slurm resource management system.
b. HDFS system.

In the HGCP provided by the embodiments of the present application, the HDFS system temporarily stores the user execution environment and stores the final trained model, where the user execution environment is the aforementioned client. In addition, the embodiments of the present application are not limited to the HDFS system; in other feasible implementations, a file system private to the first user may also be used.
c. Resource scheduler.

Exemplarily, the resource scheduler is a module on the control node and is used to allocate the target resource to the target model according to the first request. The granularity of resource allocation is the GPU rather than the computing node: if one model training task of the first user does not use up all the GPUs on a target computing node, the target computing node and its remaining GPUs can be allocated to other training tasks. The scheduler supports mixed scheduling of CPUs and GPUs. For example, when the first user submits a training task whose required resource is GPUs but does not use up all the GPUs, other users can still submit training tasks that use the remaining GPUs.
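The GPU-granularity sharing described above can be sketched as follows. The class, its methods, and the first-fit strategy are hypothetical, since the disclosure does not specify the scheduler's internal algorithm; the point illustrated is only that leftover GPUs on a node remain available to later tasks.

```python
class GpuScheduler:
    """Sketch of a scheduler that allocates at GPU granularity."""

    def __init__(self, node_gpus):
        self.free = dict(node_gpus)  # node name -> free GPU count

    def submit(self, task, gpus_needed):
        """Reserve GPUs for `task`; return the plan, or None to queue it."""
        plan, remaining = {}, gpus_needed
        for node, free in self.free.items():
            if remaining == 0:
                break
            take = min(free, remaining)
            if take:
                plan[node] = take
                remaining -= take
        if remaining:
            return None  # not enough free GPUs; the task must wait
        for node, take in plan.items():
            self.free[node] -= take
        return plan

sched = GpuScheduler({"node1": 8, "node2": 8})
print(sched.submit("job-a", 12))  # spans both nodes: {'node1': 8, 'node2': 4}
print(sched.submit("job-b", 4))   # fits in node2's leftover GPUs
print(sched.submit("job-c", 2))   # nothing left: None (queued)
```

Note that job-b runs on GPUs left over by job-a on node2, which is exactly the sharing behavior the paragraph above describes.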
d. Resources.

In the embodiments of the present application, resources are divided at the granularity of computing nodes and GPUs, and one training task can run on different GPUs of different computing nodes.
Second, the slurm Open API.

e. Overall architecture.

Exemplarily, see FIG. 9, which is a schematic diagram of the slurm Open API in the model training method provided by an embodiment of the present application. Referring to FIG. 9, the architecture includes:
third-party platforms: deep learning platforms such as PaddlePaddle Cloud (paddle cloud);

cluster component: the slurm cluster client;

API server: the unified entrance of the slurm Open API, responsible for route resolution, request processing, and so on;

authentication: the slurm cluster authentication service module;

database: the XDB data platform, which stores data such as user permissions, job information, and queue quotas;

job manager: used for job management control, responsible for job queuing and submission control;

job sync controller: responsible for synchronizing data such as job status, GPU utilization, GPU slots, node rank, and time;

queue sync controller (Queue SyncController): responsible for pushing queue update events (new queue, queue quota update, and so on) to third-party platforms;

node monitoring service (MonitorServer): deployed on each computing node, providing the running data of training jobs on that computing node.
f. Interface authentication.

Open API interface authentication is mainly used to authenticate the identity of a request and judge the legitimacy of the current request. Common methods include token authentication and AK/SK authentication; for interface access security, AK/SK authentication is used herein. In a feasible implementation, the control node receives a management request sent by a second user through a second terminal device, where the management request carries the access key ID of the second user and a first key, and the first key is generated by the second terminal device using a preset authentication mechanism. When the control node calls the cluster Open API to authenticate the second user, the control node generates a second key using the same preset authentication mechanism. If the first key and the second key are the same, the control node determines the management authority of the second user and, according to the management authority, sends the second terminal device a data stream for updating the management platform graphical interface, so that the second terminal device updates and displays the management platform graphical interface, through which the second user manages the cluster system.

Exemplarily, when AK/SK authentication is used, the access key ID identifies the second user, and the first key is, for example, derived from a secret access key (SK), which the second user uses to encrypt the authentication string and the service uses to verify the authentication string; the SK must be kept secret. After receiving the management request sent by the second user, the control node generates an authentication string using the same preset authentication mechanism, referred to below as the second key. The control node then compares the first key in the management request with the generated second key. If the two keys are the same, the control node grants the second user the corresponding management authority and performs the related operations; if the two keys are not the same, the control node ignores the operation and returns an error code to the second terminal device.
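A minimal sketch of this AK/SK check, assuming an HMAC-SHA256 signature over a few request fields: both sides derive an authentication string from the shared SK, and the server compares its recomputed value with the one the client sent. The string-to-sign, key store, and function names are illustrative assumptions; the embodiment does not disclose its exact signature scheme.

```python
import hashlib
import hmac

# Hypothetical server-side key store: access key ID -> secret access key (SK)
SECRET_KEYS = {"AKIDEXAMPLE": b"my-secret-access-key"}

def sign(sk, method, path, timestamp):
    """Derive the authentication string (the 'key' exchanged in the request)."""
    msg = f"{method}\n{path}\n{timestamp}".encode()
    return hmac.new(sk, msg, hashlib.sha256).hexdigest()

def authenticate(access_key_id, first_key, method, path, timestamp):
    """Recompute the second key server-side and compare with the first key."""
    sk = SECRET_KEYS.get(access_key_id)
    if sk is None:
        return False
    second_key = sign(sk, method, path, timestamp)
    return hmac.compare_digest(first_key, second_key)

client_sig = sign(b"my-secret-access-key", "POST", "/v1/jobs", "1600000000")
print(authenticate("AKIDEXAMPLE", client_sig, "POST", "/v1/jobs", "1600000000"))  # True
print(authenticate("AKIDEXAMPLE", client_sig, "POST", "/v1/jobs", "1600000001"))  # False
```

`hmac.compare_digest` is used for the comparison so that the check runs in constant time, which is standard practice for signature verification.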
FIG. 10 is a schematic diagram of the authentication process in the model training method provided by an embodiment of the present application. Referring to FIG. 10, the second user sends the AK/SK to the authentication service on the control node through the client on the second terminal device, and the authentication service returns a token to the second terminal device. The second user then sends a management request and the token to the API service on the control node through the client on the second terminal device, and the API service sends a management response to the second terminal device according to the management request and the token.

In the embodiments of the present application, the second user is an administrator and can be divided into multiple levels, such as cluster administrator, department administrator, and ordinary user. Exemplarily, see Table 2.
Table 2

(Table 2 is provided as images in the original filing; it lists the permissions assigned to each administrator level.)
As can be seen from Table 2, different permissions can be set for different second users.
g. API deployment.

Exemplarily, see FIG. 11, which is a schematic diagram of the deployment of the API server in the model training method provided by an embodiment of the present application. Referring to FIG. 11, for service stability, the API server is deployed on three servers: server1, server2, and server3. Server1 deploys job_manager, job_sync_controller, and 4 apiserver instances; server2 and server3 each deploy 1 nginx instance and 8 apiserver instances. The apiserver instances are bound to nginx, and nginx is in turn bound to BGW.
h. Super-administrator platform.

In the embodiments of the present application, a super-administrator platform is set up for HGCP to perform machine management, cluster management, and so on, and mainly provides administrators and users with the following main features:
1) Convenient management: through the HGCP super-administrator platform, the administrator can bring online, pause, start, restart, or take offline any selected node; the administrator can also select computing nodes in batches and, with a single mouse click, issue a command to the selected nodes in broadcast form;

2) Modularity: the super-administrator platform runs on a Linux server and uses a MySQL database to store statistics, monitoring, configuration, log, and other data; the backend is integrated into modules with general functions, developed in Hypertext Preprocessor (PHP), Python, Ansible, and Shell, and operates on database data and computing nodes through the platform's API interface; the front-end display pages serve cluster administrators and ordinary users, simplifying operations as much as possible and improving efficiency;

3) Efficient concurrency: for environment installation and software upgrades on computing nodes, the administrator can issue standard environment configuration packages to all or some of the nodes in the cluster;

4) Reliability: the HGCP provided by the embodiments of the present application is equipped with multiple control nodes to guarantee the disaster tolerance and service continuity of the management system; these control nodes remotely manage the cluster computing nodes through Ansible to perform environment configuration, upgrade adjustment, system checks, and other operations.
During management of the cluster system, the control node receives a management request sent by the second user through the second terminal device, where the management request is used to request management of the computing nodes in the cluster system and is obtained by the second terminal device according to the user's operations on the management platform graphical interface. The control node then calls the cluster Open API to authenticate the second user; if the second user passes authentication, the control node manages the computing nodes in the cluster system according to the management request. Exemplarily, refer to FIG. 12, which is a working diagram of the super-administrator platform in the model training method provided by an embodiment of the present application. Referring to FIG. 12, the cluster Open API includes a cluster management API and a machine management API. The screen of the second terminal device displays the management platform graphical interface of the super-administrator platform; the cluster administrator performs cluster operations through the graphical interface, which calls down into the cluster management API or the machine management API. When the cluster management API is called, the management request is used to request creation or deletion of a cluster; based on the call, the cluster information in the database is configured, and the underlying cluster management module (cluster_manager) detects the new operation task in the database and begins the related operations. When the machine management API is called, the management request is used to request that any one of the at least one computing node perform any of the following operations: going online, going offline, restart, reinstallation, repair, or shielding; based on the call, the node information in the database is configured, and the node management module (node_manager) detects the new operation task in the database and begins the related operations.

Next, the operations on a cluster and the operations on a node are described in detail.
Operations on a cluster include:

1. Creating a cluster.

During creation, the parameters are first validated, including whether the cluster already exists and whether the bring-online parameters are legal. The cluster task is then written into the cluster operation task table (cluster_task), with the task operation (task_op) set to install and the task status (task_status) set to pending; the actual bring-online operation is finally completed by the cluster manager.
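The enqueue step of this create-cluster flow can be sketched as follows, using an in-memory SQLite database in place of the actual database; the table layouts and the validation performed are assumptions based on the description above, and the cluster manager is the separate process that would later consume the pending row.

```python
import sqlite3

# Minimal stand-ins for the cluster_info and cluster_task tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cluster_info (name TEXT PRIMARY KEY)")
db.execute("CREATE TABLE cluster_task (name TEXT, task_op TEXT, task_status TEXT)")

def create_cluster(name):
    """Validate, then enqueue a pending 'install' task for the cluster manager."""
    exists = db.execute(
        "SELECT 1 FROM cluster_info WHERE name = ?", (name,)
    ).fetchone()
    if exists:
        raise ValueError(f"cluster {name!r} already exists")
    db.execute(
        "INSERT INTO cluster_task VALUES (?, 'install', 'pending')", (name,)
    )

create_cluster("hgcp-a100")
print(db.execute("SELECT * FROM cluster_task").fetchall())
# -> [('hgcp-a100', 'install', 'pending')]
```

The same write-a-pending-row pattern applies to deletion (task_op set to uninstall) and to the node_task table used by the machine operations below.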
2. Deleting a cluster.

For cluster security, the administrator must first take all machines in the cluster offline before the cluster can be deleted. During deletion, the parameters are first validated, including whether the cluster exists, whether the cluster still has running machines, and whether the parameters are legal. The cluster task is then written into the cluster operation task table (cluster_task), with the task operation (task_op) set to uninstall and the task status (task_status) set to pending; the actual take-offline operation is finally completed by the cluster manager.
3. Cluster basic information list.

The cluster basic information list is built from the cluster_info table and the cluster_task table: the cluster_info table contains information on clusters that are already online and running, and the cluster_task table contains information on clusters that are still in a workflow. If the two tables refer to the same cluster, for example when a take-offline operation is in progress, the status in the cluster_task table takes precedence.
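This precedence rule can be sketched as a small helper; the row shapes and field names are hypothetical.

```python
def effective_status(cluster, info_rows, task_rows):
    """Resolve a cluster's displayed status across the two tables.

    If the cluster has a row in cluster_task (an operation in progress),
    that status wins; otherwise fall back to cluster_info.
    """
    for row in task_rows:          # cluster_task takes precedence
        if row["name"] == cluster:
            return row["task_status"]
    for row in info_rows:          # otherwise the online record
        if row["name"] == cluster:
            return row["status"]
    return None                    # unknown cluster

info = [{"name": "c1", "status": "running"}]
tasks = [{"name": "c1", "task_status": "pending"}]
print(effective_status("c1", info, tasks))  # pending (operation in progress)
print(effective_status("c1", info, []))     # running
```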
4. Cluster details list.

In the embodiments of the present application, only clusters in the running state can call the details interface. The cluster_info table contains the clusters that are already online, and the node_info table aggregates the required information.
5. Cluster machine list display.

The cluster machine list is built from the node_info table and the node_task table: the node_info table provides the list of machines that are already online, and the node_task table provides the list of machines that are still in a workflow.
6. Bringing a machine online.

In the embodiments of the present application, bringing a machine online is an operation whose effect is to expand the capacity of the cluster system. During the bring-online process, the parameters are first validated, including whether the cluster exists and whether the bring-online parameters are legal. The bring-online task is then written into the node operation task table (node_task), with the task operation (task_op) set to install and the task status (task_status) set to pending; the information of the node to be brought online is written into the node information table (node_info) with its state marked as installing. The node manager finally completes the actual bring-online operation and updates the task and info tables.
7. Taking a machine offline.

In the embodiments of the present application, when a machine is taken offline, it is automatically marked as unschedulable first, and the take-offline workflow is then executed. During the take-offline process, the parameters are first validated, including whether the cluster exists and whether the take-offline parameters are legal. The node information table (node_info) is then queried; if the node's earlier workflow failed, the node is simply deleted from the node_info table. Otherwise, the take-offline task is written into the node operation task table (node_task), with the task operation (task_op) set to uninstall and the task status (task_status) set to pending, and the node information is written into the node information table (node_info). The node manager finally completes the actual take-offline operation.
8. Changing a machine's shielded state.

During the change, the parameters are first validated, including whether the machine has joined the cluster and whether the parameters are legal. The cluster information table (cluster_info) is then queried to obtain the cluster's apiserver address, and the apiserver interface is called to complete the state shielding.
9. Changing a machine's ownership.

During the change, the parameters are first validated, including whether the machine has joined the cluster and whether the parameters are legal. The cluster information table (cluster_info) is then queried to obtain the cluster's apiserver address, and the apiserver interface is called to complete the change of the machine's ownership label.
The foregoing describes specific implementations of the model training method mentioned in the embodiments of the present application. The following are apparatus embodiments of the present application, which can be used to execute the method embodiments of the present application. For details not disclosed in the apparatus embodiments of the present application, refer to the method embodiments of the present application.

FIG. 13 is a schematic structural diagram of a model training apparatus provided by an embodiment of the present application. The apparatus can be integrated in an electronic device or implemented by an electronic device, and the electronic device can be a terminal device, a server, or the like. As shown in FIG. 13, in this embodiment, the model training apparatus 100 may include:
a receiving unit 11, configured to receive a first request sent by an application programming interface (API) server, where the first request is obtained by the API server according to the resource information, required for training a target model, sent by a first user through a client on a first terminal;

a processing unit 12, configured to allocate a target resource to the target model according to the resource information; and

a sending unit 13, configured to send a second request to a target computing node, so that the target computing node uses the target resource to train the target model.
In a feasible design, the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing nodes train the target model, and the number of CPUs occupied when the target computing nodes train the target model.
In a feasible design, the receiving unit 11 is further configured to receive a management request sent by a second terminal device, where the management request is used to request management of the computing nodes in the cluster system;

the processing unit 12 is further configured to manage the computing nodes in the cluster system according to the management request.

In a feasible design, when managing the computing nodes in the cluster system according to the management request, the processing unit 12 calls the cluster Open API to authenticate a second user, and if the second user passes authentication, manages the computing nodes in the cluster system according to the management request.
In a feasible design, the management request carries the access key ID of the second user and a first key, where the first key is generated by the second terminal device using a preset authentication mechanism; the processing unit 12 is configured to call the cluster Open API, generate a second key using the preset authentication mechanism, and determine the management authority of the second user if the first key and the second key are the same;

the sending unit 13 is further configured to send authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
In a feasible design, the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster;

or,

the cluster Open API includes a machine management API, and the management request is used to request that any one of the at least one computing node perform any of the following operations: going online, going offline, restart, reinstallation, repair, or shielding.
The apparatus provided by the embodiments of the present application can be used in the method executed by the control node in the foregoing embodiments; its implementation principles and technical effects are similar and are not repeated here.

FIG. 14 is a schematic structural diagram of a model training apparatus provided by an embodiment of the present application. The apparatus can be integrated in an electronic device or implemented by an electronic device, and the electronic device can be a terminal device, a server, or the like. As shown in FIG. 14, in this embodiment, the model training apparatus 200 may include:
接收单元21,用于接收控制节点发送的第二请求,所述第二请求是所述控制节点接收到应用程序接口API服务器发送的第一请求并为目标模型分配目标资源后发送的,所述第一请求是所述API服务器根据第一用户通过第一终端上的客户端发送的训练目标模型所需的资源信息得到的,所述目标节点包含于所述至少一个计算节点;The receiving unit 21 is configured to receive a second request sent by the control node. The second request is sent after the control node receives the first request sent by the application program interface API server and allocates target resources to the target model. The first request is obtained by the API server according to the resource information required for training the target model sent by the first user through the client on the first terminal, and the target node is included in the at least one computing node;
处理单元22,用于使用所述目标资源训练所述目标模型;The processing unit 22 is configured to use the target resource to train the target model;
发送单元23,用于将训练好的目标模型发送至存储节点。The sending unit 23 is used to send the trained target model to the storage node.
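Read together, the three units give the target computing node a receive-train-upload lifecycle for one job. The following is a schematic sketch under stated assumptions: the request fields, the training stub, and the storage-node client are all placeholders, not the patented implementation.

```python
from dataclasses import dataclass

@dataclass
class SecondRequest:
    """Hypothetical shape of the control node's dispatch message."""
    model_id: str
    gpus: int
    cpus: int

class TargetComputeNode:
    def __init__(self, storage):
        self.storage = storage  # stand-in storage-node client

    def handle(self, req: SecondRequest) -> None:
        # receiving unit 21 has delivered `req`; processing unit 22 trains,
        # sending unit 23 uploads the result to the storage node.
        model = self.train(req)
        self.storage[req.model_id] = model

    def train(self, req: SecondRequest) -> dict:
        # Stand-in for a real training loop pinned to the allocated
        # GPUs/CPUs of the target resources.
        return {"model_id": req.model_id, "trained_on_gpus": req.gpus}

storage_node = {}
node = TargetComputeNode(storage_node)
node.handle(SecondRequest(model_id="resnet50", gpus=8, cpus=32))
print(storage_node["resnet50"]["trained_on_gpus"])  # 8
```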
In one feasible design, the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
In one feasible design, the receiving unit 21 is further configured to receive a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resources on the target computing node while the target model is trained with the target resources;
the sending unit 23 is further configured to send a query response to the first terminal device, where the query response carries usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
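The query/response exchange amounts to the compute node reporting per-resource utilization back to the user's terminal for display. A minimal sketch follows; every field name and the fixed sample values are assumptions (a real node would sample `nvidia-smi`, procfs, or similar):

```python
def build_query_response(job_id: str) -> dict:
    """Hypothetical usage-status payload: the first terminal device renders
    this so the user can watch GPU/CPU consumption while training runs."""
    return {
        "job_id": job_id,
        "gpu_utilization": [0.97, 0.95, 0.96, 0.98],  # one entry per GPU
        "cpu_utilization": 0.62,
        "memory_gb_used": 48.5,
    }

resp = build_query_response("train-42")
print(len(resp["gpu_utilization"]))  # 4
```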
The apparatus provided in this embodiment of the present application can be used to perform the method executed by the target computing node in the foregoing embodiments. Its implementation principles and technical effects are similar and are not repeated here.
FIG. 15 is a block diagram of an electronic device for implementing the model training method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementations of the application described and/or claimed herein.
As shown in FIG. 15, the electronic device includes one or more processors 31, a memory 32, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as needed. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to an interface). In other implementations, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, with each device providing part of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system). One processor 31 is taken as an example in FIG. 15.
The memory 32 is the non-transitory computer-readable storage medium provided by this application. The memory stores instructions executable by at least one processor, so that the at least one processor performs the model training method provided by this application. The non-transitory computer-readable storage medium of this application stores computer instructions for causing a computer to perform the model training method provided by this application.
As a non-transitory computer-readable storage medium, the memory 32 may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the model training method in the embodiments of this application (for example, the receiving unit 11, the processing unit 12, and the sending unit 13 shown in FIG. 13, and the receiving unit 21, the processing unit 22, and the sending unit 23 shown in FIG. 14). By running the non-transitory software programs, instructions, and modules stored in the memory 32, the processor 31 performs the various functional applications and data processing of the server, that is, implements the model training method in the foregoing method embodiments.
The memory 32 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the electronic device for model training, and the like. In addition, the memory 32 may include a high-speed random access memory, and may further include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 32 may optionally include memories disposed remotely relative to the processor 31, and these remote memories may be connected to the electronic device for model training through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the model training method may further include an input device 33 and an output device 34. The processor 31, the memory 32, the input device 33, and the output device 34 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 15.
The input device 33 may receive input digital or character information and generate key signal inputs related to user settings and function control of the electronic device for model training, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a trackball, or a joystick. The output device 34 may include a display device, an auxiliary lighting device (for example, an LED), a haptic feedback device (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described herein may be realized in digital electronic circuitry, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and/or apparatus (for example, magnetic disks, optical disks, memories, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (for example, a CRT (cathode-ray tube) or LCD (liquid-crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, speech input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, as a data server), or that includes a middleware component (for example, an application server), or that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with an implementation of the systems and techniques described herein), or any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other.
An embodiment of this application further provides a cluster system, including a control node and at least one computing node, where the control node establishes a network connection with each of the at least one computing node based on the Transmission Control Protocol (TCP), and the computing resources of the computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
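The control-plane wiring described here, a control node holding one TCP connection per compute node, can be sketched with the standard library. The addresses, port, message framing, and acknowledgment format below are illustrative assumptions; a single `recv` suffices only because the message is short and sent over loopback.

```python
import socket
import threading

PORT = 53521  # arbitrary loopback port chosen for this sketch
ready = threading.Event()

def compute_node() -> None:
    """Minimal compute-node stub: accepts the control node's TCP
    connection and acknowledges one dispatch message."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("127.0.0.1", PORT))
        srv.listen(1)
        ready.set()  # signal the control node that we are listening
        conn, _ = srv.accept()
        with conn:
            msg = conn.recv(1024)
            conn.sendall(b"ACK:" + msg)

t = threading.Thread(target=compute_node, daemon=True)
t.start()
ready.wait(timeout=5)

# Control-node side: one TCP connection per compute node.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as ctl:
    ctl.connect(("127.0.0.1", PORT))
    ctl.sendall(b"train model resnet50")
    reply = ctl.recv(1024).decode()
t.join(timeout=5)
print(reply)  # ACK:train model resnet50
```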
According to the technical solutions provided by the embodiments of this application, introducing GPUs as computing resources greatly improves the hardware capability of the cluster system and thereby the efficiency of model training; on the software side, the Slurm framework is optimized, and a client, a super management platform, and the like are introduced, making the cluster system more convenient to use.
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in this application may be performed in parallel, sequentially, or in a different order, as long as the desired result of the technical solutions disclosed in this application can be achieved; no limitation is imposed herein.
The foregoing specific implementations do not constitute a limitation on the protection scope of this application. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this application shall be included in the protection scope of this application.

Claims (23)

  1. A cluster system, comprising: a control node, at least one computing node, and a storage node; wherein
    the control node establishes a connection with each of the at least one computing node and is configured to allocate computing resources for a task of training a target model;
    the computing node comprises at least one central processing unit (CPU) and at least one graphics processing unit (GPU) and is configured to train the target model by using the computing resources; and
    the storage node establishes a network connection with each of the at least one computing node and is configured to store data required for training the target model.
  2. The system according to claim 1, wherein
    any two of the at least one computing node establish a network connection with each other based on the InfiniBand interconnect technology, a CPU and a GPU inside a computing node are connected through Peripheral Component Interconnect Express (PCIe), and GPUs inside a computing node are connected through NVLink.
  3. A model training method, applicable to a cluster system comprising a control node, at least one computing node, and a storage node, the method comprising:
    receiving, by the control node, a first request sent by an application programming interface (API) server, the first request being obtained by the API server according to resource information, required for training a target model, that a first user sends through a client on a first terminal;
    allocating, by the control node, target resources to the target model according to the resource information; and
    sending, by the control node, a second request to a target computing node, so that the target computing node trains the target model by using the target resources.
  4. The method according to claim 3, wherein the resource information comprises at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
  5. The method according to claim 3 or 4, further comprising:
    receiving, by the control node, a management request sent by a second terminal device, the management request being used to request management of the computing nodes in the cluster system; and
    managing, by the control node, the computing nodes in the cluster system according to the management request.
  6. The method according to claim 5, wherein the managing, by the control node, the computing nodes in the cluster system according to the management request comprises:
    calling, by the control node, a cluster open application programming interface (Open API) to authenticate a second user; and
    if the second user passes the authentication, managing, by the control node, the computing nodes in the cluster system according to the management request.
  7. The method according to claim 6, wherein the management request carries an access key identifier of the second user and a first key, the first key being generated by the second terminal device using a preset authentication mechanism, and the calling, by the control node, the cluster Open API to authenticate the second user comprises:
    calling, by the control node, the cluster Open API, and generating a second key by using the preset authentication mechanism;
    if the first key and the second key are the same, determining, by the control node, the management authority of the second user; and
    sending, by the control node, authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
  8. The method according to claim 6, wherein
    the cluster Open API comprises a cluster management API, and the management request is used to request creation or deletion of a cluster;
    or
    the cluster Open API comprises a machine management API, and the management request is used to request that any one of the at least one computing node perform any one of the following operations: bringing online, taking offline, restarting, reinstalling, repairing, and isolating.
  9. A model training method, applicable to a cluster system comprising a control node, at least one computing node, and a storage node, the method comprising:
    receiving, by a target computing node, a second request sent by the control node, the second request being sent after the control node receives a first request sent by an application programming interface (API) server and allocates target resources to a target model, the first request being obtained by the API server according to resource information, required for training the target model, that a first user sends through a client on a first terminal, and the target computing node being included in the at least one computing node;
    training, by the target computing node, the target model by using the target resources; and
    sending, by the target computing node, the trained target model to the storage node.
  10. The method according to claim 9, wherein the resource information comprises at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
  11. The method according to claim 9 or 10, further comprising:
    receiving, by the target computing node, a query request sent by the first terminal device, the query request being used to request display of the usage status of the target resources on the target computing node while the target model is trained with the target resources; and
    sending, by the target computing node, a query response to the first terminal device, the query response carrying usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
  12. A model training apparatus, comprising:
    a receiving unit, configured to receive a first request sent by an application programming interface (API) server, the first request carrying resource information required for training a target model and being obtained by the API server according to the resource information, required for training the target model, that a first user sends through a client on a first terminal;
    a processing unit, configured to allocate target resources to the target model according to the resource information; and
    a sending unit, configured to send a second request to a target computing node, so that the target computing node trains the target model by using the target resources.
  13. The apparatus according to claim 12, wherein the resource information comprises at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
  14. The apparatus according to claim 12 or 13, wherein
    the receiving unit is further configured to receive a management request sent by a second terminal device, the management request being used to request management of computing nodes in a cluster system; and
    the processing unit is further configured to manage the computing nodes in the cluster system according to the management request.
  15. The apparatus according to claim 14, wherein
    when managing the computing nodes in the cluster system according to the management request, the processing unit calls a cluster open application programming interface (Open API) to authenticate a second user, and if the second user passes the authentication, manages the computing nodes in the cluster system according to the management request.
  16. The apparatus according to claim 15, wherein
    the management request carries an access key identifier of the second user and a first key, the first key being generated by the second terminal device using a preset authentication mechanism; the processing unit is configured to call the cluster Open API, generate a second key by using the preset authentication mechanism, and, if the first key and the second key are the same, determine the management authority of the second user; and
    the sending unit is further configured to send authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
  17. The apparatus according to claim 15, wherein
    the cluster Open API comprises a cluster management API, and the management request is used to request creation or deletion of a cluster;
    or
    the cluster Open API comprises a machine management API, and the management request is used to request that any one of the at least one computing node perform any one of the following operations: bringing online, taking offline, restarting, reinstalling, repairing, and isolating.
  18. A model training apparatus, comprising:
    a receiving unit, configured to receive a second request sent by a control node, the second request being sent after the control node receives a first request sent by an application programming interface (API) server and allocates target resources to a target model, the first request being obtained by the API server according to resource information, required for training the target model, that a first user sends through a client on a first terminal, and the target node being included in at least one computing node;
    a processing unit, configured to train the target model by using the target resources; and
    a sending unit, configured to send the trained target model to a storage node.
  19. The apparatus according to claim 18, wherein the resource information comprises at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
  20. The apparatus according to claim 18 or 19, wherein
    the receiving unit is further configured to receive a query request sent by the first terminal device, the query request being used to request display of the usage status of the target resources on the target computing node while the target model is trained with the target resources; and
    the sending unit is further configured to send a query response to the first terminal device, the query response carrying usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
  21. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 3-8, or to enable the at least one processor to perform the method according to any one of claims 9-11.
  22. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 3-8; or the computer instructions are used to cause the computer to perform the method according to any one of claims 9-11.
  23. A cluster system, comprising: a control node and at least one computing node, wherein
    the control node establishes a network connection with each of the at least one computing node based on the Transmission Control Protocol (TCP); and
    the computing resources of the computing node comprise at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
PCT/CN2020/117723 2020-02-05 2020-09-25 Model training method and apparatus, and clustering system WO2021155667A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010080825.4A CN111327692A (en) 2020-02-05 2020-02-05 Model training method and device and cluster system
CN202010080825.4 2020-02-05

Publications (1)

Publication Number Publication Date
WO2021155667A1 true WO2021155667A1 (en) 2021-08-12

Family

ID=71172573

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/117723 WO2021155667A1 (en) 2020-02-05 2020-09-25 Model training method and apparatus, and clustering system

Country Status (2)

Country Link
CN (1) CN111327692A (en)
WO (1) WO2021155667A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111327692A (en) * 2020-02-05 2020-06-23 北京百度网讯科技有限公司 Model training method and device and cluster system
CN111984744B (en) * 2020-08-13 2021-03-19 北京陌陌信息技术有限公司 Information processing method based on remote communication and artificial intelligence and cloud service platform
CN112087506B (en) * 2020-09-01 2023-02-07 北京火山引擎科技有限公司 Cluster node management method and device and computer storage medium
CN112241321A (en) * 2020-09-24 2021-01-19 北京影谱科技股份有限公司 Computing power scheduling method and device based on Kubernetes
CN113033098B (en) * 2021-03-26 2022-05-17 山东科技大学 Ocean target detection deep learning model training method based on AdaRW algorithm
CN113159284A (en) * 2021-03-31 2021-07-23 华为技术有限公司 Model training method and device
CN114584455B (en) * 2022-03-04 2023-06-30 吉林大学 Small and medium-sized high-performance cluster monitoring system based on enterprise WeChat

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480717A (en) * 2017-08-16 2017-12-15 北京奇虎科技有限公司 Training job processing method and system, computing device, and computer-readable storage medium
CN107766148A (en) * 2017-08-31 2018-03-06 北京百度网讯科技有限公司 Heterogeneous cluster and task processing method and device
CN108564164A (en) * 2018-01-08 2018-09-21 中山大学 Parallelized deep learning method based on the Spark platform
US20180314926A1 (en) * 2017-04-28 2018-11-01 Intel Corporation Smart memory handling and data management for machine learning networks
CN109086134A (en) * 2018-07-19 2018-12-25 郑州云海信息技术有限公司 Method and device for running deep learning jobs
CN109409738A (en) * 2018-10-25 2019-03-01 平安科技(深圳)有限公司 Method and electronic device for deep learning based on a blockchain platform
CN110018817A (en) * 2018-01-05 2019-07-16 中兴通讯股份有限公司 Distributed data operation method and device, storage medium and processor
CN110413294A (en) * 2019-08-06 2019-11-05 中国工商银行股份有限公司 Service delivery system, method, apparatus and equipment
CN111327692A (en) * 2020-02-05 2020-06-23 北京百度网讯科技有限公司 Model training method and device and cluster system

Also Published As

Publication number Publication date
CN111327692A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
WO2021155667A1 (en) Model training method and apparatus, and clustering system
US11150952B2 (en) Accelerating and maintaining large-scale cloud deployment
JP7421511B2 (en) Methods and apparatus, electronic devices, readable storage media and computer programs for deploying applications
US9606824B2 (en) Administering virtual machines in a distributed computing environment
US10255097B2 (en) Administering virtual machines in a distributed computing environment
JP7170768B2 (en) Development machine operation task processing method, electronic device, computer readable storage medium and computer program
US9503515B2 (en) Administering virtual machines in a distributed computing environment
US9612857B2 (en) Administering virtual machines in a distributed computing environment
US9703587B2 (en) Administering virtual machines in a distributed computing environment
US8977752B2 (en) Event-based dynamic resource provisioning
WO2013135016A1 (en) Version construction system and method
CN114579250A (en) Method, device and storage medium for constructing virtual cluster
CN110019059B (en) Timing synchronization method and device
Li et al. Improving spark performance with zero-copy buffer management and RDMA
Liu et al. BSPCloud: A hybrid distributed-memory and shared-memory programming model
WO2021174791A1 (en) Task migration method and apparatus, and electronic device and storage medium
Zhou et al. Software-defined streaming-based code scheduling for transparent computing
CN115242786B (en) Multi-mode big data job scheduling system and method based on container cluster
Lascu et al. IBM zEnterprise EC12 technical guide
Li et al. Collaborative Management System Driven by Task Flow in Supercomputing Environment
Zou et al. Structural finite element method based on cloud computing
KR20210043523A (en) Data mining system, method, apparatus, electronic device and storage medium
CN117742891A (en) Virtual machine creation method, device and equipment with vDPA equipment and storage medium
CN118260036A (en) Method, system and medium for processing Flink operation
CN112783610A (en) Saltstack-based Ceph deployment host node

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20917276

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20917276

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15/03/2023)
