WO2021155667A1 - Model training method and apparatus, and clustering system - Google Patents

Model training method and apparatus, and clustering system Download PDF

Info

Publication number
WO2021155667A1
WO2021155667A1 · PCT/CN2020/117723
Authority
WO
WIPO (PCT)
Prior art keywords
target
node
request
cluster
computing
Prior art date
Application number
PCT/CN2020/117723
Other languages
French (fr)
Chinese (zh)
Inventor
骆宝童
丁瑞全
张恒华
胡在斌
黄凯文
李志�
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司 filed Critical 北京百度网讯科技有限公司
Publication of WO2021155667A1 publication Critical patent/WO2021155667A1/en

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • This application relates to the technical field of Artificial Intelligence (AI), and in particular to a model training method, device and cluster system.
  • the overall structure of HPC can be divided into the following main parts: external network, master node, compute node, storage, computation network, and management network.
  • the computing resources of the computing nodes include single-core central processing units (central processing unit, CPU), multi-core CPUs, or multi-CPUs.
  • the computing resources of a single computing node are mainly CPU-based, and the hardware capabilities are limited.
  • the above-mentioned HPC uses deep learning to train AI models with low efficiency.
  • the embodiments of the present application provide a model training method, device, and cluster system, which use computing nodes with GPU cards to improve the hardware capabilities of the cluster system, thereby improving the efficiency of model training.
  • an embodiment of the present application provides a cluster system, including: a control node, at least one computing node, and a storage node. The control node establishes a connection with each of the at least one computing node and is configured to allocate computing resources for the task of training the target model; each computing node includes at least one central processing unit (CPU) and at least one graphics processing unit (GPU) and is configured to use the computing resources to train the target model; the storage node establishes a network connection with each of the at least one computing node and is configured to store the data required for training the target model.
  • any two computing nodes in the at least one computing node establish a network connection based on the InfiniBand (infinite bandwidth) interconnection technology; the CPU and GPU inside a computing node are connected through PCIe (high-speed Peripheral Component Interconnect Express), and the GPUs inside a computing node are connected through NVLink.
  • an embodiment of the present application provides a model training method, which is applicable to a cluster system including a control node, at least one computing node, and a storage node.
  • the method includes: the control node receives a first request sent by the application program interface (API) server, where the first request is obtained by the API server according to the resource information required for training the target model, sent by the first user through the client on the first terminal; the control node allocates a target resource to the target model according to the resource information; and the control node sends a second request to the target computing node, so that the target computing node uses the target resource to train the target model.
  • the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied by a target computing node when training the target model, and the number of CPUs occupied by a target computing node when training the target model.
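The resource information above can be pictured as a small structured payload. The following is an illustrative sketch, not from the patent, of how a client might assemble it before submission to the API server; the field names (`num_target_nodes`, etc.) are assumptions:

```python
import json

def build_resource_info(num_nodes, gpus_per_node, cpus_per_node):
    """Assemble the resource information carried in the first request.
    Field names are hypothetical; the patent only lists the three quantities."""
    return {
        "num_target_nodes": num_nodes,   # number of target computing nodes
        "gpus_per_node": gpus_per_node,  # GPUs occupied per node during training
        "cpus_per_node": cpus_per_node,  # CPUs occupied per node during training
    }

# A training job asking for 2 nodes, each using 8 GPUs and 3 CPUs.
request_body = json.dumps(build_resource_info(2, 8, 3))
print(request_body)
```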
  • the above method further includes: the control node receives a management request sent by the second terminal device, where the management request is used to request management of the computing nodes in the cluster system, and the control node manages the computing nodes in the cluster system according to the management request.
  • the control node managing the computing nodes in the cluster system according to the management request includes: the control node calls the cluster open application program interface (Open API) to authenticate the second user; if the second user passes the authentication, the control node manages the computing nodes in the cluster system according to the management request.
  • the management request carries the access key identifier of the second user and a first key, where the first key is generated by the second terminal device using a preset authentication mechanism.
  • the control node calling the cluster Open API to authenticate the second user includes: the control node calls the cluster Open API and uses the preset authentication mechanism to generate a second key; if the first key is the same as the second key, the control node determines the management authority of the second user and sends authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
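The key-comparison flow above can be sketched as follows. The patent does not name the preset authentication mechanism; this illustration assumes an HMAC-SHA256 signature over the request content, with the control node looking up the shared secret via the access key identifier:

```python
import hashlib
import hmac

# Assumed mechanism (not specified in the patent): HMAC-SHA256.
def sign_request(secret_key: bytes, request_content: bytes) -> str:
    """Terminal side: generate the first key from the request content."""
    return hmac.new(secret_key, request_content, hashlib.sha256).hexdigest()

def authenticate(secret_key: bytes, request_content: bytes, first_key: str) -> bool:
    """Control node side: regenerate a second key with the same mechanism
    and compare it with the first key carried in the management request."""
    second_key = hmac.new(secret_key, request_content, hashlib.sha256).hexdigest()
    return hmac.compare_digest(first_key, second_key)

secret = b"per-user-secret"             # looked up via the access key identifier
content = b"DELETE /cluster/42"         # the management request being signed
first_key = sign_request(secret, content)
print(authenticate(secret, content, first_key))  # matching keys: authenticated
```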
  • the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster; or, the cluster Open API includes a machine management API, and the management request is used to request that any one of the at least one computing node perform any one of the following operations: going online, going offline, restarting, reinstalling, repairing, and shielding.
  • an embodiment of the present application provides a model training method, which is applicable to a cluster system including a control node, at least one computing node, and a storage node.
  • the method includes: a target computing node receives a second request sent by the control node, where the second request is sent after the control node receives the first request sent by the application program interface (API) server and allocates target resources to the target model.
  • the first request is obtained by the API server according to the resource information required for training the target model, sent by the first user through the client on the first terminal.
  • the target computing node is included in the at least one computing node; the target computing node uses the target resources to train the target model and sends the trained target model to the storage node.
  • the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied by a target computing node when training the target model, and the number of CPUs occupied by a target computing node when training the target model.
  • the above method further includes: the target computing node receives a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resources on the target computing node when training the target model;
  • the target computing node sends a query response to the first terminal device, where the query response carries the usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
  • an embodiment of the present application provides a model training device, including:
  • a receiving unit, configured to receive a first request sent by the application program interface (API) server, where the first request carries the resource information required for training the target model and is obtained by the API server according to the resource information sent by the first user through the client on the first terminal;
  • a processing unit, configured to allocate target resources to the target model according to the resource information; and
  • a sending unit, configured to send a second request to a target computing node, so that the target computing node uses the target resources to train the target model.
  • the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied by a target computing node when training the target model, and the number of CPUs occupied by a target computing node when training the target model.
  • the receiving unit is further configured to receive a management request sent by a second terminal device, where the management request is used to request management of computing nodes in the cluster system;
  • the processing unit is further configured to manage the computing nodes in the cluster system according to the management request.
  • when managing the computing nodes in the cluster system according to the management request, the processing unit calls the cluster open application program interface (Open API) to authenticate the second user; if the second user passes the authentication, the processing unit manages the computing nodes in the cluster system according to the management request.
  • the management request carries the access key identifier of the second user and the first key, and the first key is generated by the second terminal device using a preset authentication mechanism
  • the processing unit is configured to call the cluster Open API and use the preset authentication mechanism to generate a second key, and if the first key and the second key are the same, determine the management authority of the second user; the sending unit is further configured to send authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
  • the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster; or, the cluster Open API includes a machine management API, and the management request is used to request that any one of the at least one computing node perform any one of the following operations: going online, going offline, restarting, reinstalling, repairing, and shielding.
  • an embodiment of the present application provides a model training device, including:
  • a receiving unit, configured to receive a second request sent by the control node.
  • the second request is sent after the control node receives the first request sent by the application program interface (API) server and allocates target resources to the target model.
  • the first request is obtained by the API server according to the resource information required for training the target model, sent by the first user through the client on the first terminal, and the target node is included in the at least one computing node;
  • a processing unit, configured to use the target resources to train the target model; and
  • a sending unit, configured to send the trained target model to the storage node.
  • the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied by a target computing node when training the target model, and the number of CPUs occupied by a target computing node when training the target model.
  • the receiving unit is further configured to receive a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resources on the target computing node when training the target model;
  • the sending unit is further configured to send a query response to the first terminal device, where the query response carries the usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
  • an embodiment of the present application provides an electronic device, including:
  • at least one processor; and
  • a memory communicatively connected with the at least one processor; wherein,
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method in the second aspect or any possible implementation of the second aspect.
  • an embodiment of the present application provides an electronic device, including:
  • at least one processor; and
  • a memory communicatively connected with the at least one processor; wherein,
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method in the third aspect or any possible implementation of the third aspect.
  • the embodiments of the present application provide a computer program product containing instructions that, when run on an electronic device, cause the electronic device to execute the method in the above second aspect or in the various possible implementations of the second aspect.
  • the embodiments of the present application provide a computer program product containing instructions that, when run on an electronic device, cause the electronic device to execute the method in the above third aspect or in the various possible implementations of the third aspect.
  • the embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions.
  • the instructions, when run on an electronic device, cause the electronic device to perform the method in the foregoing second aspect or the various possible implementations of the second aspect.
  • the embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions.
  • the instructions, when run on an electronic device, cause the electronic device to execute the method in the foregoing third aspect or the various possible implementations of the third aspect.
  • an embodiment of the present application provides a cluster system, including: a control node and at least one computing node, wherein the control node establishes a network connection with each of the at least one computing node based on the Transmission Control Protocol (TCP), and the computing resources of each computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
  • an embodiment in the above application has the following advantage or beneficial effect: by interconnecting the control node and at least one computing node through a network and introducing GPUs as computing resources in the computing nodes, the hardware capabilities of the cluster system are greatly improved, thereby increasing the efficiency of model training.
  • using the HDFS file system to temporarily store the user's execution environment and to store the final running results avoids the disadvantage that the data set used to train the model would occupy too much storage space if stored on the computing nodes, and also avoids the insecurity of placing the trained model on the computing nodes.
  • Figure 1 is a schematic structural diagram of a cluster system provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of the underlying framework of a cluster system provided by an embodiment of the present application.
  • Fig. 3 is a schematic diagram of network optimization of a cluster system provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of system-level performance constraint analysis of a cluster system provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of memory monitoring of computing nodes of a cluster system provided by an embodiment of the present application
  • Fig. 6 is a flowchart of a model training method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the system architecture of HGCP in the model training method provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the process of submitting tasks in the model training method provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the slurm OPEN API in the model training method provided by the embodiment of the present application.
  • FIG. 10 is a schematic diagram of an authentication process in the model training method provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of the deployment of api server in the model training method provided by the embodiment of the present application.
  • FIG. 12 is a working schematic diagram of the super management platform in the model training method provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a model training device provided by an embodiment of the application.
  • FIG. 14 is a schematic structural diagram of a model training device provided by an embodiment of the application.
  • Fig. 15 is a block diagram of an electronic device used to implement the model training method of an embodiment of the present application.
  • the overall structure of high-performance computing (HPC) can be divided into the following main parts: the external network, the master node, the compute nodes, storage, the computation network, and the management network.
  • a high-performance computing cluster is a branch of computer science that aims to solve complex parallel or numerical calculations. It is a loosely coupled collection of computing nodes (servers) that provides users with services such as high-performance computing, network request response, or professional applications (including parallel computing, databases, and the web). However, how to manage the computing nodes of a large-scale computing cluster and how to schedule training tasks is a thorny issue.
  • the embodiments of the present application provide a model training method, device, and cluster system.
  • in terms of hardware, the hardware capabilities of the cluster system are greatly improved, thereby improving the efficiency of model training; in terms of software, the slurm framework is optimized and clients, a super management platform, etc. are introduced, making the cluster system more convenient to use.
  • the embodiments of the present application will be described in detail from two aspects of hardware capability improvement and software capability improvement.
  • Fig. 1 is a schematic structural diagram of a cluster system provided by an embodiment of the present application.
  • the cluster system provided by the embodiment of the present application includes: a control node, at least one computing node, and a storage node; wherein, the control node establishes a connection with each of the at least one computing node, such as based on Transmission Control Protocol (TCP) network connection, etc.; the computing resources of the computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU);
  • the storage node establishes a network connection with each computing node in the at least one computing node to store data required for training the target model.
  • the storage node is, for example, a distributed file system (Hadoop Distributed File System, HDFS).
  • the data required by the target model includes the client, sample data set, etc.
  • the target model is also stored in the storage node.
  • the client is used to submit resource information, etc. to the API server, so that the API server integrates the resource information, obtains the first request, and submits it to the control node.
  • the API server is not shown in the figure; in actual implementation, the API server and the control node can be integrated or set independently. R&D personnel can log in to the cluster system through the first terminal device and submit the first request for requesting model training, etc.; the administrator can log in to the cluster system through the second terminal device to create clusters, delete clusters, and bring machines online, take machines offline, or shield machines, where a machine is a computing node.
  • first terminal device and the second terminal device may be the same device or different terminal devices, which is not limited in the embodiment of the present application.
  • the computing resources of each computing node include CPU and GPU.
  • a computing node is, for example, an all-in-one machine for AI model training with 3 CPUs and 8 GPUs, where the numbers of CPUs and GPUs can be flexibly configured.
  • the computing resources included in the computing node may also be a Field-Programmable Gate Array (FPGA), etc., which is not limited in the embodiment of the present application.
  • the HDFS file system is used to temporarily store the user's execution environment and to store the final running results. It avoids the disadvantage that the data set used to train the model would occupy too much storage space if stored on the computing nodes, and also avoids the insecurity of placing the trained model on the computing nodes.
  • the number of control nodes in the embodiment of the present application is not limited to one.
  • for example, the embodiment of the present application may set one master control node and one standby control node; when the master control node fails, the standby control node can be started.
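As a rough illustration of the master/standby arrangement (all names here are hypothetical; the patent does not specify the failover mechanism or health check):

```python
class ControlNode:
    def __init__(self, name):
        self.name = name
        self.active = False

    def start(self):
        self.active = True

def elect_active(master, standby, is_alive):
    """Use the master normally; start the standby only when the master fails.
    `is_alive` is a hypothetical health check, e.g. a heartbeat timeout."""
    if is_alive(master):
        master.start()
        return master
    standby.start()
    return standby

master, standby = ControlNode("master"), ControlNode("standby")
# Simulate a master failure: the health check reports it as down.
active = elect_active(master, standby, is_alive=lambda node: False)
print(active.name)
```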
  • control node and at least one computing node are interconnected through a network, and GPUs are introduced as computing resources in the computing nodes, thereby greatly improving the hardware capabilities of the cluster system and thereby improving the efficiency of model training.
  • using the HDFS file system to temporarily store the user's execution environment and to store the final running results avoids the disadvantage that the data set used to train the model would occupy too much storage space if stored on the computing nodes, and also avoids the insecurity of placing the trained model on the computing nodes.
  • the existing cluster system is referred to as a high-performance computing (HPC) system
  • the cluster system provided in the embodiments of the present application is referred to as a High Performance GPU Cluster Platform (HGCP).
  • FIG. 2 is a schematic diagram of the underlying framework of the cluster system provided by an embodiment of the present application.
  • the cluster system provided by the embodiment of the present application includes six layers of chip, system design, performance optimization, cluster, framework, and application from bottom to top.
  • the chip layer includes various computing resources, such as the CPU, GPU, FPGA, application-specific integrated circuit (ASIC), and other AI chips.
  • the system design layer includes cloud and edge AI all-in-one machines, high-performance storage pools, high-speed interconnection architecture, etc.
  • the performance optimization layer includes computation optimization, input/output (I/O) optimization, and communication optimization.
  • the cluster layer includes K8S (Kubernetes) cloud native, intelligent scheduling, automatic expansion and contraction, etc.
  • the framework layer includes deep learning frameworks such as PaddlePaddle, TF (TensorFlow), and Torch.
  • the application layer includes video, image, natural language understanding, search, recommendation or advertisement, etc.
  • the cluster system provided by this application is based on the slurm open source Linux cluster resource management system, which has good scalability and high fault tolerance.
  • the HGCP provided by the embodiments of the present application also has complete training task life process management, machine management, and fault monitoring capabilities, with a very high degree of automation.
  • the inherent functions of slurm include resource management functions and rich job scheduling functions, such as simple first-in-first-out (FIFO) scheduling, job priority calculation, and resource preemption.
  • the cluster system provided by the embodiment of the present application also supports the allocation of general computing resources such as GPU, network bandwidth and even memory.
  • in order to enable the high-speed circulation of AI training tasks in the cluster system, HGCP has built an efficient task scheduling system in the upper layer. It takes full account of the amount of high-quality resources in each business and the training tasks actually running and pending in the cluster, pools all resources, sets a high-quality logical quota for each business, and specifies the GPU usage ratio for single-node tasks and multi-node tasks, reducing the impact of resource fragmentation, effectively reducing cluster resource idleness, improving the efficiency of GPU cluster resource usage, and reducing operating costs.
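The quota logic described above might be sketched as follows (a minimal illustration; the function, the field names, and the numbers are invented, not from the patent):

```python
def can_schedule(task_gpus, business, quotas, usage):
    """Admit a task only if the business's remaining logical quota
    in the pooled resources covers the task's GPU demand."""
    remaining = quotas[business] - usage.get(business, 0)
    return task_gpus <= remaining

quotas = {"search": 64, "ads": 32}   # logical quota (GPUs) per business
usage = {"search": 60, "ads": 8}     # GPUs each business currently occupies

print(can_schedule(4, "search", quotas, usage))   # fits: 4 <= 64 - 60
print(can_schedule(8, "search", quotas, usage))   # would exceed the quota
```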
  • any two computing nodes in the at least one computing node establish a network connection based on the InfiniBand (infinite bandwidth) interconnection technology; the CPU and GPU inside a computing node are connected through PCIe (Peripheral Component Interconnect Express), and the GPUs inside a computing node are connected through NVLink.
  • Fig. 3 is a schematic diagram of network optimization of a cluster system provided by an embodiment of the present application.
  • Each computing node includes a CPU node (node) and a GPU box (BOX).
  • the CPU node includes CPU1 and CPU2.
  • the GPU BOX contains three Non-Volatile Memory Express (NVMe) drives, referred to as hard disks; in addition, the GPU BOX also includes 8 GPUs, shown as 0-8 in the figure, a network interface controller (NIC), PCIE SW, etc.
  • the solid arrow in the figure shows the PCIE connection, and the dashed arrow shows the NVlink connection.
  • the GPU part of the first computing node only indicates the PCIE connections, and the GPU part of the second computing node only indicates the NVlink connections; in practice, the GPU part of each computing node includes both PCIE connections and NVlink connections.
  • the cluster system uses InfiniBand (IB), a new I/O bus technology based on full-duplex, switched serial transmission, which replaces the MPI (Message Passing Interface) communication method commonly used in existing cluster systems and simplifies and improves the connection speed between computing nodes.
  • the CPU and GPU in a computing node are connected by PCIE, and the GPU and GPU are interconnected by NVlink, which greatly improves the communication between the GPU cards in the computing node.
  • the bandwidths and delays of PCIE, NVlink, and Ethernet/Remote Direct Memory Access (RDMA) networks vary widely, and the optimal resource combination needs to be allocated.
  • the HGCP provided in the embodiments of this application adopts topology-aware scheduling to optimize communication bandwidth.
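A minimal sketch of what topology-aware placement could look like, assuming made-up relative bandwidth ranks for NVLink, PCIe, and the network (the patent gives no figures or algorithm; everything below is illustrative):

```python
# Relative ranks only, not measured bandwidths: NVLink > PCIe > network.
LINK_BANDWIDTH = {"nvlink": 3, "pcie": 2, "network": 1}

def best_gpu_pair(free_gpus, links):
    """For a 2-GPU task, pick the free pair joined by the fastest link."""
    pairs = [(a, b) for i, a in enumerate(free_gpus) for b in free_gpus[i + 1:]]
    return max(pairs, key=lambda p: LINK_BANDWIDTH[links[frozenset(p)]])

# Hypothetical topology: GPUs 0-1 share NVLink, 0-2 share PCIe,
# 1-2 only reach each other over the network.
links = {
    frozenset((0, 1)): "nvlink",
    frozenset((0, 2)): "pcie",
    frozenset((1, 2)): "network",
}
print(best_gpu_pair([0, 1, 2], links))  # the NVLink-connected pair wins
```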
  • cluster utilization is the core evaluation indicator; increasing utilization is equivalent to reducing the cost of use. At the same time, it helps business training programs to perform data profiling and achieve good results in performance optimization.
  • the existing HPC has no system-level fine-grained performance analysis tools. To perform performance analysis, the usual method is to analyze a single node after consulting with the business; human intervention is required from startup to collection to data analysis, training must be started in coordination with the business, and only specific problems can be analyzed case by case, so the efficiency is low and the approach is not suitable for large-scale promotion.
  • the HGCP uses a deep learning system performance profiler (Dperf) to perform performance analysis on HGCP.
  • Dperf is a general-purpose, system-level, one-stop performance analysis and bottleneck locating system for deep learning training.
  • this tool uniformly captures the traffic information of key data paths on computing nodes, such as NET, IO, H2D, and P2P, together with the utilization information of key computing resources such as the CPU, Double Data Rate memory (DDR), and Graphics Double Data Rate memory (GDDR), and displays them on the same axis, which makes it convenient for the business to locate program bottlenecks and perform targeted optimization.
  • the Dperf training tool is combined with the cluster task scheduling to automatically monitor the tasks of the GPU training cluster.
  • the Dperf provided in the embodiments of the present application has the advantages of low overhead, multiple dimensions, easy scalability, fine granularity, and visualization; for example, refer to FIG. 4.
  • Fig. 4 is a schematic diagram of system-level performance constraint analysis of a cluster system provided by an embodiment of the present application.
  • the entire process of deep learning training involves environment preparation, data reading, data preprocessing, forward training, backward training, and parameter update.
  • data storage is constrained by the CPU, main memory, and hard disk I/O, while the training process is affected by factors such as the uplink/downlink links, video memory, and so on.
  • the Dperf system-level performance analysis tool can analyze which hardware constrains the program. For example, if data reading and preprocessing take a long time and the system has spare CPU and disk resources, more data processing processes can be opened to increase the data processing speed; if the training program waits a long time for training data, data processing and training can be executed asynchronously to reduce the waiting time.
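The asynchronous data processing suggested above can be illustrated with a bounded producer/consumer queue (a generic sketch, not the patent's implementation; `preprocess` and the training step are stand-ins):

```python
import queue
import threading

def preprocess(batch):
    return [x * 2 for x in batch]   # stand-in for real preprocessing

def producer(batches, q):
    """Preprocess batches ahead of time, concurrently with training."""
    for batch in batches:
        q.put(preprocess(batch))
    q.put(None)                     # sentinel: no more data

def train(q):
    """Consume preprocessed batches without waiting on raw data loading."""
    seen = []
    while (batch := q.get()) is not None:
        seen.append(sum(batch))     # stand-in for a training step
    return seen

q = queue.Queue(maxsize=4)          # bounded buffer limits memory use
t = threading.Thread(target=producer, args=([[1, 2], [3, 4]], q))
t.start()
losses = train(q)
t.join()
print(losses)
```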
• the current HPC computing nodes are limited by the number of GPU cards, communication, power consumption, heat dissipation and other issues, and their computing power density is low, so they cannot meet the needs of model training tasks.
• the HGCP provided in the embodiments of the present application utilizes computing nodes with GPUs, has high computing density and high heat-dissipation efficiency, and supports systemized hardware modules, standardized interconnection interfaces and flexible interconnection topologies. It leads the hardware development direction of AI computing and the development of AI hardware platforms, and effectively supports cluster AI training tasks.
• the current HPC lacks real-time fine-grained monitoring of each computing node and each computing task, so the utilization information of key resources such as the CPU, DDR, GPU and GDDR cannot be uniformly captured and displayed on the same axis. Users and administrators can only log in to a physical node to view the running status of the machine, or are passively informed of fault information by the business side, which greatly affects the operating efficiency of the cluster.
  • a monitoring platform and a hardware monitoring plug-in are deployed in the HGCP cluster to monitor and collect data in real time.
• the key performance data, such as the CPU, GPU, memory, network and storage of functional components such as the control nodes and computing nodes of the HGCP cluster, can be visually displayed in a graphical manner, so as to understand the operating status of the hardware environment, discover in time failures that may be hidden in HGCP, and provide solutions to the failures at the first opportunity.
• FIG. 5 is a schematic diagram of memory monitoring of a computing node of a cluster system provided in an embodiment of the present application. Please refer to FIG. 5. From 14:40 to 15:40, the memory occupation of a computing node is shown as the waveform in the figure.
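• a sliding-window sampler like the one behind such a memory waveform might be sketched as follows (illustrative only; the class name and one-hour window are assumptions, and a real plug-in would read the values from the node's hardware counters):

```python
import time
from collections import deque

class MemoryMonitor:
    """Keeps a sliding one-hour window of (timestamp, value) samples,
    like the 14:40-15:40 waveform in FIG. 5."""
    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.samples = deque()

    def record(self, value, now=None):
        now = time.time() if now is None else now
        self.samples.append((now, value))
        # Drop samples that have fallen out of the window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()

    def peak(self):
        return max(v for _, v in self.samples) if self.samples else None

mon = MemoryMonitor()
for ts, used_gb in [(0, 10.2), (600, 11.5), (1200, 9.8), (4000, 12.1)]:
    mon.record(used_gb, now=ts)
print(mon.peak())  # the sample at ts=0 has aged out of the window
```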
• the HGCP provided in the embodiments of this application did not have a smooth operation and maintenance process at the beginning of its construction, and needs to realize the proceduralization of operations, the standardization of procedures and the automation of standards. At the same time, operation and maintenance automation cannot solve all problems, and automation should not be pursued for its own sake: 20% of tasks are repetitive and consume 80% of time and energy, so concentrating on automating that 20% of repetitive tasks basically achieves a good state.
• the cluster automated operation and maintenance tool is designed to manage a large number of computing nodes and provides a unified graphical user interface.
  • the HGCP cluster provided by the embodiment of the present application performs machine management through a super-management platform system.
  • Fig. 6 is a flowchart of a model training method provided by an embodiment of the present application. This embodiment explains in detail the model training method described in the embodiment of the present application from the perspective of the interaction between the control node and the computing node.
  • the present embodiment includes:
  • the client on the first terminal sends resource information required for training the target model to the API server.
  • the control node receives the first request sent by the application program interface API server.
  • the first request is obtained by the API server according to the resource information required by the training target model sent by the first user through the client on the first terminal.
• the current HPC does not have a user client, so it is cumbersome for the first user to use the HPC to train a model.
• the first user usually refers to a person who trains a model, such as a researcher, and the model may be any of various AI models, such as a face recognition model or a face detection model, which is not limited in the embodiments of the present application.
  • the model training method provided by the embodiment of the present application encapsulates the HGCP in advance to obtain a client, which is stored on HDFS for download by the first user.
  • the first user downloads and installs the client on the first terminal device, and the client is used to submit training tasks to HGCP.
  • the control node allocates target resources to the target model according to the resource information.
• the current HPC training task management is extensive. Although it can serve multiple tenants, that is, it can be used by multiple first users at the same time, with different first users training different target models, different first users' usage requirements have peaks and troughs. Most existing slurm-based HPCs use the FIFO queuing mechanism by default, with no priority control and no support for oversubscription, so some first users' resources sit idle while other first users have no resources available.
  • the computing resources of HGCP include CPU, GPU, memory, FPGA, etc.
• the configuration interface is displayed on the display interface of the first terminal device for the first user to configure the number of computing nodes required for training the target model and, for each computing node, the CPUs, GPUs and so on to be used.
• the first terminal device generates the resource information required for training the target model according to the configuration input by the user and sends the resource information to the API server; the API server integrates the resource information and the like, generates a first request and sends it to the control node.
  • the control node allocates computing resources for the target model according to the first request.
• the resource information carried in the first request is 4 computing nodes and 16 GPUs, and the control node allocates 4 computing nodes to the target model. Assuming that there are 8 GPUs on each computing node, the 4 computing nodes may each provide 4 GPUs for the target model, or may provide unequal numbers of GPUs that still total 16, for example 6, 4, 4 and 2 GPUs in sequence.
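• the allocation rule in this example, namely that the per-node GPU counts must add up to the requested total, can be sketched as follows (illustrative Python; the function and parameter names are assumptions, not part of the embodiment):

```python
def split_gpus(total_gpus, num_nodes, per_node=None):
    """Return a per-node GPU allocation for a request such as
    '4 nodes, 16 GPUs'. If per_node is omitted, split evenly."""
    if per_node is None:
        if total_gpus % num_nodes:
            raise ValueError("uneven request needs an explicit split")
        per_node = [total_gpus // num_nodes] * num_nodes
    if len(per_node) != num_nodes or sum(per_node) != total_gpus:
        raise ValueError("split does not match the request")
    return per_node

print(split_gpus(16, 4))                 # even: 4 GPUs from each node
print(split_gpus(16, 4, [6, 4, 4, 2]))   # uneven, still 16 in total
```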
  • the control node sends a second request to the target computing node.
  • the target computing node is a computing node containing the target resource.
• after the control node configures the target resource for the target model, it sends a second request to the computing node containing the target resource to trigger the target computing node to train the target model.
  • the target computing node uses the target resource to train a target model.
  • the target computing node stores the trained target model in the storage node.
• continuing with the example in step 102 above, assuming that the target computing nodes that provide the 16 GPUs are computing node 1, computing node 2, computing node 3 and computing node 4, the four computing nodes serve as target computing nodes and train the target model in a distributed manner. After the training is completed, each node stores its trained part in the storage node, such as in HDFS.
• after receiving the first request sent by the API server, the control node allocates target resources to the target model according to the first request, and sends a second request to the target computing node containing the target resources to trigger the target computing node to perform model training and store the trained model in the HDFS system.
• software improvements generally include system-architecture improvements and slurm open application programming interface (Application Programming Interface, API) improvements. The two improvements are described in detail below.
  • FIG. 7 is a schematic diagram of the system architecture of HGCP in the model training method provided by an embodiment of the present application.
  • the HGCP system provided by the embodiment of the present application realizes complete isolation of users and resources.
• the first user downloads and installs the client from the HDFS system, and sends the resource information required for training the target model to the API server through the client, so that the API server integrates the resource information and the like to obtain the first request, and submits the first request for training the model to the control node.
• while the target task is running on the target node, the first user can send a query request to the target computing node through the first terminal device.
• the query request is used to request display of the usage status of the target resources on the target computing node when training the target model.
• after receiving the query request, the target computing node obtains the running status of the task of training the target model and the data generated during the running process. After that, the target computing node sends a query response to the first terminal device, where the query response carries the usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
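• the query exchange described above might take a shape like the following (a sketch; the field names such as task_id and gpu_utilization are illustrative, not defined by the embodiment):

```python
def handle_query(query, running_tasks):
    # The target computing node looks up the running task and returns
    # the usage status of the target resources in the query response.
    task = running_tasks.get(query["task_id"])
    if task is None:
        return {"status": "not_found"}
    return {
        "status": "ok",
        "gpu_utilization": task["gpu_utilization"],
        "cpu_utilization": task["cpu_utilization"],
        "memory_used_gb": task["memory_used_gb"],
    }

running_tasks = {
    "job-42": {"gpu_utilization": 0.87, "cpu_utilization": 0.35,
               "memory_used_gb": 21.4},
}
resp = handle_query({"task_id": "job-42"}, running_tasks)
print(resp["gpu_utilization"])
```

• the first terminal device would then render these fields, for example as the utilization curves shown in FIG. 5.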
• after the target model is trained, the target model is maintained on the HDFS system, and the first user or other users can download the final result from the HDFS system.
• each module in FIG. 7 will be described in detail below.
• based on the client stored on the HDFS system, the first user can download the client anywhere and send resource information to the API server through the client, so that the API server integrates the resource information to obtain a first request and sends the first request to the control node; one first request can be regarded as one task.
  • the resource information carried in the first request includes at least one of the following information: the number of target computing nodes, the number of GPUs that are occupied when the target computing node is used to train the target model, and the target computing node is used to train the The number of CPUs occupied in the target model, the path of the HDFS system, and the user name or password of the HDFS.
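• the resource information listed above could be carried in a payload such as the following (a sketch; the field names are assumptions for illustration):

```python
from dataclasses import dataclass, asdict

@dataclass
class FirstRequest:
    """Resource information listed in the description above.
    Field names are illustrative, not defined by the embodiment."""
    num_nodes: int       # number of target computing nodes
    gpus_per_node: int   # GPUs occupied on each node
    cpus_per_node: int   # CPUs occupied on each node
    hdfs_path: str       # path on the HDFS system
    hdfs_user: str       # HDFS user name

req = FirstRequest(num_nodes=4, gpus_per_node=4, cpus_per_node=8,
                   hdfs_path="hdfs://cluster/models/face-det",
                   hdfs_user="alice")
payload = asdict(req)   # what the API server would integrate and forward
print(payload["num_nodes"], payload["gpus_per_node"])
```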
• the background corresponding to the client uses the slurm OPEN API described in the embodiments of this application to perform operations such as submitting, viewing and terminating jobs and obtaining training data, and job submission adopts an asynchronous submission mode.
  • FIG. 8 is a schematic diagram of the process of submitting tasks in the model training method provided by the embodiment of the present application.
• the first user submits the job through the client on the first terminal device, the API service (server) performs request authentication, and the job is stored in the database after the authentication is passed.
  • the job manager running on the control node obtains the submitted job from the database and submits the job to HGCP.
• the job synchronization controller (Job SyncController) running on the computing node synchronizes the running status of the job to the monitor server (Monitor Server) and the slurm resource management system.
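• the asynchronous submission path, in which the API server only persists the job and the job manager later picks it up from the database, can be sketched as follows (illustrative Python using an in-memory SQLite table in place of the real database):

```python
import sqlite3

# Asynchronous submission: the API server only persists the job;
# a separate job manager polls the database and submits to the cluster.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, user TEXT, "
           "state TEXT)")

def api_submit(user):
    # API server path: authenticate (omitted) and store, then return
    # immediately without waiting for the cluster.
    cur = db.execute("INSERT INTO jobs (user, state) VALUES (?, 'pending')",
                     (user,))
    return cur.lastrowid

def job_manager_poll():
    # Job manager path: pick up pending jobs and hand them to the cluster.
    rows = db.execute("SELECT id FROM jobs WHERE state='pending'").fetchall()
    for (job_id,) in rows:
        db.execute("UPDATE jobs SET state='submitted' WHERE id=?", (job_id,))
    return [r[0] for r in rows]

job_id = api_submit("alice")
submitted = job_manager_poll()
print(submitted)  # the pending job was picked up asynchronously
```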
  • the HDFS system is a system used to temporarily store the user execution environment and store the final trained model, where the user execution environment is the aforementioned client.
• the embodiment of the present application does not limit the system to HDFS; in other feasible implementation manners, it may also be a file system private to the first user.
  • the resource scheduler is a module on the control node, which is used to allocate target resources to the target model according to the first request.
• the granularity of resource allocation is the GPU instead of the computing node. If a model training task of the first user cannot use up all the GPUs on the target computing node, the target computing node and its remaining GPUs can be allocated to other training tasks.
• the scheduler can support mixed scheduling of CPUs and GPUs at the same time. For example, when the first user submits a training task whose required resource is GPUs, if not all the GPUs are used up, other users can also submit training tasks using the remaining GPUs.
  • resources are divided at the granularity of computing nodes and GPUs, and one training task can be run on different GPUs of different computing nodes.
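• GPU-granularity allocation, where leftover GPUs on a node remain available to other tasks, can be sketched as follows (a simplified illustration, not slurm's actual scheduling logic):

```python
class GpuScheduler:
    """Allocates at GPU granularity: a task may span several nodes,
    and leftover GPUs on a node stay available for other tasks."""
    def __init__(self, gpus_per_node):
        self.free = dict(gpus_per_node)  # node -> free GPU count

    def allocate(self, n_gpus):
        grant = {}
        for node, free in sorted(self.free.items()):
            if n_gpus == 0:
                break
            take = min(free, n_gpus)
            if take:
                grant[node] = take
                self.free[node] -= take
                n_gpus -= take
        if n_gpus:  # not enough capacity: roll back the partial grant
            for node, take in grant.items():
                self.free[node] += take
            return None
        return grant

sched = GpuScheduler({"node1": 8, "node2": 8})
a = sched.allocate(10)   # spans two nodes
b = sched.allocate(6)    # fits in what node2 has left
print(a, b, sched.free)
```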
  • FIG. 9 is a schematic diagram of the slurm OPEN API in the model training method provided by the embodiment of the present application.
  • the architecture includes:
• Third-party platforms refer to some deep learning platforms, such as PaddleCloud and other platforms;
  • Cluster component refers to the slurm cluster client
  • API server refers to the unified entrance of slurm OPEN API, responsible for route analysis and request processing, etc.;
  • Authentication refers to the slurm cluster authentication service module
  • Database refers to the XDB data platform, which stores data such as user permissions, job information, and queue quota (quota);
• Job manager is used for job management control, responsible for job queuing and submission control;
  • job synchronization controller (job sync Controller) is responsible for synchronizing data such as job status, GPU utilization, GPU slot, node rank, and time;
• Queue synchronization controller is responsible for pushing queue update events to third-party platforms (new queue, queue quota update, etc.);
• Node monitoring service is deployed on each computing node, providing running data of the training jobs on that computing node.
• Open API interface authentication is mainly used to authenticate the identity of a request and judge the legitimacy of the current request. Common methods include token authentication and AK/SK authentication; for interface access security, the embodiments of this application use the AK/SK authentication method.
• specifically, the control node receives a management request sent by a second user using a second terminal device, where the management request carries the access key identifier of the second user and a first key, and the first key is generated by the second terminal device using a preset authentication mechanism.
• the control node then calls the cluster open application program interface (Open API) to authenticate the second user: the control node calls the cluster Open API to generate a second key using the same preset authentication mechanism.
• if the two keys are the same, the control node determines the management authority of the second user and, according to the management authority, sends to the second terminal device a data stream for updating the graphical interface of the management platform, so that the second terminal device updates and displays the graphical interface of the management platform, and the second user can manage the cluster system through the updated management platform graphical interface.
• the access key identifier (Access Key ID, AK) is used to identify the second user, and the first key is derived from a secret access key (Secret Access Key, SK) held by the second user.
• after receiving the management request sent by the second user, the control node uses the same preset authentication mechanism to generate an authentication string, referred to below as the second key. After that, the control node compares the first key in the management request with the generated second key; if the two keys are the same, it determines the management authority of the second user and performs the related operations. If the two keys are not the same, the control node ignores the operation and returns an error code to the second terminal device.
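• the AK/SK comparison described above can be sketched with an HMAC-based signature (a minimal illustration; the actual preset authentication mechanism is not specified by the embodiment, and HMAC-SHA256 is an assumption):

```python
import hashlib
import hmac

# Both sides derive a signature from the request with the shared SK;
# the control node compares its own result with the one the client sent.
def sign(secret_key, access_key_id, request_body):
    msg = (access_key_id + request_body).encode()
    return hmac.new(secret_key.encode(), msg, hashlib.sha256).hexdigest()

SK = "s3cr3t"          # shared secret, never sent on the wire
AK = "AKIDEXAMPLE"     # identifies the second user

first_key = sign(SK, AK, "DELETE /cluster/7")   # computed by the terminal
second_key = sign(SK, AK, "DELETE /cluster/7")  # recomputed by the control node

# Same mechanism, same inputs -> same key, so the request is served;
# any mismatch would be ignored and an error code returned.
authorized = hmac.compare_digest(first_key, second_key)
print(authorized)
```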
  • FIG. 10 is a schematic diagram of the authentication process in the model training method provided by the embodiment of the present application.
• the second user sends the AK/SK to the authentication service on the control node through the client on the second terminal device, and the authentication service returns a token to the second terminal device.
• after that, the client on the second terminal device sends a management request and the token to the API service on the control node, and the API service sends a management response to the second terminal device according to the management request and the token.
  • the second user is an administrator, which can be divided into multiple levels, such as cluster administrators, department administrators, ordinary users, etc., for example, see Table 2.
  • FIG. 11 is a schematic diagram of the deployment of api server in the model training method provided by the embodiment of the present application.
• the api server is deployed on three servers, server1, server2 and server3.
  • server1 deploys job_manager, job_sync_controller and 4 apiserver instances
• server2 and server3 each deploy 1 nginx instance
  • api server is bound to nginx
  • nginx is bound to BGW.
  • a super-management platform is set up for HGCP for machine management, cluster management, etc., mainly to provide the following main features for administrators and users:
• the HGCP super-management platform system runs on a Linux server and uses a MySQL database to store statistics, monitoring, configuration, log and other data.
• the back end is integrated into modules by general function and is developed with Hypertext Preprocessor (PHP), Python, Ansible and Shell, operating the database data and the computing nodes through the super-management platform API interface.
  • the front-end display page is for cluster administrators and ordinary users, simplifying operations as much as possible and improving efficiency;
• the HGCP provided in the embodiments of this application is equipped with multiple control nodes to ensure the disaster tolerance and service continuity of the management system. These control nodes use Ansible to remotely manage the cluster computing nodes to perform environment configuration, upgrade adjustment, system inspection and so on.
• the control node receives a management request sent by a second user using a second terminal device, where the management request is used to request management of the computing nodes in the cluster system and is obtained by the second terminal device according to the user's operation on the graphical interface of the management platform.
• the control node then calls the cluster open application program interface (Open API) to authenticate the second user.
• if the second user passes the authentication, the control node manages the computing nodes in the cluster system according to the management request.
• FIG. 12 is a working schematic diagram of the super-management platform in the model training method provided by the embodiment of the present application.
  • the cluster Open API includes cluster management API and machine management API.
  • the screen of the second terminal device displays the management platform graphical interface of the super-management platform.
• the cluster administrator performs cluster operations through the super-management platform graphical interface, which calls down to the cluster management API or the machine management API.
• when the cluster management API is called, the management request is used to request the creation or deletion of a cluster.
  • the cluster information in the database is configured, and the underlying cluster management module (cluster_manager) detects that there is a new operation task in the database and starts to perform related operations;
• when the machine management API is called, the management request is used to request any one of the at least one computing node to perform any one of the following operations: online, offline, restart, reinstall, repair and shield.
• the node information in the database is configured, and the node management module (node_manager) detects that there is a new operation task in the database and starts to perform the related operations.
  • Operations for clusters include:
• for cluster security, the administrator must first take all machines in the cluster offline before deleting the cluster. During the deletion process, first, the parameters are verified, including whether the cluster exists, whether there are still running machines in the cluster, whether the parameters are legal, and so on; then, the deletion task is written into the cluster operation task (cluster_task) table, the task operation (task_op) is set to uninstall, and the task status (task_status) is set to pending; finally, the cluster manager completes the real offline operation.
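• the delete-cluster flow above, verifying the parameters and then enqueuing a pending uninstall task for the cluster manager, can be sketched as follows (illustrative Python with an in-memory SQLite stand-in for the real tables; table and column names follow the description above, the cluster name is hypothetical):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cluster_info (name TEXT, running_nodes INTEGER)")
db.execute("CREATE TABLE cluster_task (name TEXT, task_op TEXT, "
           "task_status TEXT)")
db.execute("INSERT INTO cluster_info VALUES ('hgcp-a', 0)")

def delete_cluster(name):
    # First, verify the parameters: the cluster must exist and must
    # have no machines still running.
    row = db.execute("SELECT running_nodes FROM cluster_info WHERE name=?",
                     (name,)).fetchone()
    if row is None:
        return "no such cluster"
    if row[0] > 0:
        return "machines still running"  # must be taken offline first
    # Then, enqueue the task; the cluster manager does the real work later.
    db.execute("INSERT INTO cluster_task VALUES (?, 'uninstall', 'pending')",
               (name,))
    return "queued"

result = delete_cluster("hgcp-a")
task = db.execute("SELECT task_op, task_status FROM cluster_task").fetchone()
print(result, task)
```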
  • the basic information list of the cluster includes the cluster_info table and the cluster_task table.
• the cluster_info table contains the information of clusters that are already running online, and the cluster_task table contains the information of clusters whose operations are in process. If the two tables represent the same cluster and there is an offline operation, the status is based on the status in the cluster_task table.
  • the cluster_info table contains the online clusters, and the node_info table aggregates the required information.
  • the cluster machine list includes node_info table and node_task table.
  • the node_info table obtains the list of online machines
  • the node_task table obtains the list of machines in the process.
• bringing a machine online is an operation whose effect is to expand the capacity of the cluster system.
  • the parameters are verified, including verifying the existence of the cluster, and verifying the validity of the online parameters.
• then, the online task is written into the node operation task (node_task) table, the task operation (task_op) is set to install, and the task status (task_status) is set to pending; the information about the node to be brought online is written into the node information (node_info) table and marked as installing; finally, the node manager completes the actual online operation and updates the task and info tables.
• when a machine is taken offline, the machine is automatically marked as unschedulable first, and then the offline process is executed.
  • the parameters are first verified, including verifying the existence of the cluster, and verifying the validity of the offline parameters. After that, query the node information (node_info) table.
  • the parameters are first verified, including verifying whether the machine joins the cluster, and verifying the validity of the parameters. After that, query the cluster information (cluster_info) table to obtain the cluster apiserver address, and call the apiserver interface to complete the state shielding.
  • the parameters are first verified, including verifying whether the machine joins the cluster, and verifying the validity of the parameters. After that, query the cluster information (cluster_info) table to obtain the cluster apiserver address, and call the apiserver interface to complete the machine attribution label change.
  • FIG. 13 is a schematic structural diagram of a model training device provided by an embodiment of the application.
  • the device can be integrated in an electronic device or realized by an electronic device, and the electronic device can be a terminal device or a server.
  • the model training apparatus 100 may include:
• the receiving unit 11 is configured to receive a first request sent by an application program interface API server, where the first request is obtained by the API server according to the resource information required for training the target model sent by the first user through the client on the first terminal;
  • the processing unit 12 is configured to allocate target resources to the target model according to the resource information
  • the sending unit 13 is configured to send a second request to a target computing node, so that the target computing node uses the target resource to train a target model.
  • the resource information includes at least one of the following information: the number of target computing nodes, the number of GPUs occupied when the target computing node is used to train the target model, and the target computing node The number of CPUs occupied by the node when training the target model.
  • the receiving unit 11 is further configured to receive a management request sent by a second terminal device, and the management request is used to request management of computing nodes in the cluster system;
  • the processing unit 12 is further configured to manage the computing nodes in the cluster system according to the management request.
• when the processing unit 12 manages the computing nodes in the cluster system according to the management request, it calls the cluster open application program interface (Open API) to authenticate the second user, and after the second user passes the authentication, manages the computing nodes in the cluster system according to the management request.
• the management request carries the access key identifier of the second user and the first key, and the first key is generated by the second terminal device using a preset authentication mechanism;
  • the processing unit 12 is configured to call the cluster Open API to generate a second key using the preset authentication mechanism, and if the first key and the second key are the same, determine the second user Management authority;
  • the sending unit 13 is further configured to send authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
  • the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster;
  • the cluster Open API includes a machine management API, and the management request is used to request any one of the at least one computing node to perform any of the following operations: online, offline, restart, reinstall, maintenance, shield.
  • the device provided in the embodiment of the present application can be used in the method executed by the control node in the above embodiment, and its implementation principle and technical effect are similar, and will not be repeated here.
  • FIG. 14 is a schematic structural diagram of a model training device provided by an embodiment of the application.
  • the device can be integrated in an electronic device or realized by an electronic device, and the electronic device can be a terminal device or a server.
  • the model training device 200 may include:
  • the receiving unit 21 is configured to receive a second request sent by the control node.
  • the second request is sent after the control node receives the first request sent by the application program interface API server and allocates target resources to the target model.
  • the first request is obtained by the API server according to the resource information required for training the target model sent by the first user through the client on the first terminal, and the target node is included in the at least one computing node;
  • the processing unit 22 is configured to use the target resource to train the target model
  • the sending unit 23 is used to send the trained target model to the storage node.
  • the resource information includes at least one of the following information: the number of target computing nodes, the number of GPUs occupied when the target computing node is used to train the target model, and the target computing node The number of CPUs occupied by the node when training the target model.
• the receiving unit 21 is further configured to receive a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resources on the target computing node when training the target model;
  • the sending unit 23 is further configured to send a query response to the first terminal device, where the query response carries the usage status information of the target resource, so that the first terminal device displays the information according to the usage status information. State the usage status of the target resource.
  • the device provided in the embodiment of the present application can be used in the method executed by the target computing node in the above embodiment, and its implementation principles and technical effects are similar, and will not be repeated here.
  • Fig. 15 is a block diagram of an electronic device used to implement the model training method of an embodiment of the present application.
  • Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices can also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementation of the application described and/or required herein.
  • the electronic device includes: one or more processors 31, memory 32, and interfaces for connecting various components, including a high-speed interface and a low-speed interface.
  • the various components are connected to each other using different buses, and can be installed on a common motherboard or installed in other ways as needed.
  • the processor may process instructions executed in the electronic device, including instructions stored in or on the memory to display graphical information of the GUI on an external input/output device (such as a display device coupled to an interface).
  • an external input/output device such as a display device coupled to an interface.
• if necessary, multiple processors and/or multiple buses can be used with multiple memories.
  • multiple electronic devices can be connected, and each device provides part of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system).
  • One processor 31 is taken as an example in FIG. 15.
  • the memory 32 is a non-transitory computer-readable storage medium provided by this application.
  • the memory stores instructions executable by at least one processor, so that the at least one processor executes the model training method provided in this application.
  • the non-transitory computer-readable storage medium of the present application stores computer instructions, and the computer instructions are used to make a computer execute the model training method provided in the present application.
  • the memory 32 can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the model training method in the embodiment of the present application (for example, The receiving unit 11, the processing unit 12, and the sending unit 13 shown in FIG. 13, and the receiving unit 21, the processing unit 22, and the sending unit 23 shown in FIG. 14).
  • the processor 31 executes various functional applications and data processing of the server by running non-transitory software programs, instructions, and modules stored in the memory 32, that is, implementing the method of model training in the foregoing method embodiment.
  • the memory 32 may include a program storage area and a data storage area.
  • the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the model-training electronic device.
  • the memory 32 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • the memory 32 may optionally include memories remotely provided with respect to the processor 31, and these remote memories may be connected to an electronic device for model training via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the electronic equipment of the model training method may further include: an input device 33 and an output device 34.
  • the processor 31, the memory 32, the input device 33, and the output device 34 may be connected by a bus or in other ways. In FIG. 15, the connection by a bus is taken as an example.
  • the input device 33 can receive input digital or character information and generate key signal inputs related to the user settings and function control of the model-training electronic device, and may be, for example, a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, a joystick, or another input device.
  • the output device 34 may include a display device, an auxiliary lighting device (for example, LED), a tactile feedback device (for example, a vibration motor), and the like.
  • the display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some embodiments, the display device may be a touch screen.
  • Various implementations of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a dedicated or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and/or apparatus (for example, magnetic disks, optical disks, memory, or programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals.
  • "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
  • the systems and techniques described here can be implemented on a computer that has: a display device for displaying information to the user (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer.
  • Other types of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input, or tactile input).
  • the systems and technologies described herein can be implemented in a computing system that includes back-end components (for example, as a data server), or a computing system that includes middleware components (for example, an application server), or a computing system that includes front-end components (for example, a user computer with a graphical user interface or a web browser through which the user can interact with an implementation of the systems and technologies described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system can be connected to each other through any form or medium of digital data communication (for example, a communication network). Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.
  • the computer system can include clients and servers.
  • the client and server are generally remote from each other and usually interact through a communication network.
  • the relationship between client and server arises from computer programs that run on the respective computers and have a client-server relationship with each other.
  • An embodiment of the present application also provides a cluster system, including: a control node and at least one computing node, wherein the control node establishes a network connection with each computing node of the at least one computing node based on the transmission control protocol TCP;
  • the computing resources of the computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
  • the hardware capability of the cluster system is greatly improved, thereby improving the efficiency of model training; on the software side, the Slurm framework is optimized, and a client, a super management platform, and the like are introduced, making the cluster system more convenient to use.


Abstract

Disclosed are a model training method and apparatus, and a cluster system, relating to the technical field of artificial intelligence. In the specific implementation, on the hardware side, a control node and at least one compute node are interconnected over a network, and GPUs are introduced into the compute nodes as computing resources, so that the hardware capability of the cluster system is greatly improved and the model training efficiency is improved accordingly. On the software side, the cluster system is made more convenient to use by optimizing the Slurm framework and introducing a client, a super management platform, and the like.

Description

Model Training Method, Device, and Cluster System
This application claims priority to Chinese Patent Application No. 202010080825.4, filed with the Chinese Patent Office on February 5, 2020 and entitled "Model Training Method, Device and Cluster System", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of Artificial Intelligence (AI), and in particular to a model training method, device, and cluster system.
Background
With the continuous development of artificial intelligence, the demand for AI model training keeps growing. During AI model training, when the training data set is small, the results of deep learning are not ideal, and may even be inferior to those of relatively simple machine learning methods. However, as the data set grows, AI models trained by deep learning begin to outperform models trained by other machine learning methods.
In a common deep learning workflow, a large-scale data set is trained on a high performance computing (HPC) cluster to obtain an AI model. The overall structure of an HPC cluster can be divided into the following main parts: external network, master node, compute nodes, storage, computation network, and management network. The computing resources of a compute node include a single-core central processing unit (CPU), a multi-core CPU, or multiple CPUs.
In the HPC cluster described above, the computing resources of a single compute node are mainly CPUs, and the hardware capability is limited; as a result, such an HPC cluster trains AI models with deep learning inefficiently.
Summary
The embodiments of the present application provide a model training method, device, and cluster system, which use compute nodes equipped with GPU cards to improve the hardware capability of the cluster system, thereby improving the efficiency of model training.
In a first aspect, an embodiment of the present application provides a cluster system, including: a control node, at least one compute node, and a storage node. The control node establishes a connection with each of the at least one compute node and is configured to allocate computing resources for a task of training a target model. Each compute node includes at least one central processing unit (CPU) and at least one graphics processing unit (GPU), and is configured to train the target model using the computing resources. The storage node establishes a network connection with each of the at least one compute node and is configured to store the data required for training the target model.
In a feasible design, any two of the at least one compute node establish a network connection based on InfiniBand interconnect technology; within a compute node, the CPUs and GPUs are connected via Peripheral Component Interconnect Express (PCIe), and the GPUs are connected to each other via NVLink.
In a second aspect, an embodiment of the present application provides a model training method applicable to a cluster system of a control node, at least one compute node, and a storage node. The method includes: the control node receives a first request sent by an application program interface (API) server, where the first request is obtained by the API server according to the resource information, required for training a target model, that a first user sends through a client on a first terminal; the control node allocates a target resource to the target model according to the resource information; and the control node sends a second request to a target compute node, so that the target compute node trains the target model using the target resource.
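The flow in this aspect (client → API server → control node → target compute node) can be sketched minimally as follows; all message shapes, field names, and the first-fit allocation policy are illustrative assumptions, not details prescribed by this application:

```python
# Hedged sketch of the second-aspect flow: the control node receives the first
# request, allocates target resources, and builds the second request for the
# target compute node. Message shapes are hypothetical, not the patent's API.

def handle_first_request(first_request, cluster_nodes):
    """Control node: allocate target resources and build the second request."""
    resource_info = first_request["resource_info"]
    needed = resource_info["num_nodes"]
    # Pick compute nodes with enough free GPUs (simplified first-fit policy).
    targets = [n for n in cluster_nodes
               if n["free_gpus"] >= resource_info["gpus_per_node"]][:needed]
    if len(targets) < needed:
        raise RuntimeError("insufficient cluster resources")
    return {
        "model": first_request["model"],
        "nodes": [n["name"] for n in targets],
        "gpus_per_node": resource_info["gpus_per_node"],
        "cpus_per_node": resource_info["cpus_per_node"],
    }

nodes = [{"name": "cn0", "free_gpus": 8}, {"name": "cn1", "free_gpus": 2}]
req = {"model": "resnet", "resource_info":
       {"num_nodes": 1, "gpus_per_node": 4, "cpus_per_node": 16}}
print(handle_first_request(req, nodes)["nodes"])  # ['cn0']
```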
In a feasible design, the resource information includes at least one of the following: the number of target compute nodes, the number of GPUs occupied when the target compute nodes train the target model, and the number of CPUs occupied when the target compute nodes train the target model.
In a feasible design, the above method further includes: the control node receives a management request sent by a second terminal device, where the management request is used to request management of the compute nodes in the cluster system; and the control node manages the compute nodes in the cluster system according to the management request.
In a feasible design, the control node managing the compute nodes in the cluster system according to the management request includes: the control node calls a cluster open application program interface (Open API) to authenticate a second user; and if the second user passes the authentication, the control node manages the compute nodes in the cluster system according to the management request.
In a feasible design, the management request carries an access key identifier of the second user and a first key, where the first key is generated by the second terminal device using a preset authentication mechanism. The control node calling the cluster Open API to authenticate the second user includes: the control node calls the cluster Open API and generates a second key using the preset authentication mechanism; if the first key and the second key are the same, the control node determines the management authority of the second user, and sends authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority of the second user according to the authority information.
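The key comparison described in this design resembles a standard access-key/secret-key signature scheme. The sketch below assumes HMAC-SHA256 as the "preset authentication mechanism", which this application does not actually specify:

```python
import hashlib
import hmac

# Hedged sketch: AK/SK-style authentication. The "preset authentication
# mechanism" is unspecified in the application; HMAC-SHA256 is an assumption.
SECRETS = {"AK123": b"user-secret"}  # access key id -> secret key (control node)

def sign(secret, payload):
    """First key: computed by the terminal device over the request payload."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def authenticate(access_key_id, first_key, payload):
    """Control node: recompute the second key and compare it with the first."""
    secret = SECRETS.get(access_key_id)
    if secret is None:
        return False
    second_key = sign(secret, payload)
    return hmac.compare_digest(first_key, second_key)

payload = b"DELETE /cluster/node/cn7"
print(authenticate("AK123", sign(b"user-secret", payload), payload))  # True
```

Using a constant-time comparison (`hmac.compare_digest`) rather than `==` avoids leaking key material through timing differences.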
In a feasible design, the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster; or, the cluster Open API includes a machine management API, and the management request is used to request any one of the following operations on any one of the at least one compute node: bringing online, taking offline, restarting, reinstalling, repairing, and shielding.
In a third aspect, an embodiment of the present application provides a model training method applicable to a cluster system of a control node, at least one compute node, and a storage node. The method includes: a target compute node receives a second request sent by the control node, where the second request is sent after the control node receives a first request sent by an application program interface (API) server and allocates a target resource to a target model, the first request is obtained by the API server according to the resource information, required for training the target model, that a first user sends through a client on a first terminal, and the target compute node is included in the at least one compute node; the target compute node trains the target model using the target resource; and the target compute node sends the trained target model to a storage node.
In a feasible design, the resource information includes at least one of the following: the number of target compute nodes, the number of GPUs occupied when the target compute nodes train the target model, and the number of CPUs occupied when the target compute nodes train the target model.
In a feasible design, the above method further includes: the target compute node receives a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resource on the target compute node while the target model is being trained; and the target compute node sends a query response carrying the usage status information of the target resource to the first terminal device, so that the first terminal device displays the usage status of the target resource according to the usage status information.
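A minimal sketch of the query response in this design follows. The field names are hypothetical; in practice, per-GPU utilization on the target compute node might be read from a tool such as nvidia-smi:

```python
# Hedged sketch of the resource-usage query response. Field names are
# illustrative assumptions; the application does not fix a wire format.

def build_query_response(node_name, gpu_utils, cpu_util):
    """Target compute node: package the usage status of the target resource."""
    return {
        "node": node_name,
        "usage": {
            "gpu_percent": gpu_utils,                    # per-GPU utilization
            "gpu_avg": sum(gpu_utils) / len(gpu_utils),  # summary for display
            "cpu_percent": cpu_util,
        },
    }

resp = build_query_response("cn0", [91, 88, 95, 90], 37)
print(resp["usage"]["gpu_avg"])  # 91.0
```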
In a fourth aspect, an embodiment of the present application provides a model training device, including:
a receiving unit, configured to receive a first request sent by an application program interface (API) server, where the first request carries the resource information required for training a target model and is obtained by the API server according to the resource information that a first user sends through a client on a first terminal;
a processing unit, configured to allocate a target resource to the target model according to the resource information; and
a sending unit, configured to send a second request to a target compute node, so that the target compute node trains the target model using the target resource.
In a feasible design, the resource information includes at least one of the following: the number of target compute nodes, the number of GPUs occupied when the target compute nodes train the target model, and the number of CPUs occupied when the target compute nodes train the target model.
In a feasible design, the receiving unit is further configured to receive a management request sent by a second terminal device, where the management request is used to request management of the compute nodes in the cluster system; and
the processing unit is further configured to manage the compute nodes in the cluster system according to the management request.
In a feasible design, when managing the compute nodes in the cluster system according to the management request, the processing unit calls a cluster open application program interface (Open API) to authenticate a second user, and if the second user passes the authentication, manages the compute nodes in the cluster system according to the management request.
In a feasible design, the management request carries an access key identifier of the second user and a first key, where the first key is generated by the second terminal device using a preset authentication mechanism; the processing unit is configured to call the cluster Open API, generate a second key using the preset authentication mechanism, and, if the first key and the second key are the same, determine the management authority of the second user; and the sending unit is further configured to send authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority of the second user according to the authority information.
In a feasible design, the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster; or, the cluster Open API includes a machine management API, and the management request is used to request any one of the following operations on any one of the at least one compute node: bringing online, taking offline, restarting, reinstalling, repairing, and shielding.
In a fifth aspect, an embodiment of the present application provides a model training device, including:
a receiving unit, configured to receive a second request sent by a control node, where the second request is sent after the control node receives a first request sent by an application program interface (API) server and allocates a target resource to a target model, the first request is obtained by the API server according to the resource information required for training the target model that a first user sends through a client on a first terminal, and the target node is included in the at least one compute node;
a processing unit, configured to train the target model using the target resource; and
a sending unit, configured to send the trained target model to a storage node.
In a feasible design, the resource information includes at least one of the following: the number of target compute nodes, the number of GPUs occupied when the target compute nodes train the target model, and the number of CPUs occupied when the target compute nodes train the target model.
In a feasible design, the receiving unit is further configured to receive a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resource on the target compute node while the target model is being trained; and
the sending unit is further configured to send a query response carrying the usage status information of the target resource to the first terminal device, so that the first terminal device displays the usage status of the target resource according to the usage status information.
In a sixth aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method of the second aspect or of any possible implementation of the second aspect.
In a seventh aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the method of the third aspect or of any possible implementation of the third aspect.
In an eighth aspect, an embodiment of the present application provides a computer program product containing instructions that, when run on an electronic device, cause the electronic device to execute the method of the second aspect or of the various possible implementations of the second aspect.
In a ninth aspect, an embodiment of the present application provides a computer program product containing instructions that, when run on an electronic device, cause the electronic device to execute the method of the third aspect or of the various possible implementations of the third aspect.
In a tenth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium storing computer instructions that, when run on an electronic device, cause the electronic device to execute the method of the second aspect or of the various possible implementations of the second aspect.
In an eleventh aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium storing computer instructions that, when run on an electronic device, cause the electronic device to execute the method of the third aspect or of the various possible implementations of the third aspect.
In a twelfth aspect, an embodiment of the present application provides a cluster system, including: a control node and at least one compute node, where the control node establishes a network connection with each of the at least one compute node based on the Transmission Control Protocol (TCP), and the computing resources of each compute node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
An embodiment of the above application has the following advantages or beneficial effects: by interconnecting the control node and at least one compute node over a network and introducing GPUs into the compute nodes as computing resources, the hardware capability of the cluster system is greatly improved, which in turn improves the efficiency of model training. In addition, using the HDFS file system to temporarily store the user execution environment and to store the final running results avoids the drawback of data sets used for model training occupying excessive storage space on the compute nodes, and also avoids the security drawback of placing trained models on the compute nodes.
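As context for the HDFS-based storage mentioned above, uploading a training result to HDFS is commonly done with the standard Hadoop CLI; the sketch below only builds the command, and the job/output path layout is a hypothetical example:

```python
# Sketch: staging a trained model into HDFS with the standard Hadoop
# `hdfs dfs -put` command (-f overwrites an existing file). The path layout
# shown is an illustrative assumption, not one defined by this application.

def hdfs_put_command(local_path, hdfs_dir):
    """Build the Hadoop CLI command that uploads a local file to HDFS."""
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir]

cmd = hdfs_put_command("model.ckpt", "/user/train/job-123/output/")
print(" ".join(cmd))  # hdfs dfs -put -f model.ckpt /user/train/job-123/output/
```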
Other effects of the above optional implementations will be described below in conjunction with specific embodiments.
Brief Description of the Drawings
The drawings are used to provide a better understanding of the solution and do not constitute a limitation on this application. In the drawings:
FIG. 1 is a schematic structural diagram of a cluster system provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of the underlying framework of a cluster system provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of network optimization of a cluster system provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of system-level performance constraint analysis of a cluster system provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of memory monitoring of a compute node of a cluster system provided by an embodiment of the present application;
FIG. 6 is a flowchart of a model training method provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of the HGCP system architecture in the model training method provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of the task submission process in the model training method provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of the Slurm Open API in the model training method provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of the authentication process in the model training method provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of the deployment of the API server in the model training method provided by an embodiment of the present application;
FIG. 12 is a working schematic diagram of the super management platform in the model training method provided by an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a model training device provided by an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a model training device provided by an embodiment of the present application;
FIG. 15 is a block diagram of an electronic device used to implement the model training method of an embodiment of the present application.
具体实施方式Detailed ways
以下结合附图对本申请的示范性实施例做出说明,其中包括本申请实施例的各种细节以助于理解,应当将它们认为仅仅是示范性的。因此,本领域普通技术人员应当认识到,可以对这里描述的实施例做出各种改变和修改,而不会背离本申请的范围和精神。同样,为了清楚和简明,以下的描述中省略了对公知功能和结构的描述。The exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be regarded as merely exemplary. Therefore, those of ordinary skill in the art should realize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted in the following description.
Today, with the rapid development of artificial intelligence, heterogeneous computing platforms composed of both CPUs and GPUs are playing an increasingly important role. In the current era of big data, deep learning performs poorly when the training data set is small, which is one of the reasons deep learning long went unnoticed: a deep learning model trained on a small data set can be outperformed by relatively simple machine learning methods. When the data set is large, however, deep learning begins to outperform other machine learning approaches. High-performance computing (HPC) makes it possible to train models on these larger data sets, which has made HPC an important part of the development of artificial intelligence.
The overall structure of a typical HPC system used for model training can be divided into the following main parts: an external network, a master node, compute nodes, storage, a computation network, and a management network. The computing resources of a compute node include a single-core central processing unit (CPU), a multi-core CPU, or multiple CPUs.
In the HPC system described above, the computing resources of a single compute node are mainly CPUs, whose hardware capabilities are limited. As a result, such an HPC system trains AI models with deep learning at low efficiency.
Meanwhile, high-performance computing clusters form a branch of computer science aimed at solving complex scientific or numerical computations. Such a cluster is a loosely coupled collection of computing nodes (servers) that provides users with services such as high-performance computing, network request handling, and professional applications (including parallel computing, databases, and the web). However, how to manage the computing nodes of a large-scale cluster and how to schedule training tasks remain thorny problems. Although the industry currently uses the Simple Linux Utility for Resource Management (slurm) to manage cluster systems, existing work generally only optimizes how slurm's scheduling plug-ins are used and does not go beyond the slurm framework; that is, the slurm framework itself is not upgraded or optimized.
In view of this, the embodiments of the present application provide a model training method, apparatus, and cluster system. On the hardware side, introducing GPUs as computing resources greatly improves the hardware capability of the cluster system and thus the efficiency of model training. On the software side, optimizing the slurm framework and introducing a client, a super management platform, and the like makes the cluster system more convenient to use. The embodiments of the present application are described in detail below from the two aspects of hardware capability improvement and software capability improvement.
First, hardware capability improvement.
FIG. 1 is a schematic structural diagram of a cluster system provided by an embodiment of the present application. Referring to FIG. 1, the cluster system provided by the embodiment of the present application includes a control node, at least one computing node, and a storage node. The control node establishes a connection with each of the at least one computing node, for example a network connection based on the Transmission Control Protocol (TCP). The computing resources of a computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU). The storage node establishes a network connection with each of the at least one computing node and is used to store the data required for training a target model; the storage node is, for example, a Hadoop Distributed File System (HDFS). The data required for training the target model includes the client, sample data sets, and the like. In addition, after a computing node has trained the target model, the target model is also stored in the storage node. The client is used to submit resource information and the like to an API server, so that the API server integrates the resource information, obtains a first request, and submits it to the control node. The API server is not shown in the figure; in an actual implementation, the API server may be integrated with the control node or deployed independently. R&D personnel can log in to the cluster system through a first terminal device to submit the first request for model training and the like, and an administrator can log in to the cluster system through a second terminal device to perform operations such as creating a cluster, deleting a cluster, bringing machines online, taking machines offline, and blocking machines, where a machine is a computing node.
It should be noted that the first terminal device and the second terminal device may be the same device or different terminal devices, which is not limited in the embodiments of the present application.
In FIG. 1, the computing resources of each computing node include CPUs and GPUs. A computing node is, for example, an all-in-one machine for AI model training with 3 CPUs and 8 GPUs, where the numbers of CPUs and GPUs can be set flexibly. In addition, the computing resources of a computing node may also include a Field-Programmable Gate Array (FPGA) and the like, which is not limited in the embodiments of the present application.
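As a purely illustrative sketch (not part of the claimed embodiments), the cluster structure described above can be modeled as follows; all class names, node names, and resource counts are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class ComputeNode:
    """One compute node; counts mirror the example all-in-one machine above."""
    name: str
    cpus: int = 3
    gpus: int = 8

@dataclass
class Cluster:
    """Control node + storage node + a pool of compute nodes."""
    control_node: str
    storage_node: str
    compute_nodes: list = field(default_factory=list)

    def total_gpus(self) -> int:
        # the pooled GPU capacity the control node can allocate from
        return sum(n.gpus for n in self.compute_nodes)

cluster = Cluster("master", "hdfs://storage",
                  [ComputeNode(f"node{i}") for i in range(4)])
print(cluster.total_gpus())  # 32
```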
The HDFS file system temporarily stores the user's execution environment and stores the final running results. This avoids the drawback of data sets used for model training occupying excessive storage space on the computing nodes, and also avoids the security risk of leaving trained models on the computing nodes.
It should be noted that the number of control nodes in the embodiments of the present application is not limited to one. For example, to prevent the entire cluster system from going down after a control node failure, the embodiments of the present application may provide one master control node and one standby control node; when the master control node fails, the standby control node can be started.
In the cluster system provided by the embodiments of the present application, the control node and the at least one computing node are interconnected through a network, and GPUs are introduced into the computing nodes as computing resources, which greatly improves the hardware capability of the cluster system and thus the efficiency of model training. In addition, using the HDFS file system to temporarily store the user's execution environment and the final running results avoids the drawback of training data sets occupying excessive storage space on the computing nodes, and also avoids the security risk of leaving trained models on the computing nodes.
For clarity, the existing cluster system is hereinafter referred to as a high-performance computing (HPC) system, and the cluster system provided by the embodiments of the present application is referred to as a High-Performance GPU Cluster Platform (HGCP).
The hardware improvements are described in detail below in terms of the underlying framework, task scheduling, network optimization, performance profiling tools, computing nodes, real-time cluster monitoring, and cluster operation and maintenance management.
A. Underlying framework.
FIG. 2 is a schematic diagram of the underlying framework of the cluster system provided by an embodiment of the present application. Referring to FIG. 2, the cluster system provided by the embodiment of the present application includes, from bottom to top, six layers: chip, system design, performance optimization, cluster, framework, and application. The chip layer includes various computing resources, such as CPUs, GPUs, FPGAs, Application Specific Integrated Circuits (ASICs), and other AI chips. The system design layer includes cloud and edge AI all-in-one machines, high-performance storage pools, high-speed interconnect architectures, and the like. The performance optimization layer includes computation optimization, input/output (IO) optimization, communication optimization, and the like. The cluster layer includes K8S (Kubernetes) cloud native, intelligent scheduling, automatic scaling, and the like. The framework layer includes deep learning frameworks such as PaddlePaddle, TF, and Torch. The application layer includes video, images, natural language understanding, search, recommendation, advertising, and the like.
Referring to FIG. 2, the cluster system provided by the present application is based on the slurm open-source Linux cluster resource management system, which has good scalability and high fault tolerance. In addition to slurm's inherent functions, the HGCP provided by the embodiments of the present application also has complete training-task lifecycle management, machine management, and fault monitoring capabilities, with a very high degree of automation. Slurm's inherent functions include resource management and rich job scheduling functions, such as simple first-in-first-out (FIFO) queuing, job priority calculation, and resource preemption, and it provides good support for many different implementations of the Message Passing Interface (MPI). In addition, the cluster system provided by the embodiments of the present application also supports the allocation of general computing resources such as GPUs, network bandwidth, and even memory.
B. Task scheduling.
Existing HPC systems only use a few basic scheduling policies provided by slurm, such as FIFO. In the embodiments of the present application, to enable high-speed circulation of AI training tasks in the cluster system, the HGCP builds an efficient task scheduling system on top of slurm. It takes full account of the amount of high-priority resources owned by each business line and of the training tasks actually running and pending in the cluster, pools all resources, sets a high-priority logical quota for each business line, and specifies the GPU usage ratio between single-node tasks and multi-node tasks. This reduces the impact of resource fragmentation, effectively reduces idle cluster resources, improves the usage efficiency of GPU cluster resources, and lowers operating costs.
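A minimal sketch of the quota-aware scheduling idea described above, assuming a simple priority queue; the business names, quota values, and greedy policy are illustrative assumptions, not the claimed scheduler:

```python
import heapq

def schedule(tasks, free_gpus, quota):
    """tasks: list of (priority, business, gpus_needed).
    Higher priority runs first; no business may exceed its GPU quota."""
    used = {b: 0 for b in quota}
    heap = [(-p, i, b, g) for i, (p, b, g) in enumerate(tasks)]
    heapq.heapify(heap)
    launched = []
    while heap and free_gpus > 0:
        _, _, biz, need = heapq.heappop(heap)
        # launch only if the cluster and the business quota both allow it
        if need <= free_gpus and used[biz] + need <= quota[biz]:
            used[biz] += need
            free_gpus -= need
            launched.append((biz, need))
    return launched, free_gpus

launched, left = schedule(
    [(10, "vision", 8), (5, "nlp", 16), (8, "vision", 8)],
    free_gpus=24, quota={"vision": 8, "nlp": 16})
print(launched, left)  # [('vision', 8), ('nlp', 16)] 0
```

Here the second "vision" task is held back by its quota, so the lower-priority "nlp" task fills the remaining GPUs instead of leaving them idle.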
C. Network optimization.
Generally speaking, network communication is a major bottleneck in deep learning training. Deep learning computing tasks involve a large amount of computation and many intermediate results, which requires the cluster system to have an efficient message passing mechanism and the ability to store and access massive amounts of data; the efficiency of both depends largely on network speed. Most slurm-based HPC systems in the prior art use the Message Passing Interface (MPI) for message passing and parallel processing, which has two problems: slow message passing and high system CPU usage. At the same time, the network hardware of the computing nodes themselves also limits communication capability. To solve these problems, the HGCP provided by the embodiments of the present application optimizes the network: any two of the at least one computing node establish a network connection based on Infiniband interconnection; inside a computing node, the CPUs and GPUs are connected through Peripheral Component Interconnect Express (PCIE); and the GPUs inside a computing node are connected to each other through NVLink. For an example, see FIG. 3.
FIG. 3 is a schematic diagram of network optimization of the cluster system provided by an embodiment of the present application. Referring to FIG. 3, two computing nodes are shown, a first computing node and a second computing node. Each computing node includes a CPU node and a GPU box (BOX). The CPU node contains CPU1 and CPU2. The GPU BOX contains three Non-Volatile Memory Express (NVMe) drives, i.e., hard disks; in addition, the GPU BOX includes the eight GPUs shown in the figure, as well as a network interface controller (NIC), PCIE switches (PCIE SW), and the like. Solid arrows in the figure indicate PCIE connections, and dashed arrows indicate NVLink connections. It should be noted that, although for clarity the GPU part of the first computing node only shows PCIE connections and the GPU part of the second computing node only shows NVLink connections, in practice the GPU part of each computing node includes both PCIE connections and NVLink connections.
The cluster system provided by the embodiments of the present application uses Infiniband (IB), a new I/O bus technology based on full-duplex, switched serial transmission, to replace the MPI communication commonly used in existing cluster systems, which simplifies and speeds up the connections between computing nodes. At the same time, within a computing node the CPUs and GPUs are connected via PCIE, and the GPUs are interconnected via high-speed NVLink, which greatly improves communication between the GPU cards inside a computing node. Since PCIE, NVLink, and Ethernet/Remote Direct Memory Access (RDMA) networks differ widely in bandwidth and latency, the optimal combination of resources needs to be allocated; the HGCP provided by the embodiments of the present application adopts topology-aware scheduling to optimize communication bandwidth.
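The topology-aware choice among links of very different bandwidths can be sketched as follows; the bandwidth figures and the two-tier selection rule are assumptions for illustration, not measured values:

```python
# illustrative bandwidths in GB/s; real values depend on the hardware generation
LINKS = {"nvlink": 300, "pcie": 32, "infiniband": 25, "ethernet": 10}

def best_link(same_node: bool) -> str:
    """Pick the fastest interconnect available for a GPU-to-GPU transfer:
    NVLink/PCIE inside a node, Infiniband/Ethernet across nodes."""
    candidates = ("nvlink", "pcie") if same_node else ("infiniband", "ethernet")
    return max(candidates, key=LINKS.get)

print(best_link(True), best_link(False))  # nvlink infiniband
```

A topology-aware scheduler extends this idea by co-locating communication-heavy GPU groups on the same node so that the faster intra-node links carry most of the traffic.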
D. Performance profiling tools.
Cluster utilization is usually the core evaluation metric: increasing utilization is equivalent to reducing machine usage costs, while also helping business training programs with profiling and achieving good results in performance optimization. However, existing HPC systems have no system-level fine-grained performance analysis tools. To perform performance analysis, the usual approach is to analyze a single node after negotiating with the business line; human intervention is required from startup through data collection to data analysis, and the work must be coordinated with the business line's training launches. Problems can only be analyzed case by case, which is inefficient and unsuitable for large-scale adoption.
The HGCP provided by the embodiments of the present application uses a Deep learning system Performance profiler (Dperf) for performance analysis. Dperf is a system-level, one-stop performance profiling and bottleneck locating system for deep learning training. The tool uniformly captures and coaxially displays the traffic information of key computing nodes on data paths such as NET, IO, H2D, and P2P together with the utilization information of key computing resources such as the CPU, Double Data Rate (DDR) memory, and Graphics Double Data Rate (GDDR) memory, making it convenient for business lines to locate program bottlenecks and optimize accordingly. At the same time, the Dperf training tool is combined with cluster task scheduling to automatically monitor the tasks of the GPU training cluster in a census-like manner. On one hand, this helps cluster administrators understand the usage and bottlenecks of each business line and improve overall cluster utilization. On the other hand, it helps developers monitor resource utilization, guide parameter tuning, and enhance scalability, while also helping locate hardware constraints and tune hardware configurations. The Dperf provided by the embodiments of the present application has the advantages of low overhead, multiple dimensions, easy extensibility, fine granularity, and visualization. For an example, refer to FIG. 4.
FIG. 4 is a schematic diagram of system-level performance constraint analysis of the cluster system provided by an embodiment of the present application. Referring to FIG. 4, the entire deep learning training process involves environment preparation, data reading, data preprocessing, forward training, backward training, and parameter updating. Data storage is constrained by the CPU, main memory, and hard disk IO, while the training process is affected by factors such as uplink and downlink bandwidth and GPU memory. The Dperf system-level performance analysis tool is used to analyze which hardware constrains a program. For example, if data reading and preprocessing take a long time while the system has plenty of spare CPU and disk resources, more data processing processes can be started to increase data processing speed. If the training program spends a long time waiting for training data, data processing and training can be executed asynchronously to reduce the waiting time.
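The asynchronous overlap of data preprocessing and training suggested above can be sketched with a bounded producer/consumer queue; the simple arithmetic stands in for real preprocessing and training steps:

```python
import queue
import threading

def preprocess(batches, q):
    # producer: data reading + preprocessing runs ahead of training
    for b in batches:
        q.put(b * 2)           # stand-in for real preprocessing
    q.put(None)                # sentinel: no more data

def train(q, results):
    # consumer: a training step never waits longer than one queue get
    while (b := q.get()) is not None:
        results.append(b + 1)  # stand-in for a training step

q = queue.Queue(maxsize=4)     # bounded prefetch buffer
results = []
p = threading.Thread(target=preprocess, args=(range(5), q))
t = threading.Thread(target=train, args=(q, results))
p.start(); t.start(); p.join(); t.join()
print(results)  # [1, 3, 5, 7, 9]
```

The bounded queue keeps preprocessing from running arbitrarily far ahead (limiting memory use) while still hiding preprocessing latency behind training.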
E. Computing nodes.
The computing nodes of current HPC systems are limited by the number of GPU cards, communication, power consumption, heat dissipation, and other issues; their computing power density is low and cannot meet the demands of model training tasks. The HGCP provided by the embodiments of the present application uses GPU-equipped computing nodes with high computing density and high heat-dissipation efficiency, and supports systematized hardware modules, standardized interconnect interfaces, and flexible interconnect topologies. It leads the hardware development direction of AI computing, participates in and guides the development of the AI hardware platform, and effectively supports the cluster's AI training tasks.
F. Real-time cluster monitoring.
Current HPC systems lack real-time fine-grained monitoring of each computing node and computing task, and cannot uniformly capture and coaxially display the utilization information of key resources such as the CPU, DDR, GPU, and GDDR. Users and administrators can only log in to the physical nodes to check machine status, or passively learn of faults from the business lines, which greatly affects cluster operating efficiency. In the HGCP provided by the embodiments of the present application, in order to monitor and analyze the operation of the cluster system and collect parameters for system scheduling, a monitoring platform and hardware monitoring plug-ins (Hadoop Authentication Service, HAS) are deployed in the HGCP cluster. Key performance data, such as the CPU, GPU, memory, network, and storage of functional components such as the control node and computing nodes of the HGCP cluster, are collected through real-time monitoring and displayed intuitively in graphical form, so that the operating status of the hardware environment can be understood, potential faults hidden in the HGCP can be discovered in time, and solutions can be provided immediately. For an example, see FIG. 5, which is a schematic diagram of memory monitoring of a computing node of the cluster system provided by an embodiment of the present application. Referring to FIG. 5, the memory occupation of a computing node from 14:40 to 15:40 is shown by the waveform in the figure.
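A toy sketch of the kind of periodic metric collection described above; the values below are random placeholders, whereas a real deployment would read hardware counters on each node:

```python
import random

def sample_metrics(node):
    """Hypothetical collector: one snapshot of a node's key resources."""
    return {"node": node,
            "cpu_util": random.uniform(0, 100),   # percent
            "gpu_util": random.uniform(0, 100),   # percent
            "mem_used_gb": random.uniform(0, 256)}

def poll(nodes, rounds=2):
    """Collect one snapshot per node per round; the monitoring platform
    would timestamp these and render them as waveforms (cf. FIG. 5)."""
    history = []
    for _ in range(rounds):
        history.extend(sample_metrics(n) for n in nodes)
    return history

samples = poll(["node1", "node2"])
print(len(samples))  # 4
```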
G. Cluster operation and maintenance management.
At present, as the scale of HPC clusters continues to expand and computing nodes are continuously added, deploying the standard operating environment on computing nodes becomes a routine, time-consuming, and laborious task. Current HPC systems do not provide an efficient, standard operation and maintenance solution: fault discovery, locating, repair reporting, and restoration all require manual intervention, which is inefficient and wastes effort. Meanwhile, a faulty computing node can be regarded as an idle computing node, and idleness equals waste. Table 1 lists common operation and maintenance operations and their operating times.
Table 1

| Operation | Method | Average operating time |
| --- | --- | --- |
| New machine environment configuration | Automated script | 20 min |
| Bringing a machine online into the cluster | Manual operation | 30 min |
| Cluster resource queue adjustment | Manual operation | 10 min |
| Machine fault repair | Maintenance personnel access | 1 day |
| Fault information statistics | Manual operation | 1 hour |
| Cluster environment upgrade | Manual operation | 1 day |
For the HGCP provided by the embodiments of the present application, the operation and maintenance workflow is sorted out at the beginning of construction: work must be turned into processes, the processes standardized, and the standards automated. At the same time, operation and maintenance automation cannot solve all problems, and automation should not be pursued for its own sake. The 20% of work that is repetitive consumes 80% of the time and effort, so concentrating on doing that 20% well is basically enough to reach a good state. The cluster's automated operation and maintenance tools are designed to manage a large number of computing nodes through a single graphical user interface. The HGCP cluster provided by the embodiments of the present application performs machine management through the super management platform system.
Second, software capability improvement.
FIG. 6 is a flowchart of the model training method provided by an embodiment of the present application. This embodiment describes the model training method of the embodiments of the present application in detail from the perspective of the interaction between the control node and the computing nodes, and includes the following steps:
100. The client on the first terminal sends the resource information required for training the target model to the API server.
101. The control node receives a first request sent by the application programming interface (API) server.
The first request is obtained by the API server from the resource information required for training the target model, which is sent by the first user through the client on the first terminal.
Illustratively, current HPC systems have no user client, and training a model on HPC is cumbersome for the first user: configuring training scripts, accessing training data, and retrieving training results must all be done directly in the HPC system, which does not encapsulate its own functions well and greatly wastes the first user's time. The first user usually refers to a researcher, such as an R&D engineer, who trains models; the model may be any of various AI models, such as a face recognition model or a face detection model, which is not limited in the embodiments of the present application. In the model training method provided by the embodiments of the present application, the HGCP is encapsulated in advance to obtain a client, which is stored on HDFS for the first user to download. The first user downloads and installs the client on the first terminal device, and the client is used to submit training tasks to the HGCP.
102. The control node allocates target resources to the target model according to the resource information.
Training task management in current HPC systems is coarse. Although they support multiple tenants, i.e., can be used by multiple first users at the same time with different first users training different target models, the usage demands of different first users have peaks and troughs. Most existing slurm-based HPC systems use the FIFO queuing mechanism by default, with no priority limits and no support for oversubscription, so some first users' resources sit idle while other first users have no resources available. In the embodiments of the present application, the computing resources of the HGCP include CPUs, GPUs, memory, FPGAs, and the like. A configuration interface is displayed on the display interface of the first terminal device for the first user to configure the number of computing nodes required for training the target model and, for each computing node, which CPUs, GPUs, etc. of that node need to be occupied. The first terminal device generates the resource information required for training the target model according to the configuration input by the user and sends it to the API server; the API server integrates the resource information to generate the first request and sends it to the control node. After receiving the first request sent by the API server, the control node allocates computing resources to the target model according to the first request. For example, if the resource information carried in the first request is 4 computing nodes and 16 GPUs, the control node allocates 4 computing nodes to the target model. Assuming each computing node has 8 GPUs, the 4 computing nodes may each provide 4 GPUs for the target model, or the 4 computing nodes may provide 4, 6, 2, and 4 GPUs, respectively.
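The allocation example above (16 GPUs spread over 4 computing nodes with uneven free capacity) can be sketched with a simple greedy pass; the node names and the first-fit policy are illustrative assumptions, not the claimed allocation method:

```python
def allocate(request_gpus, nodes):
    """nodes maps node name -> free GPU count.
    Returns {node: gpus_taken} or None if the request cannot be satisfied."""
    plan, remaining = {}, request_gpus
    for name, free in nodes.items():
        if remaining == 0:
            break
        take = min(free, remaining)
        if take:
            plan[name] = take
            remaining -= take
    return plan if remaining == 0 else None

# 16 GPUs over 4 nodes whose free capacity happens to be uneven
print(allocate(16, {"node1": 4, "node2": 6, "node3": 2, "node4": 8}))
# {'node1': 4, 'node2': 6, 'node3': 2, 'node4': 4}
```

A real control node would also weigh topology (keeping the request on as few nodes as possible) rather than taking nodes in dictionary order.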
103. The control node sends a second request to a target computing node.

The target computing node is a computing node that contains the target resource.

Exemplarily, after configuring the target resource for the target model, the control node sends a second request to the computing node that contains the target resource, to trigger the target computing node to train the target model.
104. The target computing node uses the target resource to train the target model.

105. The target computing node stores the trained target model in the storage node.

Continuing with the example in step 102, in steps 103 to 105, assume that the target computing nodes providing the 16 GPUs are computing node 1, computing node 2, computing node 3, and computing node 4. These four computing nodes, as target computing nodes, train the target model in a distributed manner; after training completes, each node stores its trained part in the storage node, for example in HDFS.

In the model training method provided by the embodiments of the present application, after receiving the first request sent by the API server, the control node allocates a target resource to the target model according to the first request and sends a second request to the target computing node that contains the target resource, to trigger the target computing node to perform model training and store the trained model in the HDFS system. With this solution, the user submits training tasks through a pre-packaged client, without editing scripts on a command line; the process is simple and greatly improves the efficiency of model training.
In the embodiments of the present application, the software improvements broadly comprise improvements to the system architecture and improvements to the slurm open application programming interface (Open API). The two improvements are described in detail below.

First, the system architecture.
FIG. 7 is a schematic diagram of the system architecture of HGCP in the model training method provided by an embodiment of the present application. Referring to FIG. 7, the HGCP system provided by the embodiments of the present application achieves complete isolation of users and resources. The first user downloads and installs a client from the HDFS system and sends, through the client, the resource information required to train the target model to the API server, so that the API server integrates the resource information into a first request and submits the first request for training the model to the control node. While the target task is running on the target node, the first user can send a query request to the target computing node through the first terminal device; the query request is used to request display of the usage status of the target resource while the target resource on the target computing node trains the target model. After receiving the query request, the target computing node obtains the running status of the training task and collects the data generated during the run; the target computing node then sends a query response carrying the usage status information of the target resource to the first terminal device, so that the first terminal device displays the usage status of the target resource according to the usage status information.
After the target model is trained, the target model is saved to the HDFS system, and the first user or other users can download the final result from the HDFS system. Each module in FIG. 7 is described in detail below.
a. Client.

In the embodiments of the present application, the first user can, from anywhere, download and install the client stored on the HDFS system and send resource information to the API server through the client, so that the API server integrates the resource information into a first request and sends the first request to the control node; one first request can be regarded as one task. The resource information carried in the first request includes at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing nodes train the target model, the number of CPUs occupied when the target computing nodes train the target model, the path of the HDFS system, and the user name or password of the HDFS. The backend corresponding to the client performs operations such as task submission, viewing, termination, and retrieval of training data through the slurm Open API described in the embodiments of the present application; job submission uses an asynchronous submission mode. For an example, see FIG. 8.
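As a concrete illustration of the resource information a client might send, the backend could assemble a payload like the one below. The field names and the JSON encoding are assumptions for illustration only; the disclosure specifies which pieces of information the first request carries, not their wire format.

```python
import json

def build_first_request(nodes, gpus, cpus, hdfs_path, hdfs_user):
    """Assemble the resource information for one training task.

    Hypothetical field names; the embodiment only enumerates the kinds of
    information (node count, GPU/CPU counts, HDFS path and credentials).
    """
    return json.dumps({
        "target_nodes": nodes,        # number of target computing nodes
        "gpus": gpus,                 # GPUs occupied during training
        "cpus": cpus,                 # CPUs occupied during training
        "hdfs_path": hdfs_path,       # where results are stored
        "hdfs_user": hdfs_user,       # HDFS credential (user name)
    })

req = build_first_request(4, 16, 32, "hdfs://cluster/models/job-001", "alice")
print(req)
```

The API server would then integrate such a payload (together with authentication data) into the first request it submits to the control node.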
FIG. 8 is a schematic diagram of the task submission process in the model training method provided by an embodiment of the present application. Referring to FIG. 8, the first user submits a job to the upper layer through the client on the first terminal device; the API server authenticates the request, and the job is stored in the database after authentication passes. The job manager running on the control node then fetches jobs awaiting submission from the database and submits them to HGCP, and the job sync controller (Job SyncController) running on the computing node synchronizes the job running status to the monitor server and the slurm resource management system.
b. HDFS system.

In the HGCP provided by the embodiments of the present application, the HDFS system temporarily stores the user execution environment and stores the final trained model, where the user execution environment is the aforementioned client. In addition, the embodiments of the present application are not limited to the HDFS system; in other feasible implementations, a file system private to the first user may also be used.
c. Resource scheduler.

Exemplarily, the resource scheduler is a module on the control node and is used to allocate the target resource to the target model according to the first request. The granularity of resource allocation is the GPU rather than the computing node: if one model training task of the first user does not use up all the GPUs on a target computing node, the target computing node and its remaining GPUs can be allocated to other training tasks. The scheduler supports mixed scheduling of CPUs and GPUs. For example, when the first user submits a training task whose required resource is GPUs but does not use up all the GPUs, other users can still submit training tasks that use the remaining GPUs.
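The GPU-granularity sharing described above can be sketched as follows. The class, its methods, and the first-fit strategy are hypothetical, since the disclosure does not specify the scheduler's internal algorithm; the point illustrated is only that leftover GPUs on a node remain available to later tasks.

```python
class GpuScheduler:
    """Sketch of a scheduler that allocates at GPU granularity."""

    def __init__(self, node_gpus):
        self.free = dict(node_gpus)  # node name -> free GPU count

    def submit(self, task, gpus_needed):
        """Reserve GPUs for `task`; return the plan, or None to queue it."""
        plan, remaining = {}, gpus_needed
        for node, free in self.free.items():
            if remaining == 0:
                break
            take = min(free, remaining)
            if take:
                plan[node] = take
                remaining -= take
        if remaining:
            return None  # not enough free GPUs; the task must wait
        for node, take in plan.items():
            self.free[node] -= take
        return plan

sched = GpuScheduler({"node1": 8, "node2": 8})
print(sched.submit("job-a", 12))  # spans both nodes: {'node1': 8, 'node2': 4}
print(sched.submit("job-b", 4))   # fits in node2's leftover GPUs
print(sched.submit("job-c", 2))   # nothing left: None (queued)
```

Note that job-b runs on GPUs left over by job-a on node2, which is exactly the sharing behavior the paragraph above describes.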
d. Resources.

In the embodiments of the present application, resources are divided at the granularity of computing nodes and GPUs, and one training task can run on different GPUs of different computing nodes.
Second, the slurm Open API.

e. Overall architecture.

Exemplarily, see FIG. 9, which is a schematic diagram of the slurm Open API in the model training method provided by an embodiment of the present application. Referring to FIG. 9, the architecture includes:
third-party platforms: deep learning platforms such as PaddlePaddle Cloud (paddle cloud);

cluster component: the slurm cluster client;

API server: the unified entrance of the slurm Open API, responsible for route resolution, request processing, and so on;

authentication: the slurm cluster authentication service module;

database: the XDB data platform, which stores data such as user permissions, job information, and queue quotas;

job manager: used for job management control, responsible for job queuing and submission control;

job sync controller: responsible for synchronizing data such as job status, GPU utilization, GPU slots, node rank, and time;

queue sync controller (Queue SyncController): responsible for pushing queue update events (new queue, queue quota update, and so on) to third-party platforms;

node monitoring service (MonitorServer): deployed on each computing node, providing the running data of training jobs on that computing node.
f. Interface authentication.

Open API interface authentication is mainly used to authenticate the identity of a request and judge the legitimacy of the current request. Common methods include token authentication and AK/SK authentication; for interface access security, AK/SK authentication is used herein. In a feasible implementation, the control node receives a management request sent by a second user through a second terminal device, where the management request carries the access key ID of the second user and a first key, and the first key is generated by the second terminal device using a preset authentication mechanism. When the control node calls the cluster Open API to authenticate the second user, the control node generates a second key using the same preset authentication mechanism. If the first key and the second key are the same, the control node determines the management authority of the second user and, according to the management authority, sends the second terminal device a data stream for updating the management platform graphical interface, so that the second terminal device updates and displays the management platform graphical interface, through which the second user manages the cluster system.

Exemplarily, when AK/SK authentication is used, the access key ID identifies the second user, and the first key is, for example, derived from a secret access key (SK), which the second user uses to encrypt the authentication string and the service uses to verify the authentication string; the SK must be kept secret. After receiving the management request sent by the second user, the control node generates an authentication string using the same preset authentication mechanism, referred to below as the second key. The control node then compares the first key in the management request with the generated second key. If the two keys are the same, the control node grants the second user the corresponding management authority and performs the related operations; if the two keys are not the same, the control node ignores the operation and returns an error code to the second terminal device.
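A minimal sketch of this AK/SK check, assuming an HMAC-SHA256 signature over a few request fields: both sides derive an authentication string from the shared SK, and the server compares its recomputed value with the one the client sent. The string-to-sign, key store, and function names are illustrative assumptions; the embodiment does not disclose its exact signature scheme.

```python
import hashlib
import hmac

# Hypothetical server-side key store: access key ID -> secret access key (SK)
SECRET_KEYS = {"AKIDEXAMPLE": b"my-secret-access-key"}

def sign(sk, method, path, timestamp):
    """Derive the authentication string (the 'key' exchanged in the request)."""
    msg = f"{method}\n{path}\n{timestamp}".encode()
    return hmac.new(sk, msg, hashlib.sha256).hexdigest()

def authenticate(access_key_id, first_key, method, path, timestamp):
    """Recompute the second key server-side and compare with the first key."""
    sk = SECRET_KEYS.get(access_key_id)
    if sk is None:
        return False
    second_key = sign(sk, method, path, timestamp)
    return hmac.compare_digest(first_key, second_key)

client_sig = sign(b"my-secret-access-key", "POST", "/v1/jobs", "1600000000")
print(authenticate("AKIDEXAMPLE", client_sig, "POST", "/v1/jobs", "1600000000"))  # True
print(authenticate("AKIDEXAMPLE", client_sig, "POST", "/v1/jobs", "1600000001"))  # False
```

`hmac.compare_digest` is used for the comparison so that the check runs in constant time, which is standard practice for signature verification.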
FIG. 10 is a schematic diagram of the authentication process in the model training method provided by an embodiment of the present application. Referring to FIG. 10, the second user sends the AK/SK to the authentication service on the control node through the client on the second terminal device, and the authentication service returns a token to the second terminal device. The second user then sends a management request and the token to the API service on the control node through the client on the second terminal device, and the API service sends a management response to the second terminal device according to the management request and the token.

In the embodiments of the present application, the second user is an administrator and can be divided into multiple levels, such as cluster administrator, department administrator, and ordinary user. Exemplarily, see Table 2.
Table 2

(Table 2 is provided as images in the original filing; it lists the permissions assigned to each administrator level.)
As can be seen from Table 2, different permissions can be set for different second users.
g. API deployment.

Exemplarily, see FIG. 11, which is a schematic diagram of the deployment of the API server in the model training method provided by an embodiment of the present application. Referring to FIG. 11, for service stability, the API server is deployed on three servers: server1, server2, and server3. Server1 deploys job_manager, job_sync_controller, and 4 apiserver instances; server2 and server3 each deploy 1 nginx instance and 8 apiserver instances. The apiserver instances are bound to nginx, and nginx is in turn bound to BGW.
h. Super-administrator platform.

In the embodiments of the present application, a super-administrator platform is set up for HGCP to perform machine management, cluster management, and so on, and mainly provides administrators and users with the following main features:
1) Convenient management: through the HGCP super-administrator platform, the administrator can bring online, pause, start, restart, or take offline any selected node; the administrator can also select computing nodes in batches and, with a single mouse click, issue a command to the selected nodes in broadcast form;

2) Modularity: the super-administrator platform runs on a Linux server and uses a MySQL database to store statistics, monitoring, configuration, log, and other data; the backend is integrated into modules with general functions, developed in Hypertext Preprocessor (PHP), Python, Ansible, and Shell, and operates on database data and computing nodes through the platform's API interface; the front-end display pages serve cluster administrators and ordinary users, simplifying operations as much as possible and improving efficiency;

3) Efficient concurrency: for environment installation and software upgrades on computing nodes, the administrator can issue standard environment configuration packages to all or some of the nodes in the cluster;

4) Reliability: the HGCP provided by the embodiments of the present application is equipped with multiple control nodes to guarantee the disaster tolerance and service continuity of the management system; these control nodes remotely manage the cluster computing nodes through Ansible to perform environment configuration, upgrade adjustment, system checks, and other operations.
During management of the cluster system, the control node receives a management request sent by the second user through the second terminal device, where the management request is used to request management of the computing nodes in the cluster system and is obtained by the second terminal device according to the user's operations on the management platform graphical interface. The control node then calls the cluster Open API to authenticate the second user; if the second user passes authentication, the control node manages the computing nodes in the cluster system according to the management request. Exemplarily, refer to FIG. 12, which is a working diagram of the super-administrator platform in the model training method provided by an embodiment of the present application. Referring to FIG. 12, the cluster Open API includes a cluster management API and a machine management API. The screen of the second terminal device displays the management platform graphical interface of the super-administrator platform; the cluster administrator performs cluster operations through the graphical interface, which calls down into the cluster management API or the machine management API. When the cluster management API is called, the management request is used to request creation or deletion of a cluster; based on the call, the cluster information in the database is configured, and the underlying cluster management module (cluster_manager) detects the new operation task in the database and begins the related operations. When the machine management API is called, the management request is used to request that any one of the at least one computing node perform any of the following operations: going online, going offline, restart, reinstallation, repair, or shielding; based on the call, the node information in the database is configured, and the node management module (node_manager) detects the new operation task in the database and begins the related operations.

Next, the operations on a cluster and the operations on a node are described in detail.
Operations on a cluster include:

1. Creating a cluster.

During creation, the parameters are first validated, including whether the cluster already exists and whether the bring-online parameters are legal. The cluster task is then written into the cluster operation task table (cluster_task), with the task operation (task_op) set to install and the task status (task_status) set to pending; the actual bring-online operation is finally completed by the cluster manager.
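The enqueue step of this create-cluster flow can be sketched as follows, using an in-memory SQLite database in place of the actual database; the table layouts and the validation performed are assumptions based on the description above, and the cluster manager is the separate process that would later consume the pending row.

```python
import sqlite3

# Minimal stand-ins for the cluster_info and cluster_task tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cluster_info (name TEXT PRIMARY KEY)")
db.execute("CREATE TABLE cluster_task (name TEXT, task_op TEXT, task_status TEXT)")

def create_cluster(name):
    """Validate, then enqueue a pending 'install' task for the cluster manager."""
    exists = db.execute(
        "SELECT 1 FROM cluster_info WHERE name = ?", (name,)
    ).fetchone()
    if exists:
        raise ValueError(f"cluster {name!r} already exists")
    db.execute(
        "INSERT INTO cluster_task VALUES (?, 'install', 'pending')", (name,)
    )

create_cluster("hgcp-a100")
print(db.execute("SELECT * FROM cluster_task").fetchall())
# -> [('hgcp-a100', 'install', 'pending')]
```

The same write-a-pending-row pattern applies to deletion (task_op set to uninstall) and to the node_task table used by the machine operations below.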
2. Deleting a cluster.

For cluster security, the administrator must first take all machines in the cluster offline before the cluster can be deleted. During deletion, the parameters are first validated, including whether the cluster exists, whether the cluster still has running machines, and whether the parameters are legal. The cluster task is then written into the cluster operation task table (cluster_task), with the task operation (task_op) set to uninstall and the task status (task_status) set to pending; the actual take-offline operation is finally completed by the cluster manager.
3. Cluster basic information list.

The cluster basic information list is built from the cluster_info table and the cluster_task table: the cluster_info table contains information on clusters that are already online and running, and the cluster_task table contains information on clusters that are still in a workflow. If the two tables refer to the same cluster, for example when a take-offline operation is in progress, the status in the cluster_task table takes precedence.
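This precedence rule can be sketched as a small helper; the row shapes and field names are hypothetical.

```python
def effective_status(cluster, info_rows, task_rows):
    """Resolve a cluster's displayed status across the two tables.

    If the cluster has a row in cluster_task (an operation in progress),
    that status wins; otherwise fall back to cluster_info.
    """
    for row in task_rows:          # cluster_task takes precedence
        if row["name"] == cluster:
            return row["task_status"]
    for row in info_rows:          # otherwise the online record
        if row["name"] == cluster:
            return row["status"]
    return None                    # unknown cluster

info = [{"name": "c1", "status": "running"}]
tasks = [{"name": "c1", "task_status": "pending"}]
print(effective_status("c1", info, tasks))  # pending (operation in progress)
print(effective_status("c1", info, []))     # running
```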
4. Cluster details list.

In the embodiments of the present application, only clusters in the running state can call the details interface. The cluster_info table contains the clusters that are already online, and the node_info table aggregates the required information.
5. Cluster machine list display.

The cluster machine list is built from the node_info table and the node_task table: the node_info table provides the list of machines that are already online, and the node_task table provides the list of machines that are still in a workflow.
6. Bringing a machine online.

In the embodiments of the present application, bringing a machine online is an operation whose effect is to expand the capacity of the cluster system. During the bring-online process, the parameters are first validated, including whether the cluster exists and whether the bring-online parameters are legal. The bring-online task is then written into the node operation task table (node_task), with the task operation (task_op) set to install and the task status (task_status) set to pending; the information of the node to be brought online is written into the node information table (node_info) with its state marked as installing. The node manager finally completes the actual bring-online operation and updates the task and info tables.
7. Taking a machine offline.

In the embodiments of the present application, when a machine is taken offline, it is automatically marked as unschedulable first, and the take-offline workflow is then executed. During the take-offline process, the parameters are first validated, including whether the cluster exists and whether the take-offline parameters are legal. The node information table (node_info) is then queried; if the node's earlier workflow failed, the node is simply deleted from the node_info table. Otherwise, the take-offline task is written into the node operation task table (node_task), with the task operation (task_op) set to uninstall and the task status (task_status) set to pending, and the node information is written into the node information table (node_info). The node manager finally completes the actual take-offline operation.
8. Changing a machine's shielded state.

During the change, the parameters are first validated, including whether the machine has joined the cluster and whether the parameters are legal. The cluster information table (cluster_info) is then queried to obtain the cluster's apiserver address, and the apiserver interface is called to complete the state shielding.
9. Changing a machine's ownership.

During the change, the parameters are first validated, including whether the machine has joined the cluster and whether the parameters are legal. The cluster information table (cluster_info) is then queried to obtain the cluster's apiserver address, and the apiserver interface is called to complete the change of the machine's ownership label.
The foregoing describes specific implementations of the model training method mentioned in the embodiments of the present application. The following are apparatus embodiments of the present application, which can be used to execute the method embodiments of the present application. For details not disclosed in the apparatus embodiments of the present application, refer to the method embodiments of the present application.

FIG. 13 is a schematic structural diagram of a model training apparatus provided by an embodiment of the present application. The apparatus can be integrated in an electronic device or implemented by an electronic device, and the electronic device can be a terminal device, a server, or the like. As shown in FIG. 13, in this embodiment, the model training apparatus 100 may include:
a receiving unit 11, configured to receive a first request sent by an application programming interface (API) server, where the first request is obtained by the API server according to the resource information, required for training a target model, sent by a first user through a client on a first terminal;

a processing unit 12, configured to allocate a target resource to the target model according to the resource information; and

a sending unit 13, configured to send a second request to a target computing node, so that the target computing node uses the target resource to train the target model.
In a feasible design, the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing nodes train the target model, and the number of CPUs occupied when the target computing nodes train the target model.
In a feasible design, the receiving unit 11 is further configured to receive a management request sent by a second terminal device, where the management request is used to request management of the computing nodes in the cluster system;

the processing unit 12 is further configured to manage the computing nodes in the cluster system according to the management request.

In a feasible design, when managing the computing nodes in the cluster system according to the management request, the processing unit 12 calls the cluster Open API to authenticate a second user, and if the second user passes authentication, manages the computing nodes in the cluster system according to the management request.
In a feasible design, the management request carries the access key ID of the second user and a first key, where the first key is generated by the second terminal device using a preset authentication mechanism; the processing unit 12 is configured to call the cluster Open API, generate a second key using the preset authentication mechanism, and determine the management authority of the second user if the first key and the second key are the same;

the sending unit 13 is further configured to send authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
In a feasible design, the cluster Open API includes a cluster management API, and the management request is used to request creation or deletion of a cluster;

or,

the cluster Open API includes a machine management API, and the management request is used to request that any one of the at least one computing node perform any of the following operations: going online, going offline, restart, reinstallation, repair, or shielding.
The apparatus provided by the embodiments of the present application can be used in the method executed by the control node in the foregoing embodiments; its implementation principles and technical effects are similar and are not repeated here.

FIG. 14 is a schematic structural diagram of a model training apparatus provided by an embodiment of the present application. The apparatus can be integrated in an electronic device or implemented by an electronic device, and the electronic device can be a terminal device, a server, or the like. As shown in FIG. 14, in this embodiment, the model training apparatus 200 may include:
接收单元21,用于接收控制节点发送的第二请求,所述第二请求是所述控制节点接收到应用程序接口API服务器发送的第一请求并为目标模型分配目标资源后发送的,所述第一请求是所述API服务器根据第一用户通过第一终端上的客户端发送的训练目标模型所需的资源信息得到的,所述目标节点包含于所述至少一个计算节点;The receiving unit 21 is configured to receive a second request sent by the control node. The second request is sent after the control node receives the first request sent by the application program interface API server and allocates target resources to the target model. The first request is obtained by the API server according to the resource information required for training the target model sent by the first user through the client on the first terminal, and the target node is included in the at least one computing node;
处理单元22,用于使用所述目标资源训练所述目标模型;The processing unit 22 is configured to use the target resource to train the target model;
发送单元23,用于将训练好的目标模型发送至存储节点。The sending unit 23 is used to send the trained target model to the storage node.
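Read together, the three units give the target computing node a receive-train-upload lifecycle for one job. The following is a schematic sketch under stated assumptions: the request fields, the training stub, and the storage-node client are all placeholders, not the patented implementation.

```python
from dataclasses import dataclass

@dataclass
class SecondRequest:
    """Hypothetical shape of the control node's dispatch message."""
    model_id: str
    gpus: int
    cpus: int

class TargetComputeNode:
    def __init__(self, storage):
        self.storage = storage  # stand-in storage-node client

    def handle(self, req: SecondRequest) -> None:
        # receiving unit 21 has delivered `req`; processing unit 22 trains,
        # sending unit 23 uploads the result to the storage node.
        model = self.train(req)
        self.storage[req.model_id] = model

    def train(self, req: SecondRequest) -> dict:
        # Stand-in for a real training loop pinned to the allocated
        # GPUs/CPUs of the target resources.
        return {"model_id": req.model_id, "trained_on_gpus": req.gpus}

storage_node = {}
node = TargetComputeNode(storage_node)
node.handle(SecondRequest(model_id="resnet50", gpus=8, cpus=32))
print(storage_node["resnet50"]["trained_on_gpus"])  # 8
```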
In one feasible design, the resource information includes at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
In one feasible design, the receiving unit 21 is further configured to receive a query request sent by the first terminal device, where the query request is used to request display of the usage status of the target resources on the target computing node while the target model is trained with the target resources;
the sending unit 23 is further configured to send a query response to the first terminal device, where the query response carries usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
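The query/response exchange amounts to the compute node reporting per-resource utilization back to the user's terminal for display. A minimal sketch follows; every field name and the fixed sample values are assumptions (a real node would sample `nvidia-smi`, procfs, or similar):

```python
def build_query_response(job_id: str) -> dict:
    """Hypothetical usage-status payload: the first terminal device renders
    this so the user can watch GPU/CPU consumption while training runs."""
    return {
        "job_id": job_id,
        "gpu_utilization": [0.97, 0.95, 0.96, 0.98],  # one entry per GPU
        "cpu_utilization": 0.62,
        "memory_gb_used": 48.5,
    }

resp = build_query_response("train-42")
print(len(resp["gpu_utilization"]))  # 4
```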
The apparatus provided in this embodiment of the present application can be used to perform the method executed by the target computing node in the foregoing embodiments. Its implementation principles and technical effects are similar and are not repeated here.
FIG. 15 is a block diagram of an electronic device for implementing the model training method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementations of the application described and/or claimed herein.
As shown in FIG. 15, the electronic device includes one or more processors 31, a memory 32, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as needed. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to an interface). In other implementations, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, with each device providing part of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system). One processor 31 is taken as an example in FIG. 15.
The memory 32 is the non-transitory computer-readable storage medium provided by this application. The memory stores instructions executable by at least one processor, so that the at least one processor performs the model training method provided by this application. The non-transitory computer-readable storage medium of this application stores computer instructions for causing a computer to perform the model training method provided by this application.
As a non-transitory computer-readable storage medium, the memory 32 may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the model training method in the embodiments of this application (for example, the receiving unit 11, the processing unit 12, and the sending unit 13 shown in FIG. 13, and the receiving unit 21, the processing unit 22, and the sending unit 23 shown in FIG. 14). By running the non-transitory software programs, instructions, and modules stored in the memory 32, the processor 31 performs the various functional applications and data processing of the server, that is, implements the model training method in the foregoing method embodiments.
The memory 32 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created according to the use of the electronic device for model training, and the like. In addition, the memory 32 may include a high-speed random access memory, and may further include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 32 may optionally include memories disposed remotely relative to the processor 31, and these remote memories may be connected to the electronic device for model training through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the model training method may further include an input device 33 and an output device 34. The processor 31, the memory 32, the input device 33, and the output device 34 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 15.
The input device 33 may receive input digital or character information and generate key signal inputs related to user settings and function control of the electronic device for model training, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a trackball, or a joystick. The output device 34 may include a display device, an auxiliary lighting device (for example, an LED), a haptic feedback device (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described herein may be realized in digital electronic circuitry, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and/or apparatus (for example, magnetic disks, optical disks, memories, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (for example, a CRT (cathode-ray tube) or LCD (liquid-crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, speech input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, as a data server), or that includes a middleware component (for example, an application server), or that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with an implementation of the systems and techniques described herein), or any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other.
An embodiment of this application further provides a cluster system, including a control node and at least one computing node, where the control node establishes a network connection with each of the at least one computing node based on the Transmission Control Protocol (TCP), and the computing resources of the computing node include at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
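The control-plane wiring described here, a control node holding one TCP connection per compute node, can be sketched with the standard library. The addresses, port, message framing, and acknowledgment format below are illustrative assumptions; a single `recv` suffices only because the message is short and sent over loopback.

```python
import socket
import threading

PORT = 53521  # arbitrary loopback port chosen for this sketch
ready = threading.Event()

def compute_node() -> None:
    """Minimal compute-node stub: accepts the control node's TCP
    connection and acknowledges one dispatch message."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("127.0.0.1", PORT))
        srv.listen(1)
        ready.set()  # signal the control node that we are listening
        conn, _ = srv.accept()
        with conn:
            msg = conn.recv(1024)
            conn.sendall(b"ACK:" + msg)

t = threading.Thread(target=compute_node, daemon=True)
t.start()
ready.wait(timeout=5)

# Control-node side: one TCP connection per compute node.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as ctl:
    ctl.connect(("127.0.0.1", PORT))
    ctl.sendall(b"train model resnet50")
    reply = ctl.recv(1024).decode()
t.join(timeout=5)
print(reply)  # ACK:train model resnet50
```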
According to the technical solutions provided by the embodiments of this application, introducing GPUs as computing resources greatly improves the hardware capability of the cluster system and thereby the efficiency of model training; on the software side, the Slurm framework is optimized, and a client, a super management platform, and the like are introduced, making the cluster system more convenient to use.
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in this application may be performed in parallel, sequentially, or in a different order, as long as the desired result of the technical solutions disclosed in this application can be achieved; no limitation is imposed herein.
The foregoing specific implementations do not constitute a limitation on the protection scope of this application. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this application shall be included in the protection scope of this application.

Claims (23)

  1. A cluster system, comprising: a control node, at least one computing node, and a storage node; wherein
    the control node establishes a connection with each of the at least one computing node and is configured to allocate computing resources for a task of training a target model;
    the computing node comprises at least one central processing unit (CPU) and at least one graphics processing unit (GPU) and is configured to train the target model by using the computing resources; and
    the storage node establishes a network connection with each of the at least one computing node and is configured to store data required for training the target model.
  2. The system according to claim 1, wherein
    any two of the at least one computing node establish a network connection with each other based on the InfiniBand interconnect technology, a CPU and a GPU inside a computing node are connected through Peripheral Component Interconnect Express (PCIe), and GPUs inside a computing node are connected through NVLink.
  3. A model training method, applicable to a cluster system comprising a control node, at least one computing node, and a storage node, the method comprising:
    receiving, by the control node, a first request sent by an application programming interface (API) server, the first request being obtained by the API server according to resource information, required for training a target model, that a first user sends through a client on a first terminal;
    allocating, by the control node, target resources to the target model according to the resource information; and
    sending, by the control node, a second request to a target computing node, so that the target computing node trains the target model by using the target resources.
  4. The method according to claim 3, wherein the resource information comprises at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
  5. The method according to claim 3 or 4, further comprising:
    receiving, by the control node, a management request sent by a second terminal device, the management request being used to request management of the computing nodes in the cluster system; and
    managing, by the control node, the computing nodes in the cluster system according to the management request.
  6. The method according to claim 5, wherein the managing, by the control node, the computing nodes in the cluster system according to the management request comprises:
    calling, by the control node, a cluster open application programming interface (Open API) to authenticate a second user; and
    if the second user passes the authentication, managing, by the control node, the computing nodes in the cluster system according to the management request.
  7. The method according to claim 6, wherein the management request carries an access key identifier of the second user and a first key, the first key being generated by the second terminal device using a preset authentication mechanism, and the calling, by the control node, the cluster Open API to authenticate the second user comprises:
    calling, by the control node, the cluster Open API, and generating a second key by using the preset authentication mechanism;
    if the first key and the second key are the same, determining, by the control node, the management authority of the second user; and
    sending, by the control node, authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
  8. The method according to claim 6, wherein
    the cluster Open API comprises a cluster management API, and the management request is used to request creation or deletion of a cluster;
    or
    the cluster Open API comprises a machine management API, and the management request is used to request that any one of the at least one computing node perform any one of the following operations: bringing online, taking offline, restarting, reinstalling, repairing, and isolating.
  9. A model training method, applicable to a cluster system comprising a control node, at least one computing node, and a storage node, the method comprising:
    receiving, by a target computing node, a second request sent by the control node, the second request being sent after the control node receives a first request sent by an application programming interface (API) server and allocates target resources to a target model, the first request being obtained by the API server according to resource information, required for training the target model, that a first user sends through a client on a first terminal, and the target computing node being included in the at least one computing node;
    training, by the target computing node, the target model by using the target resources; and
    sending, by the target computing node, the trained target model to the storage node.
  10. The method according to claim 9, wherein the resource information comprises at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
  11. The method according to claim 9 or 10, further comprising:
    receiving, by the target computing node, a query request sent by the first terminal device, the query request being used to request display of the usage status of the target resources on the target computing node while the target model is trained with the target resources; and
    sending, by the target computing node, a query response to the first terminal device, the query response carrying usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
  12. A model training apparatus, comprising:
    a receiving unit, configured to receive a first request sent by an application programming interface (API) server, the first request carrying resource information required for training a target model and being obtained by the API server according to the resource information, required for training the target model, that a first user sends through a client on a first terminal;
    a processing unit, configured to allocate target resources to the target model according to the resource information; and
    a sending unit, configured to send a second request to a target computing node, so that the target computing node trains the target model by using the target resources.
  13. The apparatus according to claim 12, wherein the resource information comprises at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
  14. The apparatus according to claim 12 or 13, wherein
    the receiving unit is further configured to receive a management request sent by a second terminal device, the management request being used to request management of computing nodes in a cluster system; and
    the processing unit is further configured to manage the computing nodes in the cluster system according to the management request.
  15. The apparatus according to claim 14, wherein
    when managing the computing nodes in the cluster system according to the management request, the processing unit calls a cluster open application programming interface (Open API) to authenticate a second user, and if the second user passes the authentication, manages the computing nodes in the cluster system according to the management request.
  16. The apparatus according to claim 15, wherein
    the management request carries an access key identifier of the second user and a first key, the first key being generated by the second terminal device using a preset authentication mechanism; the processing unit is configured to call the cluster Open API, generate a second key by using the preset authentication mechanism, and, if the first key and the second key are the same, determine the management authority of the second user; and
    the sending unit is further configured to send authority information to the second terminal device according to the management authority, so that the second terminal device displays the authority corresponding to the second user according to the authority information.
  17. The apparatus according to claim 15, wherein
    the cluster Open API comprises a cluster management API, and the management request is used to request creation or deletion of a cluster;
    or
    the cluster Open API comprises a machine management API, and the management request is used to request that any one of the at least one computing node perform any one of the following operations: bringing online, taking offline, restarting, reinstalling, repairing, and isolating.
  18. A model training apparatus, comprising:
    a receiving unit, configured to receive a second request sent by a control node, the second request being sent after the control node receives a first request sent by an application programming interface (API) server and allocates target resources to a target model, the first request being obtained by the API server according to resource information, required for training the target model, that a first user sends through a client on a first terminal, and the target node being included in at least one computing node;
    a processing unit, configured to train the target model by using the target resources; and
    a sending unit, configured to send the trained target model to a storage node.
  19. The apparatus according to claim 18, wherein the resource information comprises at least one of the following: the number of target computing nodes, the number of GPUs occupied when the target computing node trains the target model, and the number of CPUs occupied when the target computing node trains the target model.
  20. The apparatus according to claim 18 or 19, wherein
    the receiving unit is further configured to receive a query request sent by the first terminal device, the query request being used to request display of the usage status of the target resources on the target computing node while the target model is trained with the target resources; and
    the sending unit is further configured to send a query response to the first terminal device, the query response carrying usage status information of the target resources, so that the first terminal device displays the usage status of the target resources according to the usage status information.
  21. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 3-8, or to enable the at least one processor to perform the method according to any one of claims 9-11.
  22. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 3-8; or the computer instructions are used to cause the computer to perform the method according to any one of claims 9-11.
  23. A cluster system, comprising: a control node and at least one computing node, wherein
    the control node establishes a network connection with each of the at least one computing node based on the Transmission Control Protocol (TCP); and
    the computing resources of the computing node comprise at least one central processing unit (CPU) and at least one graphics processing unit (GPU).
PCT/CN2020/117723 2020-02-05 2020-09-25 Model training method and apparatus, and clustering system WO2021155667A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010080825.4A CN111327692A (en) 2020-02-05 2020-02-05 Model training method and device and cluster system
CN202010080825.4 2020-02-05

Publications (1)

Publication Number Publication Date
WO2021155667A1 true WO2021155667A1 (en) 2021-08-12

Family

ID=71172573

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/117723 WO2021155667A1 (en) 2020-02-05 2020-09-25 Model training method and apparatus, and clustering system

Country Status (2)

Country Link
CN (1) CN111327692A (en)
WO (1) WO2021155667A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111327692A (en) * 2020-02-05 2020-06-23 北京百度网讯科技有限公司 Model training method and device and cluster system
CN111984744B (en) * 2020-08-13 2021-03-19 北京陌陌信息技术有限公司 Information processing method based on remote communication and artificial intelligence and cloud service platform
CN112087506B (en) * 2020-09-01 2023-02-07 北京火山引擎科技有限公司 Cluster node management method and device and computer storage medium
CN112241321A (en) * 2020-09-24 2021-01-19 北京影谱科技股份有限公司 Computing power scheduling method and device based on Kubernetes
CN113033098B (en) * 2021-03-26 2022-05-17 山东科技大学 Ocean target detection deep learning model training method based on AdaRW algorithm
CN113159284A (en) * 2021-03-31 2021-07-23 华为技术有限公司 Model training method and device
CN114584455B (en) * 2022-03-04 2023-06-30 吉林大学 Small and medium-sized high-performance cluster monitoring system based on enterprise WeChat

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480717A (en) * 2017-08-16 2017-12-15 北京奇虎科技有限公司 Training job processing method and system, computing device, and computer-readable storage medium
CN107766148A (en) * 2017-08-31 2018-03-06 北京百度网讯科技有限公司 Heterogeneous cluster and task processing method and device
CN108564164A (en) * 2018-01-08 2018-09-21 中山大学 Parallelized deep learning method based on the Spark platform
US20180314926A1 (en) * 2017-04-28 2018-11-01 Intel Corporation Smart memory handling and data management for machine learning networks
CN109086134A (en) * 2018-07-19 2018-12-25 郑州云海信息技术有限公司 Method and device for running deep learning jobs
CN109409738A (en) * 2018-10-25 2019-03-01 平安科技(深圳)有限公司 Method and electronic device for deep learning based on a blockchain platform
CN110018817A (en) * 2018-01-05 2019-07-16 中兴通讯股份有限公司 Distributed data operation method and device, storage medium and processor
CN110413294A (en) * 2019-08-06 2019-11-05 中国工商银行股份有限公司 Service delivery system, method, apparatus and equipment
CN111327692A (en) * 2020-02-05 2020-06-23 北京百度网讯科技有限公司 Model training method and device and cluster system

Also Published As

Publication number Publication date
CN111327692A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
WO2021155667A1 (en) Model training method and apparatus, and clustering system
US11150952B2 (en) Accelerating and maintaining large-scale cloud deployment
JP7421511B2 (en) Methods and apparatus, electronic devices, readable storage media and computer programs for deploying applications
US9606824B2 (en) Administering virtual machines in a distributed computing environment
US10255097B2 (en) Administering virtual machines in a distributed computing environment
JP7170768B2 (en) Development machine operation task processing method, electronic device, computer readable storage medium and computer program
US9503515B2 (en) Administering virtual machines in a distributed computing environment
US9612857B2 (en) Administering virtual machines in a distributed computing environment
US9703587B2 (en) Administering virtual machines in a distributed computing environment
US8977752B2 (en) Event-based dynamic resource provisioning
WO2013135016A1 (en) Version construction system and method
CN114579250A (en) Method, device and storage medium for constructing virtual cluster
CN110019059B (en) Timing synchronization method and device
Li et al. Improving spark performance with zero-copy buffer management and RDMA
Liu et al. BSPCloud: A hybrid distributed-memory and shared-memory programming model
WO2021174791A1 (en) Task migration method and apparatus, and electronic device and storage medium
Zhou et al. Software-defined streaming-based code scheduling for transparent computing
CN115242786B (en) Multi-mode big data job scheduling system and method based on container cluster
Lascu et al. IBM zEnterprise EC12 technical guide
Li et al. Collaborative Management System Driven by Task Flow in Supercomputing Environment
Zou et al. Structural finite element method based on cloud computing
KR20210043523A (en) Data mining system, method, apparatus, electronic device and storage medium
CN117742891A (en) Virtual machine creation method, device and equipment with vDPA equipment and storage medium
CN118260036A (en) Method, system and medium for processing Flink operation
CN112783610A (en) Saltstack-based Ceph deployment host node

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20917276

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20917276

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15/03/2023)
